CN111831802B - Urban domain knowledge detection system and method based on LDA topic model - Google Patents


Info

Publication number
CN111831802B
CN111831802B (application CN202010497669.1A)
Authority
CN
China
Prior art keywords
data
module
user
word
server
Prior art date
Legal status
Active
Application number
CN202010497669.1A
Other languages
Chinese (zh)
Other versions
CN111831802A (en)
Inventor
盛浩 (Sheng Hao)
李东霖 (Li Donglin)
杨达 (Yang Da)
崔正龙 (Cui Zhenglong)
王思哲 (Wang Sizhe)
Current Assignee
Beihang University
Original Assignee
Beihang University
Priority date
Filing date
Publication date
Application filed by Beihang University filed Critical Beihang University
Priority to CN202010497669.1A priority Critical patent/CN111831802B/en
Publication of CN111831802A publication Critical patent/CN111831802A/en
Application granted granted Critical
Publication of CN111831802B publication Critical patent/CN111831802B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G06F16/332 — Information retrieval of unstructured textual data; query formulation
    • G06F16/35 — Information retrieval of unstructured textual data; clustering; classification
    • G06F40/242 — Natural language analysis; lexical tools; dictionaries
    • G06F40/284 — Natural language analysis; lexical analysis, e.g. tokenisation or collocates
    • G06F40/289 — Natural language analysis; phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/30 — Natural language data; semantic analysis
    • G06Q50/26 — Systems or methods specially adapted for specific business sectors; government or public services
    • Y02D10/00 — Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention relates to an urban domain knowledge detection system and method based on an LDA topic model, used to generate a domain knowledge report for a given domain of a smart city. The system adopts a client/server (C/S) architecture and a modular design, comprising an authentication and management module, a file uploading module, a data cleaning module, an emotion analysis module, a word segmentation and part-of-speech tagging module, a keyword extraction module, a clustering module, and a data transmission module. The method makes full use of the large-scale domain knowledge data generated during urban informatization and can automatically generate an urban domain knowledge report for the user, which can then be applied to fields such as traffic early warning and public opinion monitoring; the system is well designed, runs stably, requires no environment deployment on the user side, and has strong practical value. All modules can be updated or replaced, which facilitates future maintenance, improvement, and expansion. To address the personalized needs of different users, each module exposes customizable parameters or files, so the system meets both the general needs of ordinary users and the customized needs of professional users.

Description

Urban domain knowledge detection system and method based on LDA topic model
Technical Field
The invention relates to an urban domain knowledge detection system and method based on an LDA (Latent Dirichlet Allocation) topic model, and belongs to the field in which big data and natural language processing intersect.
Background
With improvements in hardware computing power and software algorithms, computers can now process massive amounts of data in a short time. In the cloud computing era it has become possible to monitor and manage cities with big data, a currently hot topic. On this basis the concept of the "smart city" was proposed: a high-level form of urban informatization achieved by fully applying novel information technologies across a city's locations, domains, and industries.
The exchange and sharing of information is a major activity in a smart city. Business data, traffic information, residents' social discussion, and the like within the city are all information, collectively called the city's domain knowledge. The domain knowledge model is the core concept describing the smart city's information system and covers all information about entities, services, events, and so on within the smart city.
Applying domain knowledge in a smart city is difficult because the domain knowledge data are large in volume, wide in source, and heterogeneous in content. Manual processing wastes time and labor, cannot guarantee correctness, and cannot keep up with the latest data. The invention therefore proposes an urban domain knowledge detection method based on an LDA topic model, designs a complete urban domain knowledge detection platform, uses natural language processing to handle the domain knowledge, removes low-quality information, and extracts key information to generate a domain knowledge report, providing users with accurate urban domain knowledge for further application in fields such as traffic early warning and public opinion monitoring.
The urban domain knowledge detection technology based on the LDA topic model takes urban big data as its data source and exploits the efficiency, accuracy, and low cost of computerized data processing to extract urban domain knowledge and generate an urban domain knowledge report, enabling users to obtain key information for use in downstream fields.
Disclosure of Invention
The technical problem solved by the invention: given the ever-growing volume of data in smart city construction, an urban domain knowledge detection system and method based on the LDA topic model are provided that process the data quickly and accurately, so that users can obtain key urban domain knowledge and apply it specifically in fields such as traffic early warning and public opinion monitoring.
The urban domain knowledge detection system based on an LDA topic model builds on the large-scale domain knowledge generated during urban informatization, extracts key information with natural language processing to generate an urban domain knowledge report, and proposes a keyword extraction algorithm based on the LDA topic model that, by combining the characteristics of urban domain knowledge with the structure of the Chinese language, can extract key phrases carrying urban domain knowledge information;
the system comprises a client and a server; the client hosts an authentication and management module, a file uploading module, and a data transmission module; the server hosts an authentication and management module, a data cleaning module, an emotion analysis module, a word segmentation and part-of-speech tagging module, a keyword extraction module, a clustering module, and a data transmission module, wherein:
client authentication and management module: first, acquires the user's identity information and interacts with the server-side authentication and management module to verify the user's legitimacy; only legal users are allowed into the operation interface for subsequent operations. Second, it manages the client's file uploading module and data transmission module and controls interaction with the server-side data transmission module;
client file uploading module: lets the user upload the data files to be processed to the client data transmission module, supporting the suffixes ".xls", ".xlsx", and ".csv" (".txt" files are converted to ".csv"); it includes a visual interface that makes the uploaded file contents visible, lets the user declare the meaning of each column, and supports up to 20 variables to satisfy the user's filtering requirements on different columns; it further includes a crawler sub-module that crawls urban domain knowledge from a given website, either to supplement the input file or to serve directly as input;
client data transmission module: encapsulates client data and sends it to the server-side data transmission module, and parses the information received from it; the client data include the user identity information from the authentication module, the data file from the file uploading module, and the user-defined parameters and dictionary files supported by each server-side module. Data transmission is based on TCP/IP to ensure reliability; on top of this, file transfer uses FTP to ensure transfer efficiency;
server-side authentication and management module: first, maintains a user database and verifies the identity of a client initiating a connection request; on success it returns a notification and allocates a workspace for the user. Second, it manages the server-side data cleaning, emotion analysis, word segmentation and part-of-speech tagging, keyword extraction, clustering, and data transmission modules, and controls interaction with the client-side data transmission module;
server-side data cleaning module: cleans the received raw data to improve data quality, comprising three sub-modules: duplicate data cleaning, valueless data cleaning, and special symbol cleaning. Duplicate data are records with identical characters or similar semantics; valueless data are records irrelevant to the domain knowledge the user wants to extract; special symbols are mojibake produced by mismatched encoding/decoding, or symbols irrelevant to the domain knowledge. The duplicate data cleaning sub-module receives user-set cleaning parameters and removes duplicates accordingly; the valueless data cleaning sub-module receives a user-supplied pattern string and cleans valueless data by pattern matching; the special symbol cleaning sub-module maintains a special symbol library of symbols common on the network and filters the data against it;
server-side word segmentation and part-of-speech tagging module: segments the cleaned data into words and tags each word's part of speech. It maintains a Chinese dictionary covering nearly all Chinese words with their part-of-speech annotations; each input record is scanned with a bidirectional maximum matching algorithm, segmented into words, and tagged with part-of-speech information. The module also accepts a user-defined segmentation dictionary, which replaces or supplements the default dictionary to meet the segmentation needs of different domains; the output is data after word segmentation and part-of-speech tagging;
server-side keyword extraction module: extracts keywords from the segmented and tagged data. A keyword extraction algorithm based on the LDA topic model filters by part of speech to screen candidate keywords, gives each candidate a preliminary weight with an inverse TF-IDF algorithm, weights the candidates with the LDA model, computes a weighted sum as each candidate's total weight, extracts the highest-weighted keywords, and generates key phrases. LDA stands for Latent Dirichlet Allocation; the LDA model is a topic model from which each record's word distribution and topic distribution are obtained, and weights are then computed via cosine similarity. The keyword extraction algorithm based on the LDA topic model is implemented as follows:
(1) Part-of-speech filtering: for the corpus after word segmentation and part-of-speech tagging, part-of-speech filtering is performed first to screen candidate keywords; only words whose part of speech is among those selected are retained;
(2) Preliminary weighting of keywords based on an inverse TF-IDF algorithm; an inverse TF-IDF algorithm is adopted to endow each candidate keyword with an initial weight, wherein the initial weight of the candidate keywords is as follows:
Weight(i,iTF-IDF)=tf(i)×iidf(i)×length(i)
in the above formula, tf(i) is the frequency of word i in the corpus; iidf(i) is the reciprocal of idf(i) and grows with how widely word i occurs across the corpus; length(i) is the length of word i; the initial weight is the product of the three;
(3) Keyword weighting based on the LDA topic model: the LDA model yields the topic distribution and word distribution of each record; the corpus from step (1) is fed into the trained LDA model to obtain each word's probability distribution over topics and each record's distribution over topics, and the correlations between words, and between a word and its record, are computed via cosine similarity. The LDA-based keyword weight is defined as:
Weight(i,LDA)=α×Σsim(i,j)+(1-α)×sim(i,p)
where sim(i, j) is the similarity between word i and word j, sim(i, p) is the similarity between word i and corpus record p, and the weighting coefficient α is set by the user, defaulting to 0.2;
(4) Final keyword weight calculation: the weight of each candidate keyword is defined as the weighted sum of its inverse-TF-IDF weight and its LDA-based weight, namely:
Weight(i)=λ×weight(i,iTF-IDF)+(1-λ)×weight(i,LDA)
where λ is a weighting coefficient; after the final weights are computed, a fixed number of keywords is extracted for each corpus record according to weight;
(5) Keyword expansion: the extracted keywords are expanded into key phrases. For the extracted keywords, the co-occurrence frequency between keywords in each record is computed, i.e., how often two keywords appear in the same text no more than 8 bytes apart; if the frequency reaches a threshold β and the pair matches the part-of-speech combination specified by the user, the keywords are expanded into a key phrase. β is set by the user and was chosen as 3 after extensive repeated tests.
Server-side clustering module: clusters similar key phrases. A word2vec model is trained in advance; all input key phrases are converted into word2vec word vectors, similarities between the vectors are computed, and semantically similar key phrases are clustered according to a user-set similarity threshold. After clustering, the word frequencies of the key phrases are counted and sorted, and an urban domain knowledge report is generated and returned to the client. Urban domain knowledge, i.e., the informatized data of each urban domain, includes traffic flow data, e-commerce review data, social network behavior data, and any collectible urban domain information of sufficient size.
Server-side data transmission module: encapsulates server-side data and sends it to the client; the server-side data include the processing information and intermediate files returned by the authentication and management, data cleaning, emotion analysis, word segmentation and part-of-speech tagging, and keyword extraction modules, and the final clustering result generated by the clustering module.
Server-side emotion analysis module (optional): applied to data cleaned by the server-side data cleaning module when the data are text and text emotion classification is required. The module performs emotion analysis on the text data and classifies it into positive and negative emotion, so the user can select data of either class for subsequent processing as needed. It maintains an emotion dictionary built on HowNet (a large Chinese language knowledge base), computes the emotion intensity of each input text with a weighted average algorithm, and classifies the text against a user-set threshold. It also accepts a user-uploaded custom emotion dictionary that replaces or supplements the default dictionary.
The method comprises the following specific steps:
(1) After the user opens the client, the authentication and management module prompts for a user name and password; on clicking "Confirm" the information is passed to the client data transmission module, which establishes a TCP connection with the server-side data transmission module and uploads the information. The server-side authentication module extracts the information and searches the user database; on success the client shows a login-success message and enters the operation interface, and the server allocates a workspace according to the user's rights; on failure the client shows "wrong user name or password, please try again" and reloads the authentication and management module.
(2) The user must supply a file containing a large amount of urban domain knowledge. The operation interface offers two options: use the built-in crawler to crawl the corresponding urban domain knowledge from a specified webpage, or upload a collected urban domain knowledge file directly from the local machine. The uploaded file reaches the server-side data cleaning module via the data transmission module for the next step.
(3) In the data cleaning stage, the system cleans the data according to the user's needs, filtering out as much junk data (data irrelevant to the domain or harmful to key information extraction) as possible while retaining the data the user expects. Cleaning comprises duplicate data cleaning, valueless data cleaning, and special symbol cleaning. Duplicate data cleaning removes records that are completely identical to reduce redundancy; the user may specify the scope, e.g., cleaning data that are identical within a selected period, or cleaning by data source. Valueless data are meaningless records, such as long runs of digits or letters in e-commerce user reviews; the user may specify a pattern string to clean them. Special symbol cleaning removes network-specific special characters and further reduces the data. If the cleaned data are text and text emotion classification is needed, they are passed to the emotion analysis module; otherwise they go to the word segmentation and part-of-speech tagging module.
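The three cleaning passes above can be sketched in a few lines. This is a minimal illustration only: the function name, the junk pattern, and the symbol set are assumptions, and the real module additionally handles semantically similar duplicates, user-scoped cleaning ranges, and a full special-symbol library.

```python
import re

def clean_corpus(records, junk_pattern=r"\d{6,}", symbols="■◆★�"):
    """Illustrative sketch of the three cleaning passes (names assumed):
    special-symbol stripping, exact-duplicate removal, and pattern-based
    valueless-data removal."""
    junk = re.compile(junk_pattern)  # user-supplied pattern string
    seen, cleaned = set(), []
    for text in records:
        text = "".join(ch for ch in text if ch not in symbols)  # symbol pass
        if text in seen:        # duplicate pass: drop identical records
            continue
        if junk.search(text):   # valueless pass: e.g. long digit runs
            continue
        seen.add(text)
        cleaned.append(text)
    return cleaned

sample = ["traffic jam on ring road■", "traffic jam on ring road",
          "promo code 20250101999", "bus line 3 delayed"]
```

Stripping the symbol from the first record makes it an exact duplicate of the second, and the long digit run marks the third as valueless, so only two records survive.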
(4) In the emotion analysis stage, the system classifies the data into positive and negative emotion by content: if a record expresses positive emotion, its polarity is positive; if it expresses negative emotion, its polarity is negative. This converts the task into the binary classification problem of text sentiment analysis. The system adopts the classical semantics-based emotion dictionary method: an emotion dictionary is built on the Chinese knowledge base HowNet, the emotion intensity of each text is computed with a weighted average algorithm, and polarity is judged against a set threshold. The analyzed text is split into positive and negative sub-blocks, and the user may select either block, or both, for further processing. This split serves different user needs: for product reviews on an e-commerce platform, positive reviews generally reflect a product's strengths and negative reviews its defects, so a user can choose to study the strong points to reinforce them, or the weak points to remedy them. The analyzed text is then passed to the word segmentation and part-of-speech tagging module.
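The dictionary-based weighted-average classification above can be sketched as follows. The lexicon entries and scores here are invented stand-ins; the actual module uses a much larger HowNet-derived dictionary with per-word intensities.

```python
# Toy emotion lexicon with intensity scores (invented for illustration);
# the real module derives these from HowNet.
LEXICON = {"smooth": 1.0, "convenient": 0.8, "clean": 0.6,
           "jam": -0.9, "delay": -0.7, "dirty": -0.8}

def classify_sentiment(tokens, threshold=0.0):
    """Weighted-average polarity: mean lexicon score of the tokens that hit
    the dictionary, split into positive/negative by a user-set threshold."""
    hits = [LEXICON[t] for t in tokens if t in LEXICON]
    score = sum(hits) / len(hits) if hits else 0.0
    return ("positive" if score >= threshold else "negative"), score
```

A review tokenized as ["the", "ride", "was", "smooth", "and", "convenient"] averages (1.0 + 0.8) / 2 = 0.9 and lands in the positive block.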
(5) In the word segmentation and part-of-speech tagging stage, each text sentence is cut into words, the basic expressive units of Chinese. The segmentation algorithm is string-matching based: the Chinese string to be analyzed is matched against a given Chinese dictionary according to a certain strategy, and a successful match yields a word. The system uses the bidirectional maximum matching method, scanning the string both left-to-right and right-to-left to maximize segmentation accuracy. The segmentation dictionary stores word and part-of-speech information, so parts of speech such as nouns and adjectives are tagged during segmentation. The segmented and tagged text is passed to the keyword extraction module.
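Bidirectional maximum matching can be sketched with a tiny dictionary. The tie-break used here (prefer the segmentation with fewer tokens, then fewer single-character tokens) is a common convention for this algorithm and an assumption, since the text does not spell out the rule.

```python
DICT = {"北京", "大学", "北京大学", "学生", "大学生", "前来", "应聘"}

def fmm(s, d, maxlen=4):
    """Forward maximum matching: greedily take the longest dictionary word
    from the left; an unmatched single character becomes its own token."""
    out, i = [], 0
    while i < len(s):
        for L in range(min(maxlen, len(s) - i), 0, -1):
            if s[i:i + L] in d or L == 1:
                out.append(s[i:i + L]); i += L; break
    return out

def bmm(s, d, maxlen=4):
    """Backward maximum matching: the same idea, scanning right to left."""
    out, j = [], len(s)
    while j > 0:
        for L in range(min(maxlen, j), 0, -1):
            if s[j - L:j] in d or L == 1:
                out.insert(0, s[j - L:j]); j -= L; break
    return out

def bimm(s, d):
    """Bidirectional: run both scans, keep the segmentation with fewer
    tokens, breaking ties by fewer single-character tokens."""
    f, b = fmm(s, d), bmm(s, d)
    if len(f) != len(b):
        return f if len(f) < len(b) else b
    return min(f, b, key=lambda seg: sum(len(w) == 1 for w in seg))
```

On the classic example "北京大学生前来应聘", the forward scan yields 北京大学 / 生 / 前来 / 应聘 while the backward scan yields 北京 / 大学生 / 前来 / 应聘; the tie-break prefers the backward result, which has no single-character token.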
(6) In the keyword extraction stage, the segmented corpus is fed into the keyword extraction module; a keyword extraction algorithm based on the LDA topic model computes each candidate keyword's weight, extracts a fixed number of keywords, and generates key phrases. The algorithm screens keywords by computing and comparing the weights of candidate keywords and consists of five parts:
(a) Part-of-speech filtering. The segmented and tagged corpus is input into the keyword extraction module, and part-of-speech filtering screens the candidate keywords first. Because the key parts of speech differ across domains, the user selects which parts of speech to keep. For example, in product quality evaluation on an e-commerce platform, the valuable candidates are nouns naming product components and adjectives describing performance and evaluation, so only nouns and adjectives in the corpus are retained after filtering.
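The filtering step is a simple tag check. The single-letter tags below ('n' noun, 'a' adjective, 'd' adverb, 'c' conjunction) follow a common Chinese NLP convention and, like the sample review, are illustrative assumptions.

```python
def filter_pos(tagged, keep=("n", "a")):
    """Keep only words whose part-of-speech tag is in the user-selected set;
    here nouns ('n') and adjectives ('a'), matching the e-commerce example."""
    return [w for w, p in tagged if p in keep]

# An invented review, already segmented and tagged.
tagged = [("屏幕", "n"), ("非常", "d"), ("清晰", "a"),
          ("但", "c"), ("电池", "n"), ("续航", "n"), ("差", "a")]
```

Applying the filter leaves only the nouns and adjectives as candidate keywords for the weighting steps that follow.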
(b) Preliminary keyword weighting based on an inverse TF-IDF algorithm. TF-IDF is a statistical method for evaluating how important a word is to a document in a corpus. For this patent's specific application an inverse TF-IDF algorithm is proposed, assigning each candidate keyword an initial weight:
Weight(i,iTF-IDF)=tf(i)×iidf(i)×length(i)
in the above formula, tf(i) is the frequency of word i in the corpus, iidf(i) is the reciprocal of idf(i) and measures how widely word i occurs across the corpus, and length(i) is the length of word i; the initial weight is the product of the three. Unlike the conventional TF-IDF algorithm, the more widely word i occurs in the corpus, the higher its initial weight.
(c) Keyword weighting based on the LDA topic model. The LDA topic model is a generative probabilistic model that treats each record in the corpus as a random mixture over a set of latent topics, so each record can be represented in terms of latent topics. The corpus from (a) is input into the trained LDA model to obtain each word's probability distribution over topics and each record's distribution over topics; the correlations between words, and between a word and its record, are computed via cosine similarity, so the LDA-based keyword weight is defined as:
Weight(i,LDA)=α×Σsim(i,j)+(1-α)×sim(i,p)
where sim(i, j) is the similarity between word i and word j and sim(i, p) is the similarity between word i and corpus record p; the weighting coefficient α can be set by the user, and 0.2 was selected after extensive repeated tests.
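The LDA-based weight can be sketched directly from the formula. The 3-topic probability vectors below are invented stand-ins for a trained LDA model's output (in practice a library such as gensim would supply the word-topic and document-topic distributions).

```python
import math

def cosine(u, v):
    """Cosine similarity between two topic-probability vectors."""
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return num / den if den else 0.0

def lda_weight(word, record_words, topic_vec, record_vec, alpha=0.2):
    """Weight(i, LDA) = alpha * sum_j sim(i, j) + (1 - alpha) * sim(i, p),
    with sim taken as cosine similarity over topic distributions."""
    s_words = sum(cosine(topic_vec[word], topic_vec[j])
                  for j in record_words if j != word)
    return alpha * s_words + (1 - alpha) * cosine(topic_vec[word], record_vec)

# Invented 3-topic distributions standing in for trained LDA output.
topic_vec = {"交通": [0.80, 0.10, 0.10],
             "拥堵": [0.70, 0.20, 0.10],
             "美食": [0.10, 0.10, 0.80]}
record_vec = [0.75, 0.15, 0.10]          # the record's own topic distribution
words = ["交通", "拥堵", "美食"]
```

In a traffic-dominated record, "交通" aligns with both the record's topic mix and its neighbor "拥堵", so it outweighs the off-topic "美食".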
(d) Final keyword weight calculation. The weight of each candidate keyword is defined as the weighted sum of its inverse-TF-IDF weight and its LDA-based weight, namely:
Weight(i) = λ×Weight(i, iTF-IDF) + (1−λ)×Weight(i, LDA)
where the weighting coefficient λ can be set by the user, and 0.15 was selected after repeated tests. After the final weights are computed, a fixed number of keywords is extracted for each corpus record according to weight.
(e) Keyword expansion. A phrase carries more information than a single word, so the extracted keywords are expanded into key phrases. The grammatical structure of key phrases differs across domains; for product reviews on an e-commerce platform, phrases carrying quality-evaluation information generally follow a "noun" + "adjective" structure. When processing e-commerce user reviews, therefore, the co-occurrence frequency between the keywords extracted in the previous step is computed for each record (i.e., how often two keywords appear in the same text no more than 8 bytes apart); if it reaches a threshold β and the pair forms a "noun" + "adjective" combination, the pair is expanded into a key phrase. β can be set by the user and defaults to 3.
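The expansion step can be sketched as a co-occurrence count with a distance cutoff. Character offsets stand in for the text's "8 bytes", and the sample reviews are invented; the real module applies whatever part-of-speech pattern the user specifies.

```python
from collections import Counter
from itertools import combinations

def expand_keywords(corpus, keywords, beta=3, max_dist=8, pattern=("n", "a")):
    """Count, across records, how often two extracted keywords co-occur within
    max_dist characters of each other; pairs reaching beta whose tags match
    the user-specified pattern (noun + adjective here) become key phrases."""
    pairs, pos_of = Counter(), {}
    for sent in corpus:                    # sent: list of (word, pos) pairs
        offset, positions = 0, []
        for w, p in sent:
            pos_of[w] = p
            if w in keywords:
                positions.append((w, offset))
            offset += len(w)
        for (w1, o1), (w2, o2) in combinations(positions, 2):
            if abs(o2 - o1) <= max_dist:
                pairs[(w1, w2)] += 1
    return [w1 + w2 for (w1, w2), c in pairs.items()
            if c >= beta and (pos_of[w1], pos_of[w2]) == pattern]

# Three invented reviews in which "屏幕" (noun) and "清晰" (adjective) co-occur.
reviews = [[("屏幕", "n"), ("很", "d"), ("清晰", "a")]] * 3
```

With β = 3, the noun–adjective pair co-occurring in all three reviews is promoted to the key phrase "屏幕清晰".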
This keyword extraction algorithm extracts keywords from text sentences well and generates key phrases that preserve the valuable information of the original text to the greatest extent; the generated key phrases are passed to the clustering module for the next step.
(7) In the clustering stage, all key phrases are converted into word2vec word vectors and pairwise similarities are computed; if a similarity exceeds the user-set threshold, the key phrases are clustered into the same group. After all key phrases are clustered, word frequencies are counted and sorted, and the generated urban domain knowledge report is returned to the client.
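The clustering stage can be sketched with a greedy single-pass scheme over phrase vectors. The 2-d vectors and counts below are invented stand-ins for word2vec output, and the greedy first-fit strategy is one simple reading of "cluster phrases whose similarity exceeds the threshold", not necessarily the patent's exact procedure.

```python
import math
from collections import Counter

def cosine(u, v):
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return num / den if den else 0.0

def cluster_phrases(phrase_counts, vec, threshold=0.8):
    """Greedy single-pass clustering: each phrase joins the first cluster whose
    representative is at least `threshold` similar, else starts a new cluster;
    clusters are then sorted by total frequency for the report."""
    clusters = []
    for p in phrase_counts:
        for c in clusters:
            if cosine(vec[p], vec[c[0]]) >= threshold:
                c.append(p)
                break
        else:
            clusters.append([p])
    return sorted(clusters, key=lambda c: -sum(phrase_counts[p] for p in c))

# Invented 2-d vectors standing in for word2vec phrase vectors.
vec = {"交通拥堵": [1.00, 0.05], "道路拥堵": [0.95, 0.10], "空气污染": [0.05, 1.00]}
counts = Counter({"交通拥堵": 5, "道路拥堵": 3, "空气污染": 4})
clusters = cluster_phrases(counts, vec)
```

The two near-parallel congestion phrases merge into one cluster while the pollution phrase stands alone, and the congestion cluster leads the report with a combined frequency of 8.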
Compared with the prior art, the invention has the advantages that:
(1) Building on the large-scale domain knowledge generated during urban informatization, the invention extracts key information with natural language processing, generates an urban domain knowledge report, and designs an interactive urban domain knowledge detection system, providing an efficient and accurate automated platform that saves labor and material resources compared with traditional methods.
(2) The invention provides a keyword extraction algorithm based on an LDA topic model, which can extract key phrases containing urban domain knowledge information by combining the characteristics of urban domain knowledge and the structure of Chinese language.
(3) The invention adopts a C/S architecture, and a large amount of data processing work is transmitted to a high-performance server for execution, so that a user can complete quality assessment without configuring a working environment, and the transmission between the client and the server is based on TCP/IP protocol, thereby ensuring the reliability.
(4) Each module in the data processing flow provides the user with customizable options; the user can configure parameters and files according to their own needs and the characteristics of the processed data to meet personalized requirements.
(5) The invention adopts a modular design philosophy, making system maintenance and function expansion very convenient: maintaining one module does not affect the work of other modules, and adding new functions does not affect the use of existing functions.
(6) Because the main operations of the invention are deployed on the server, hot updates are supported, and processing accuracy can continuously improve with technology iteration without any active operation by the user.
Drawings
FIG. 1 is a block diagram of a city domain knowledge detection platform based on an LDA topic model;
FIG. 2 is a block diagram of a data cleansing module according to the present invention;
FIG. 3 is a block diagram of an emotion analysis module of the present invention;
FIG. 4 is a block diagram of a word segmentation and part-of-speech tagging module of the present invention;
fig. 5 is a flowchart of the keyword extraction module algorithm of the present invention.
Detailed Description
The following is a further description of embodiments of the invention, taken in conjunction with the accompanying drawings:
As shown in FIG. 1, the structure of the urban domain knowledge detection platform based on the LDA topic model is illustrated. The left side is the client structure diagram, consisting of a UI layer, an authentication and management module, a file uploading module and a data transmission module. The UI layer provides visualization so that the user can operate the system directly. The authentication and management module is responsible both for authenticating user identity and for managing the client modules, realizing interaction with the data transmission module. The interaction information includes user identity authentication information, user-defined parameters and files provided by the user for each working module, result information returned by the server side, and so on. The file uploading module opens an interface for the user to upload files of specified types for processing; a crawler sub-module is also attached to it and can serve as a supplement to the input. The data transmission module carries the interaction with the server side, adopting the TCP/IP protocol to ensure transmission reliability. The right side is the server-side structure diagram, consisting of a UI layer, an authentication and management module, a data cleaning module, an emotion analysis module, a word segmentation and part-of-speech tagging module, a keyword extraction module, a clustering module and a data transmission module. The UI layer provides visualized operations for the server administrator. The authentication and management module interacts with its client counterpart to jointly complete user identity authentication, and is responsible for managing the server modules and realizing interaction with the data transmission module.
The data cleaning module is responsible for cleaning the original data, removing junk data and information, and improving the accuracy and the credibility of a final result. The emotion analysis module is responsible for carrying out text emotion analysis on text data to be classified, classifying the text data into positive emotion or negative emotion, and meeting the detection requirements of different field knowledge. The word segmentation and part of speech tagging module is responsible for segmenting text sentences into words and tagging the parts of speech thereof to generate corpus extracted by key words. The keyword extraction module provides a keyword extraction algorithm based on the LDA topic model and is responsible for extracting keywords from the corpus to generate domain knowledge information. The clustering module is responsible for clustering and optimizing the domain knowledge information and generating a final urban domain knowledge report. The data transmission module is responsible for interworking with the client. The system adopts a C/S architecture, a user can complete tasks by using a high-performance server without configuring an environment, and transmission reliability is ensured by adopting TCP/IP. Each module provides a user-defined option, and a user can configure each parameter according to the needs. The system adopts a modularized design idea, each module can independently operate, and simultaneously supports hot update, so that technical iteration can be realized without manual operation of a user.
Fig. 2 is a block diagram of the data cleaning module of the present invention. For the input data, the data cleaning module first cleans repeated data; in this step it receives the cleaning parameters set by the user and removes duplicates at the user-defined cleaning strength. The data is then imported into the valueless-data cleaning sub-module and further cleaned according to the received pattern string. The cleaned data is next imported into the special-symbol cleaning sub-module, where it is cleaned according to the special symbol library maintained by the module; the user can also import a self-defined special symbol library for personalized cleaning. The data after the cleaning operations is exported as input for the next operation, and can also be returned to the client for viewing the results.
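The three cleaning stages can be sketched as follows; the symbol library and the digit-run pattern are illustrative assumptions standing in for the module's maintained library and the user-supplied pattern string:

```python
import re

# stand-in for the module's maintained special-symbol library
SPECIAL_SYMBOLS = re.compile(r"[■◆★☆]+")

def clean(records, pattern=r"\d{6,}"):
    """Three-stage cleaning: exact-duplicate removal, pattern-based
    removal of valueless records, then special-symbol stripping."""
    seen, out = set(), []
    junk = re.compile(pattern)                    # user-supplied pattern string
    for rec in records:
        if rec in seen:                           # stage 1: repeated data
            continue
        seen.add(rec)
        if junk.search(rec):                      # stage 2: valueless data
            continue
        out.append(SPECIAL_SYMBOLS.sub("", rec))  # stage 3: special symbols
    return out

cleaned = clean(["good road★", "good road★", "call 1234567", "ok"])
```

A real deployment would also handle semantic near-duplicates at the user-defined cleaning strength, which this exact-match sketch omits.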
As shown in fig. 3, the structure of the emotion analysis module of the present invention is shown. For input text data, the emotion analysis module calculates the emotion intensity of each piece of data with a weighted average algorithm over an emotion dictionary, then classifies each piece as positive or negative according to the received emotion threshold and outputs it by class. The user can also construct a personalized emotion dictionary according to their own needs and upload it to the server side to realize personalized text emotion analysis. The data after emotion analysis is exported as input for the next operation, and can also be returned to the client for viewing the results.
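One plausible reading of the weighted-average scoring and threshold classification is sketched below; the tiny lexicon and equal per-word weights are assumptions, since the patent specifies neither the dictionary contents nor the exact weighting scheme:

```python
# toy sentiment lexicon: word -> polarity strength (an assumption;
# the real module loads the dictionary maintained on the server)
LEXICON = {"smooth": 2.0, "convenient": 1.5, "congested": -2.0, "dirty": -1.5}

def sentiment(words, weights=None, threshold=0.0):
    """Weighted average of lexicon scores over the words that appear
    in the lexicon; 'positive' if the intensity clears the user-set
    threshold, otherwise 'negative'."""
    weights = weights or [1.0] * len(words)
    scored = [(LEXICON[w], wt) for w, wt in zip(words, weights) if w in LEXICON]
    if not scored:
        return 0.0, "negative"
    total_w = sum(wt for _, wt in scored)
    score = sum(s * wt for s, wt in scored) / total_w
    return score, ("positive" if score > threshold else "negative")

score, label = sentiment(["the", "road", "is", "smooth", "convenient"])
```

Uploading a personalized dictionary would simply replace or extend `LEXICON` before scoring.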
As shown in FIG. 4, a block diagram of the word segmentation and part-of-speech tagging module of the present invention is shown. For input data, the module segments each sentence into words according to a word segmentation dictionary and the bi-directional maximum matching method, and tags each word with its part of speech from the dictionary. The user can also import a custom word segmentation dictionary to supplement or replace the default dictionary, for example adding special terms of a specific field, to meet the word segmentation requirements of data in different fields. The data after word segmentation and part-of-speech tagging is exported as input for the next operation, and can also be returned to the client for viewing the results.
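A sketch of bi-directional maximum matching, using Latin letters to stand in for Chinese characters; the tie-breaking heuristic (fewer words, then fewer single-character tokens) is a common convention and an assumption here, as the patent only specifies scanning in both directions:

```python
def max_match(text, lexicon, max_len=5, reverse=False):
    """Greedy maximum matching; characters not covered by the
    lexicon fall back to single-character tokens."""
    s = text[::-1] if reverse else text
    words, i, n = [], 0, len(s)
    while i < n:
        for size in range(min(max_len, n - i), 0, -1):
            chunk = s[i:i + size]
            cand = chunk[::-1] if reverse else chunk
            if cand in lexicon or size == 1:
                words.append(cand)
                i += size
                break
    return words[::-1] if reverse else words

def bidirectional_max_match(text, lexicon, max_len=5):
    """Run forward and backward passes and keep the segmentation
    with fewer words; on a tie, prefer fewer single-char tokens."""
    fwd = max_match(text, lexicon, max_len)
    bwd = max_match(text, lexicon, max_len, reverse=True)
    if len(fwd) != len(bwd):
        return min((fwd, bwd), key=len)
    singles = lambda seg: sum(len(w) == 1 for w in seg)
    return bwd if singles(bwd) < singles(fwd) else fwd

seg = bidirectional_max_match("thetable", {"the", "table", "he", "tab"})
```

A user-supplied dictionary (with part-of-speech annotations) would be merged into `lexicon` to cover field-specific terms.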
Fig. 5 is a flowchart of the keyword extraction module algorithm according to the present invention. The method comprises the following specific steps:
(1) Perform part-of-speech filtering on the word-segmented corpus: filter the input data according to the part-of-speech filtering rules set by the user, i.e., traverse each segmented word and check its part of speech; if it matches a part of speech specified by the user it is kept, otherwise it is discarded.
(2) Reading corpus, and calculating weight values of candidate keywords based on an inverse TF-IDF algorithm; for each word segmentation corpus subjected to part-of-speech filtering, calculating an initial weight value of each word segmentation according to an inverse TF-IDF algorithm, wherein the initial weight value is as follows:
Weight(i,iTF-IDF)=tf(i)×iidf(i)×length(i)
In the above formula, tf(i) represents the frequency of word i within the current corpus entry, iidf(i) is the reciprocal of idf(i) and measures how frequently word i occurs across the corpus, and length(i) is the length of word i; the initial weight value is the product of the three. Unlike the conventional TF-IDF algorithm, the higher the frequency of occurrence of word i across the corpus, the higher the initial weight value.
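The inverse weighting can be sketched as follows. The smoothed form idf = log(N/df) + 1 is an assumption (the patent only states that iidf is the reciprocal of idf), chosen so the reciprocal is always defined:

```python
import math
from collections import Counter

def inverse_tfidf(doc, corpus):
    """Per-word initial weight for one corpus entry:
    Weight(i) = tf(i) * iidf(i) * length(i), with iidf = 1/idf.
    idf uses the common log(N/df)+1 smoothing (an assumption)."""
    n = len(corpus)
    df = Counter()                      # document frequency per word
    for d in corpus:
        df.update(set(d))
    tf = Counter(doc)
    total = len(doc)
    out = {}
    for w, c in tf.items():
        idf = math.log(n / df[w]) + 1.0
        out[w] = (c / total) * (1.0 / idf) * len(w)
    return out

corpus = [["road", "congestion", "road"], ["road", "repair"], ["park"]]
w = inverse_tfidf(corpus[0], corpus)
```

Note that "road", appearing in two of the three entries, outscores the rarer "congestion" despite being a shorter word, which is exactly the inversion of conventional TF-IDF described above.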
(3) Reading corpus, and calculating weight values of candidate keywords based on the LDA topic model; the LDA model can acquire the topic distribution and the word distribution of each piece of data, and for each word segmentation corpus subjected to part-of-speech filtering, the LDA weight value of each word segmentation is calculated based on the LDA topic model, and the LDA weight value is as follows:
Weight(i,LDA)=α×Σsim(i,j)+(1-α)×sim(i,p)
wherein sim(i, j) denotes the correlation between word i and word j, sim(i, p) denotes the correlation between word i and corpus entry p, and the weighting coefficient α can be set by the user, defaulting to 0.2. The LDA topic model is a generative probabilistic model that treats each entry in the corpus as a random mixture over a set of latent topics, so each entry can be represented as a distribution over latent topics. The corpus from step (1) is input into the trained LDA model to obtain the probability distribution of each word over the topics and of each corpus entry over the topics, and the correlations between words and between a word and a corpus entry are calculated via cosine similarity.
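A sketch of the LDA weight formula with cosine similarities over topic distributions. The fixed two-topic distributions are toy stand-ins for the outputs of a trained LDA model (e.g. from gensim), which the patent assumes is trained in advance:

```python
import math

def cos(u, v):
    """Cosine similarity between two topic distributions."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

def lda_weight(word, doc_words, word_topics, doc_topics, alpha=0.2):
    """Weight(i, LDA) = alpha * sum_j sim(i, j) + (1 - alpha) * sim(i, p),
    where sim is cosine similarity over topic distributions."""
    sim_words = sum(cos(word_topics[word], word_topics[j])
                    for j in doc_words if j != word)
    sim_doc = cos(word_topics[word], doc_topics)
    return alpha * sim_words + (1 - alpha) * sim_doc

# toy topic distributions standing in for a trained LDA model's output
word_topics = {
    "traffic": [0.8, 0.2],
    "road":    [0.7, 0.3],
    "park":    [0.1, 0.9],
}
doc_topics = [0.75, 0.25]
w_traffic = lda_weight("traffic", ["traffic", "road", "park"],
                       word_topics, doc_topics)
```

Words whose topic profile matches both the entry's other words and the entry itself score higher, which is what lets topically central words win in step (4).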
(4) Calculating a weighted weight value and extracting keywords; for each word corpus subjected to part-of-speech filtering, calculating a final weight value of each word, wherein the final weight value is defined as a weighted sum of a keyword weight value based on an inverse TF-IDF algorithm and a keyword weight value based on an LDA topic model, namely:
Weight(i)=λ×weight(i,iTF-IDF)+(1-λ)×weight(i,LDA)
wherein weight(i, iTF-IDF) is obtained in step (2) and weight(i, LDA) in step (3); the weighting coefficient λ can be set by the user, defaulting to 0.15. After the final weight values are calculated, the highest-weighted keywords are extracted for each corpus entry in descending order of weight.
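The weighted combination and top-k extraction reduce to a few lines; the weight values below are illustrative numbers, not outputs of the earlier steps:

```python
def final_weights(itfidf, lda, lam=0.15):
    """Weight(i) = λ·weight(i, iTF-IDF) + (1-λ)·weight(i, LDA)."""
    return {w: lam * itfidf[w] + (1 - lam) * lda[w] for w in itfidf}

def top_keywords(weights, k=2):
    """Keywords in descending order of final weight."""
    return sorted(weights, key=weights.get, reverse=True)[:k]

# illustrative per-word weights from steps (2) and (3)
itfidf = {"traffic": 1.9, "road": 1.4, "park": 0.6}
lda    = {"traffic": 1.06, "road": 0.95, "park": 0.50}
fw = final_weights(itfidf, lda)
kw = top_keywords(fw)
```

With the default λ = 0.15, the LDA term dominates, so the combination mostly reranks the topically relevant words rather than the merely frequent ones.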
(5) Calculate the co-occurrence frequency among keywords. A phrase can carry more information than a single word, so the extracted keywords are expanded into key phrases. Since the grammatical structure of key phrases differs across fields, the user designates an appropriate grammatical structure as needed; the algorithm calculates the co-occurrence frequency between keywords (i.e., the frequency with which two keywords appear in the same text at a distance of no more than 8 bytes), and if it reaches a threshold β and the pair matches the designated grammatical structure, the keywords are expanded into a key phrase. The threshold β can be set by the user and defaults to 3 according to test performance.
This extraction method covers the grammatical characteristics of most Chinese domain knowledge; applying the algorithm to domain knowledge sentences extracts most of the key phrases containing domain knowledge information.
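The expansion step can be sketched as below, with English tokens standing in for Chinese segments and the 8-byte distance approximated by character offsets; the noun + adjective pattern is the e-commerce example from the description, and the part-of-speech tags are supplied as a plain dict for illustration:

```python
from collections import Counter
from itertools import combinations

def expand_keyphrases(docs, keywords, pos, beta=3, window=8):
    """Promote a keyword pair to a key phrase when it co-occurs at
    least `beta` times within `window` characters in the same text
    and matches the noun + adjective pattern."""
    cooc = Counter()
    for doc in docs:
        positions = {w: doc.index(w) for w in keywords if w in doc}
        for a, b in combinations(sorted(positions), 2):
            if abs(positions[a] - positions[b]) <= window:
                cooc[(a, b)] += 1
    return [pair for pair, count in cooc.items()
            if count >= beta and {pos[pair[0]], pos[pair[1]]} == {"n", "adj"}]

docs = ["road smooth today", "road smooth again", "road smooth indeed",
        "park far away road"]
phrases = expand_keyphrases(docs, {"road", "smooth", "park"},
                            {"road": "n", "smooth": "adj", "park": "n"})
```

In "park far away road" the two keywords sit more than 8 characters apart, so that pair never accrues co-occurrences, while "road" + "smooth" clears the default β = 3.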
In summary, the invention provides an urban domain knowledge detection method based on the LDA topic model, oriented to the cross-application field of big data and natural language processing. It adopts a modularized system design method with a C/S architecture, comprising an authentication and management module, a file uploading module, a data cleaning module, an emotion analysis module, a word segmentation and part-of-speech tagging module, a keyword extraction module, a clustering module and a data transmission module. The modules are all updateable and replaceable and can be adapted for future maintenance, improvement and expansion as required. For the personalized needs of different users, each module provides customizable parameters or files, so the system meets both the general needs of common users and the customized needs of professional users.
What is not described in detail in the present specification belongs to the prior art known to those skilled in the art.
While the invention has been described in connection with what is presently considered to be the most practical and preferred embodiment, it is to be understood that the invention is not limited to the disclosed embodiment, but on the contrary, is intended to cover various modifications and alternative arrangements included within the spirit and scope of the appended claims.

Claims (4)

1. An urban domain knowledge detection system based on an LDA topic model, comprising: a client and a server connected through a network; the client comprises an authentication and management module, a file uploading module and a data transmission module; the server comprises an authentication and management module, a data cleaning module, an emotion analysis module, a word segmentation and part-of-speech tagging module, a keyword extraction module, a clustering module and a data transmission module, wherein:
client authentication and management module: firstly, acquiring user identity information and interacting with a server authentication and management module to verify the legitimacy of the user identity, wherein only legal users are allowed to enter an operation interface to perform subsequent operations; secondly, a file uploading module and a data transmission module of the client are managed, and interaction with the data transmission module of the server is controlled;
and a client file uploading module: the data transmission module is used for uploading the data file to be processed to the client data transmission module by the user; the client file uploading module comprises a visual interface to enable the uploaded file content to be visible; the client file uploading module further comprises a crawling sub-module for crawling urban domain knowledge information of the corresponding website, and the crawling sub-module is used as supplement of an input file or directly used as input;
Client data transmission module: the server-side data transmission module is used for encapsulating the client-side data information and sending the client-side data information to the server-side, and analyzing the information received from the server-side data transmission module; the client data information comprises user identity information of an identity verification module, a data file of a file uploading module and various parameters and dictionary files which are user-defined and supported by each module of a server, and the data transmission is based on TCP/IP so as to ensure the reliability of the transmission; the file transmission is based on FTP to ensure the transmission efficiency;
server authentication and management module: firstly, maintaining a user database, verifying the user identity of a client side initiating a connection request, returning notification information and distributing a working space for the user if the verification is successful; secondly, a data cleaning module, an emotion analysis module, a word segmentation and part-of-speech tagging module, a keyword extraction module, a clustering module and a data transmission module at a server side are managed, and interaction with the data transmission module at a client side is controlled;
server-side data cleaning module: the method comprises the steps of performing data cleaning on received original data to obtain cleaned data so as to improve data quality; the server-side data cleaning module comprises three sub-modules: the system comprises a repeated data cleaning module, a non-value data cleaning module and a special symbol cleaning module; the repeated data are data with the same characters or similar semantics, the worthless data are data irrelevant to the domain knowledge to be extracted by a user, and the special symbols are messy codes generated by different encoding or decoding modes or special symbols irrelevant to the domain knowledge; the repeated data cleaning module receives cleaning parameters set by a user and cleans the repeated data according to the user requirement; the non-value data cleaning module receives a pattern character string input by a user and cleans the non-value data according to a matching rule; the special symbol cleaning module maintains a special symbol library which contains various special symbols commonly used on a network, and performs stronger filtering on data according to the special symbol library;
The server-side word segmentation and part-of-speech tagging module: the method comprises the steps of performing word segmentation and part-of-speech tagging on cleaned data, wherein the data is segmented into words and tagged with the part-of-speech; the server-side word segmentation and part-of-speech tagging module maintains a Chinese dictionary, the Chinese dictionary contains almost all Chinese words and part-of-speech annotations of the Chinese words, each piece of data input into the server-side word segmentation and part-of-speech tagging module is scanned according to a bidirectional maximum matching algorithm, and each piece of data is segmented into words and tagged with part-of-speech information; the server-side word segmentation and part-of-speech tagging module also receives a user-defined word segmentation dictionary uploaded by a user, and replaces or supplements a default dictionary to carry out word segmentation and part-of-speech tagging so as to meet the word segmentation requirements of different fields; finally, data after word segmentation and part-of-speech tagging are obtained;
the server-side keyword extraction module: extracting keywords from the data after word segmentation and part-of-speech tagging; filtering parts of speech through a keyword extraction algorithm based on an LDA topic model, screening candidate keywords, carrying out preliminary weighting on the candidate keywords through an inverse TF-IDF algorithm, carrying out weighting on the candidate keywords based on the LDA model, calculating a weighted weight as the total weight of each candidate keyword, extracting the keywords with the highest weight and generating key phrases; the LDA model is a topic model, the word distribution and topic distribution of each piece of data are obtained through the LDA model, and then the weight is obtained through calculating cosine similarity;
Server-side clustering module: for clustering similar key phrases; training a word2vec model in advance during clustering, converting all input key phrases into word2vec word vectors, calculating similarity between the word vectors, and clustering semantically similar key phrases according to a similarity threshold set by a user; after clustering is completed, counting word frequencies of the key phrases and sequencing, generating a city domain knowledge report and returning the city domain knowledge report to the client; the urban domain knowledge, namely urban domain informationized data, comprises traffic flow data, electric business comment data, social network behavior data and all the urban domain information which has a certain data volume and can be collected;
server-side data transmission module: the server-side data information comprises the authentication and management module, the data cleaning module, the emotion analysis module, the word segmentation and part of speech tagging module, the processing information returned by the keyword extraction module and the generated intermediate file, and the final clustering result generated by the clustering module.
2. An LDA topic model-based urban domain knowledge detection system as claimed in claim 1 wherein said system further comprises: the server side emotion analysis module; for the data cleaned by the server-side data cleaning module, if the data is text data and the text emotion classification is required, a server-side emotion analysis module is adopted;
the server-side emotion analysis module is used for carrying out emotion analysis on the text data, classifying the text data into two types of positive emotion and negative emotion, and facilitating a user to select different types of data for subsequent operation according to self needs; the emotion analysis module maintains an emotion dictionary constructed based on a language knowledge base, calculates the emotion intensity of each piece of text data input into the emotion analysis module according to a weighted average algorithm, and classifies the text data according to a received threshold set by a user; the emotion analysis module also receives the custom emotion dictionary uploaded by the user to replace or supplement the default dictionary for emotion analysis.
3. An LDA topic model-based urban domain knowledge detection system as claimed in claim 1 or 2 wherein: the keyword extraction algorithm based on the LDA topic model in the server-side keyword extraction module is realized as follows:
(1) Part-of-speech filtering; for the corpus subjected to word segmentation and part-of-speech tagging, part-of-speech filtering is carried out firstly, candidate keywords are screened out, and only words with part-of-speech in the corpus are reserved after part-of-speech filtering;
(2) Preliminary weighting of keywords based on an inverse TF-IDF algorithm; an inverse TF-IDF algorithm is adopted to endow each candidate keyword with an initial weight, wherein the initial weight of the candidate keywords is as follows:
Weight(i,iTF-IDF)=tf(i)×iidf(i)×length(i)
in the above formula, tf (i) represents the occurrence frequency of the word i in the corpus; iidf (i) is the reciprocal of idf (i), and measures the frequency of occurrence of word i in the corpus; length (i) is the length of word i; the initial weight value is the product of the three;
(3) Based on keyword weighting of an LDA topic model, the LDA model acquires topic distribution and word distribution of each piece of data, the corpus in the step (1) is input into the trained LDA model to obtain probability distribution of each word on the topic and probability distribution of each corpus on the topic, and then the relevance between the words and the corpus are calculated through cosine similarity; keyword weights based on the LDA topic model are defined as:
Weight(i,LDA)=α×Σsim(i,j)+(1-α)×sim(i,p)
wherein sim(i, j) represents the correlation between word i and word j, sim(i, p) represents the correlation between word i and corpus entry p, and the weighting coefficient α is set by the user;
(4) And (5) calculating final weights of keywords: finally, the weight of each candidate keyword is defined as the weighted sum of the keyword weight value based on the inverse TF-IDF algorithm and the keyword weight value based on the LDA topic model, namely:
Weight(i)=λ×weight(i,iTF-IDF)+(1-λ)×weight(i,LDA)
wherein λ is a weighting coefficient; after calculating a final weight value, extracting a specific keyword for each corpus according to the size of the weight value;
(5) Keyword expansion: expanding the extracted keywords to generate key phrases by calculating the co-occurrence frequency among the keywords in each corpus entry, namely the frequency with which two keywords appear in the same text at a distance of no more than 8 bytes; if this reaches a threshold β and the pair matches the part-of-speech combination specified by the user, the keywords are expanded into key phrases, wherein β is set by the user and defaults to 3 according to test performance.
4. The urban domain knowledge detection method based on the LDA topic model is characterized by comprising the following steps of:
(1) After a user opens the client, the authentication and management module prompts for a user name and password; after the user clicks 'confirm', the information is passed to the client data transmission module, which establishes a TCP connection with the server data transmission module and uploads the information to the server; the identity authentication module at the server extracts the information from the server data transmission module and searches the user database; if the search succeeds, the client returns the prompt message 'login successful', enters the operation interface, and the server allocates a working space according to the user's rights; if the search fails, the client returns the prompt message 'user name or password incorrect, please try again', and the authentication and management module is reloaded;
(2) The user inputs a file containing knowledge information of the urban area to be detected, and two choices are provided for the operation interface: firstly, crawling the corresponding urban domain knowledge from a specified webpage by using a built-in crawler program, secondly, directly uploading the collected urban domain knowledge from the local, and enabling the uploaded file to enter a data cleaning stage reaching a server through a data transmission module;
(3) In the data cleaning stage, the data is cleaned according to the user's needs: junk data, i.e., data irrelevant to the field or interfering with the extraction of key information, is filtered out, and the data the user expects is retained; the stage specifically comprises repeated data cleaning, valueless data cleaning and special symbol cleaning; repeated data cleaning removes fully identical data to reduce redundancy; valueless data is meaningless data, such as runs of digits or letters appearing in e-commerce user comments, which the user removes by designating corresponding pattern strings; special symbol cleaning removes network-specific characters to further simplify the data; if the cleaned data is text data and text emotion classification is required, it is imported into the emotion analysis stage for the next operation, otherwise into the word segmentation and part-of-speech tagging stage;
(4) In the emotion analysis stage, the system classifies the data into positive and negative emotion according to its content, on the following basis: if a piece of data expresses positive emotion, its emotional tendency is positive; if it expresses negative emotion, its emotional tendency is negative; this converts the task into a binary classification problem in the field of text emotion analysis; a semantics-based emotion dictionary method is adopted, in which an emotion dictionary is built on a language knowledge base, the emotion intensity of each piece of text data is calculated with a weighted average algorithm, and its emotional tendency is judged against a set threshold; the analyzed text data is divided into positive and negative sub-blocks, and the user decides which block to process further, or may select all of it, meeting different user requirements; the text data after emotion analysis is imported into the word segmentation and part-of-speech tagging stage;
(5) In the word segmentation and part-of-speech tagging stage, each text sentence is segmented into the basic expression units of Chinese, namely words; the word segmentation algorithm is a string-matching method, i.e., the Chinese character string to be analyzed is matched against a given Chinese dictionary according to a certain strategy, and a word is segmented out whenever a match succeeds; the matching method adopted is the bi-directional maximum matching method, which scans the string twice, from left to right and from right to left, to improve segmentation accuracy; the text data after word segmentation and part-of-speech tagging is imported into the keyword extraction stage;
(6) In the keyword extraction stage, word segmentation corpus is input into a keyword extraction module, a weight value of each candidate keyword is calculated through a keyword extraction algorithm based on an LDA topic model to extract specific keywords, and a key phrase is generated; the keyword extraction algorithm based on the LDA topic model is realized as follows:
the keyword extraction algorithm based on the LDA topic model in the keyword extraction module is to calculate and compare the weight values of candidate keywords, screen out keywords in a text, and comprises 5 parts, namely:
(a) Part-of-speech filtering: the corpus subjected to word segmentation and part-of-speech tagging is input into a keyword extraction stage, part-of-speech filtering is performed first, and candidate keywords are screened out;
(b) Based on the preliminary weighting of keywords of the inverse TF-IDF algorithm, assigning an initial weight to each candidate keyword, wherein the initial weight of the candidate keyword is as follows:
Weight(i,iTF-IDF)=tf(i)×iidf(i)×length(i)
in the above formula, tf(i) represents the frequency of word i within the corpus entry, iidf(i) is the reciprocal of idf(i) and measures the frequency of occurrence of word i across the corpus, length(i) is the length of word i, and the initial weight value is the product of the three;
(c) Based on keyword weighting of the LDA topic model, the LDA model can acquire topic distribution and word distribution of each piece of data, the corpus in (a) is input into the trained LDA model, probability distribution of each word on the topic and probability distribution of each corpus on the topic are obtained, and the relevance between the words and the corpus are calculated through cosine similarity; keyword weights based on the LDA topic model are defined as:
Weight(i,LDA)=α×Σsim(i,j)+(1-α)×sim(i,p)
wherein sim(i, j) represents the correlation between word i and word j, sim(i, p) represents the correlation between word i and corpus entry p, and the weighting coefficient α is set by the user, defaulting to 0.2 according to test performance;
(d) And calculating the final weight of the keywords, wherein the weight of each candidate keyword is defined as the weighted sum of the weight value of the keywords based on the inverse TF-IDF algorithm and the weight value of the keywords based on the LDA topic model, namely:
Weight(i)=λ×weight(i,iTF-IDF)+(1-λ)×weight(i,LDA)
wherein λ is a weighting coefficient; after calculating a final weight value, extracting a specific keyword for each corpus according to the size of the weight value;
(e) Keyword expansion: the extracted keywords are expanded into key phrases. For each document, the co-occurrence frequency between keywords is computed, i.e., how often two keywords appear in the same text no more than 8 bytes apart; if this frequency reaches the set threshold β and the pair forms a noun/adjective combination, the keywords are merged into a key phrase. The generated key phrases are passed to the clustering stage for the next step;
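Step (e) can be sketched as a co-occurrence count over tagged tokens; the (word, pos, char_offset) token format and the "a"/"n" POS tags are assumptions for the example, not the patent's interface:

```python
from collections import Counter

def expand_keywords(tagged_tokens, keywords, beta=2, max_gap=8):
    """Merge keyword pairs that co-occur within max_gap characters at
    least beta times and follow an adjective+noun or noun+noun pattern.
    tagged_tokens: list of (word, pos, char_offset) triples in text order."""
    counts = Counter()
    occ = [t for t in tagged_tokens if t[0] in keywords]   # keyword occurrences in order
    for (w1, p1, o1), (w2, p2, o2) in zip(occ, occ[1:]):
        if o2 - o1 <= max_gap and (p1, p2) in {("a", "n"), ("n", "n")}:
            counts[(w1, w2)] += 1
    return [f"{w1} {w2}" for (w1, w2), c in counts.items() if c >= beta]
```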
(7) Clustering stage: all key phrases are converted into word2vec word vectors and the similarity between the vectors is computed; vectors whose similarity exceeds the user-set threshold are clustered into the same group. After all key phrases are clustered, the word frequencies are counted and sorted, and the resulting urban domain knowledge information is returned to the client.
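The clustering stage can be sketched as a greedy similarity-threshold pass; the toy 2-D vectors stand in for real word2vec embeddings, and the greedy assignment is one plausible reading of the patent's description:

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def cluster_phrases(phrase_vecs, threshold=0.8):
    """Greedily assign each phrase to the first cluster whose representative
    vector exceeds the similarity threshold, then rank clusters by size
    (phrase frequency), largest first."""
    clusters = []                                # [(representative vector, [phrases])]
    for phrase, vec in phrase_vecs:
        for rep, members in clusters:
            if cosine(rep, vec) > threshold:
                members.append(phrase)
                break
        else:                                    # no similar cluster found
            clusters.append((vec, [phrase]))
    return sorted(((len(m), m) for _, m in clusters), reverse=True)
```

The ranked clusters correspond to the frequency-sorted knowledge items the system returns to the client.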
CN202010497669.1A 2020-06-04 2020-06-04 Urban domain knowledge detection system and method based on LDA topic model Active CN111831802B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010497669.1A CN111831802B (en) 2020-06-04 2020-06-04 Urban domain knowledge detection system and method based on LDA topic model

Publications (2)

Publication Number Publication Date
CN111831802A CN111831802A (en) 2020-10-27
CN111831802B true CN111831802B (en) 2023-05-26

Family

ID=72897567

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010497669.1A Active CN111831802B (en) 2020-06-04 2020-06-04 Urban domain knowledge detection system and method based on LDA topic model

Country Status (1)

Country Link
CN (1) CN111831802B (en)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112395878B (en) * 2020-12-14 2024-01-02 深圳供电局有限公司 Text processing method and system based on electricity price policy
CN112632964B (en) * 2020-12-24 2022-08-26 平安科技(深圳)有限公司 NLP-based industry policy information processing method, device, equipment and medium
CN112862985A (en) * 2020-12-30 2021-05-28 中兴智能交通股份有限公司 System and method for dynamic discount of charging based on parking operation information around parking lot
CN113127595B (en) * 2021-04-26 2022-08-16 数库(上海)科技有限公司 Method, device, equipment and storage medium for extracting viewpoint details of research and report abstract
CN113808742A (en) * 2021-08-10 2021-12-17 三峡大学 LSTM (localized surface technology) attention mechanism disease prediction method based on text feature dimension reduction
CN113641588A (en) * 2021-08-31 2021-11-12 北京航空航天大学 Software intelligibility determination method and system based on LDA topic modeling
CN115168600B (en) * 2022-06-23 2023-07-11 广州大学 Value chain knowledge discovery method under personalized customization
CN116562281A (en) * 2023-07-07 2023-08-08 中国农业科学院农业信息研究所 Method, system and equipment for extracting new words in field based on part-of-speech markers

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106776881A (en) * 2016-11-28 2017-05-31 中国科学院软件研究所 A kind of realm information commending system and method based on microblog
CN107861939A (en) * 2017-09-30 2018-03-30 昆明理工大学 A kind of domain entities disambiguation method for merging term vector and topic model
EP3483748A1 (en) * 2017-11-14 2019-05-15 Atos Information Technology GmbH Assistant bot for controlling a domain specific target system
CN110990565A (en) * 2019-11-20 2020-04-10 广州商品清算中心股份有限公司 Extensible text analysis system and method for public sentiment analysis


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
A fine-grained sentiment analysis algorithm for product reviews based on semantically weakly supervised LDA; Peng Yun; Wan Hongxin; Zhong Linhui; Journal of Chinese Computer Systems (05); full text *


Similar Documents

Publication Publication Date Title
CN111831802B (en) Urban domain knowledge detection system and method based on LDA topic model
US9373086B1 (en) Crowdsource reasoning process to facilitate question answering
Chen et al. A two-step resume information extraction algorithm
CN109299865B (en) Psychological evaluation system and method based on semantic analysis and information data processing terminal
CN111753198A (en) Information recommendation method and device, electronic equipment and readable storage medium
Sharma et al. NIRMAL: Automatic identification of software relevant tweets leveraging language model
US11687826B2 (en) Artificial intelligence (AI) based innovation data processing system
WO2008014702A1 (en) Method and system of extracting new words
WO2008022581A1 (en) Method and device for obtaining the new words and input method system
CN110929038A (en) Entity linking method, device, equipment and storage medium based on knowledge graph
CN110334343B (en) Method and system for extracting personal privacy information in contract
CN114238573B (en) Text countercheck sample-based information pushing method and device
WO2018207013A1 (en) Entity model establishment
Gong et al. Phrase-based hashtag recommendation for microblog posts.
Tang et al. Research on automatic labeling of imbalanced texts of customer complaints based on text enhancement and layer-by-layer semantic matching
Rafail et al. Natural language processing
Schatten et al. An introduction to social semantic web mining & big data analytics for political attitudes and mentalities research
Shahade et al. Multi-lingual opinion mining for social media discourses: an approach using deep learning based hybrid fine-tuned smith algorithm with adam optimizer
CN111753540B (en) Method and system for collecting text data to perform Natural Language Processing (NLP)
Voronov et al. Forecasting popularity of news article by title analyzing with BN-LSTM network
KR102454261B1 (en) Collaborative partner recommendation system and method based on user information
Dziczkowski et al. An autonomous system designed for automatic detection and rating of film reviews
Alashri et al. Lexi-augmenter: Lexicon-based model for tweets sentiment analysis
CN113807429B (en) Enterprise classification method, enterprise classification device, computer equipment and storage medium
Hingmire et al. CLICKER: A Computational LInguistics Classification Scheme for Educational Resources

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant