CN110347828B

CN110347828B - Subway passenger demand dynamic acquisition method and acquisition system thereof

Info

Publication number: CN110347828B
Application number: CN201910561357.XA
Authority: CN
Inventors: 黎荣; 黎伟洋; 王建; 丁国富; 张义军; 韩鑫; 郑宇飞
Original assignee: Southwest Jiaotong University
Current assignee: Southwest Jiaotong University
Priority date: 2019-06-26
Filing date: 2019-06-26
Publication date: 2022-03-15
Anticipated expiration: 2039-06-26
Also published as: CN110347828A

Abstract

The invention discloses a dynamic subway passenger demand acquisition method and an acquisition system thereof. The method comprises the following steps: step 1: constructing a demand word bank and acquiring user text data from a social network platform; step 2: preprocessing the acquired data; step 3: filtering out texts irrelevant to subway passenger demands using a support vector machine classifier; step 4: performing relevance clustering; step 5: assigning each cluster a label as a demand item and calculating the importance of the demand item; step 6: judging whether the demand item already exists in the demand word bank; if so, exiting; otherwise, judging whether the importance and the relative propagation persistence of the demand item both meet preset thresholds; if so, a new demand item has been found and is added to the demand word bank; if not, exiting. The invention can process a large volume of user texts, improves the efficiency of user demand acquisition and has low subjectivity; demand preferences and potential user demands can be acquired from massive user messages in real time.

Description

Subway passenger demand dynamic acquisition method and acquisition system thereof

Technical Field

The invention relates to the acquisition of passenger demands, and in particular to a dynamic subway passenger demand acquisition method and an acquisition system thereof.

Background

Over the last 10 years, the transportation capacity of railways has gradually increased, and passenger turnover has grown with it. The growth in passenger volume and turnover on urban and high-speed railways has further increased the density of the rail transit network and the number of orders for subway vehicles. This brings both opportunities and challenges to subway vehicle manufacturers. The customers of rail vehicle manufacturers include operating companies and passengers; at present, however, manufacturers mainly attend to the requirements of operating companies and lack analysis of passenger requirements. This lowers end customers' satisfaction with their products and hinders improvement of the enterprises' market competitiveness.

Passenger requirements, including the requirement items and their importance, change dynamically over time. Existing acquisition methods, such as questionnaires, consume a large amount of manpower and are highly subjective when used to acquire dynamic passenger requirements, which restricts rail vehicle manufacturers' analysis of passenger demand.

Disclosure of Invention

The invention provides a subway passenger demand dynamic acquisition method and system with high data acquisition efficiency and low subjectivity.

The technical scheme adopted by the invention is as follows: a dynamic subway passenger demand acquisition method comprises the following steps:

step 1: constructing a demand word bank, and acquiring user text data from a social network platform according to the demand word bank;

step 2: preprocessing the data acquired in the step 1;

step 3: filtering out texts irrelevant to subway passenger demands using a support vector machine classifier;

step 4: performing relevance clustering on the texts filtered in step 3 using a K-means clustering method corrected by the contour coefficient;

step 5: assigning each cluster from step 4 a label as a demand item, and calculating the importance of the demand item;

step 6: judging whether the demand item obtained in step 5 already exists in the demand word bank; if so, exiting; otherwise, judging whether the importance and the relative propagation persistence of the demand item both meet preset thresholds; if so, a new demand item has been found and is added to the demand word bank; if not, exiting.

Further, the data acquisition process in step 1 is as follows:

using the words in the demand word bank as keywords to search the social network platform for user texts, and acquiring the text data through a web crawler.

Further, the specific process of step 3 is as follows:

s11: randomly sampling the text preprocessed in the step 2 to generate a training sample and a test sample;

s12: determining relevant texts and irrelevant texts from the training samples, and determining the feature words of each: calculating the information entropy of the training samples and the information gain value of each word, and taking words whose gain values exceed a set threshold as feature words;

the information entropy IG(X) of the training samples is calculated as follows:

in the formula: X is the training sample set, and N₁ and N₂ are the numbers of relevant and irrelevant texts, respectively;

the information gain value IG(word) of each word is calculated as follows:

in the formula: word is a word in the training sample set, A and B are the frequencies with which the word occurs in the relevant and irrelevant texts, respectively, and C and D are the frequencies with which the word does not occur in the relevant and irrelevant texts, respectively;

s13: calculating the characteristic value of the characteristic word in each text, and expressing the text as a characteristic value vector;

s14: constructing a support vector machine classifier from the training samples, and refining the classifier using the test samples;

s15: classifying the data into demand-related texts and irrelevant texts using the support vector machine classifier obtained in step S14, and removing the irrelevant texts.
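The expressions for the entropy and the information gain in step S12 are not shown above. Under the standard two-class definition they would take roughly the following form. This is a sketch that assumes base-2 logarithms, reads A, B, C and D as counts of texts (so that A + C = N₁ and B + D = N₂), and takes 0·log 0 as 0; none of these choices is confirmed by the source.

```latex
\[
IG(X) = -\frac{N_1}{N_1+N_2}\log_2\frac{N_1}{N_1+N_2}
        -\frac{N_2}{N_1+N_2}\log_2\frac{N_2}{N_1+N_2}
\]
\[
IG(\mathrm{word}) = IG(X)
  + \frac{A+B}{N_1+N_2}\left(\frac{A}{A+B}\log_2\frac{A}{A+B}
                           + \frac{B}{A+B}\log_2\frac{B}{A+B}\right)
  + \frac{C+D}{N_1+N_2}\left(\frac{C}{C+D}\log_2\frac{C}{C+D}
                           + \frac{D}{C+D}\log_2\frac{D}{C+D}\right)
\]
```

Read this way, a word that occurs only in irrelevant texts removes most of the uncertainty about a text's class and so receives a gain close to IG(X), while a word that occurs in every text receives a gain of zero.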

Further, the contour-coefficient-corrected K-means clustering method in step 4 first performs K-means clustering and then determines the optimal number of clusters K by the contour coefficient;

the K-means clustering process is as follows:

determining the sum of squared distances dist(S_k) from each point in a cluster to the cluster center:

in the formula: S_k is the text set of each cluster, x_i is the feature value vector of a text in cluster S_k, n_s is the number of texts in cluster S_k, u_k is the cluster center of S_k, and i is the index of a text in the cluster;

wherein u_k is as follows:

the sum of squared distances dist(S) from all samples in the clustering domain to their cluster centers is:

in the formula: K is the number of clusters, S is the total text set, and j is the index of each cluster in the text set;

the contour coefficient L(x_i) is as follows:

in the formula: a(x_i) is the average distance from text x_i to all other texts in the same cluster, and b(x_i) is the average distance from text x_i to all texts in another cluster;

the average contour coefficient L(x)_k is:

in the formula: n is the number of texts in the whole text set;

and when the average contour coefficient is maximum, the corresponding cluster number k is the optimal cluster number.
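The expressions referred to in step 4 are not shown above. From the symbol definitions given, they would plausibly read as follows. This is a sketch that assumes Euclidean distance and the standard silhouette definition, with b(x_i) taken over the nearest other cluster; the bar over L is notation introduced here for the average.

```latex
\[
\mathrm{dist}(S_k) = \sum_{i=1}^{n_s} \lVert x_i - u_k \rVert^2 ,
\qquad
u_k = \frac{1}{n_s}\sum_{i=1}^{n_s} x_i ,
\qquad
\mathrm{dist}(S) = \sum_{j=1}^{K} \mathrm{dist}(S_j)
\]
\[
L(x_i) = \frac{b(x_i) - a(x_i)}{\max\{a(x_i),\, b(x_i)\}} ,
\qquad
\bar{L} = \frac{1}{n}\sum_{i=1}^{n} L(x_i)
\]
```

Under this definition L(x_i) lies between -1 and 1; values near 1 indicate that a text is much closer to its own cluster than to any other.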

Further, the step 5 importance calculation process is as follows:

s21: the propagation heat r_k is as follows:

in the formula: n_s is the number of texts in each cluster, Z_i is the forwarding count of the i-th text in each cluster, D_i is the like count of the i-th text in each cluster, P_i is the comment count of the i-th text in each cluster, w₁, w₂ and w₃ are constants, and k is the cluster number;

s22: the propagation heat is corrected by the propagation breadth:

r′_k＝r_k×g_k

in the formula: r′_k is the corrected propagation heat, g_k is the propagation breadth, g_k＝l_s/n_s, and l_s is the number of users who post texts in each cluster;

s23: the importance R_k is calculated as follows:

in the formula: S is the total number of text sets, r′_i is the corrected propagation heat of the i-th demand item, and i is the demand item index.
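The importance calculation of steps S21 to S23 can be illustrated with a short Python sketch. The expressions for r_k and R_k are not shown above, so two forms are assumed here: r_k is the weighted sum of forwards, likes and comments over the texts of a cluster, and R_k is each cluster's corrected heat as a share of the total over all demand items. The record layout, function names and default weights are hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical record layout: (user_id, forwards Z_i, likes D_i, comments P_i)
Post = Tuple[str, int, int, int]


def propagation_heat(posts: List[Post], w1: float, w2: float, w3: float) -> float:
    """r_k, assumed here to be the weighted interaction count summed over a cluster."""
    return sum(w1 * z + w2 * d + w3 * p for _, z, d, p in posts)


def corrected_heat(posts: List[Post], w1: float, w2: float, w3: float) -> float:
    """r'_k = r_k * g_k with g_k = l_s / n_s (distinct users over texts)."""
    n_s = len(posts)
    if n_s == 0:
        return 0.0
    l_s = len({user for user, *_ in posts})
    return propagation_heat(posts, w1, w2, w3) * l_s / n_s


def importance(clusters: Dict[str, List[Post]],
               w1: float = 0.4, w2: float = 0.3, w3: float = 0.3) -> Dict[str, float]:
    """R_k, assumed here to be r'_k divided by the sum of r'_i over all demand items."""
    heats = {name: corrected_heat(posts, w1, w2, w3) for name, posts in clusters.items()}
    total = sum(heats.values())
    return {name: (heat / total if total else 0.0) for name, heat in heats.items()}
```

In this sketch a cluster in which one user posts many texts has a propagation breadth below 1, so its heat is scaled down relative to a cluster with the same interactions spread over many users.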

Further, the relative propagation persistence in step 6 is calculated as follows:

s31: the propagation persistence j_k is as follows:

in the formula: r′_k0, r′_k1 and r′_k2 are the propagation heats acquired in three consecutive time periods, wherein r′_k0 is the propagation heat obtained in the current acquisition;

s32: the relative propagation persistence J_k is:

in the formula: S is the total number of text sets, j_i is the propagation persistence of the i-th demand item, and i is the demand item index.

Further, in step S13, the feature value is measured by the term frequency-inverse document frequency (TF-IDF), which is calculated as follows:

TF-IDF(word)＝TF(word)×IDF(word)

in the formula: TF(word) is the frequency with which a word appears in one text, and IDF(word) is the inverse document frequency of the word in the text set;

wherein:

in the formula: w(word) is the number of times a word appears in a text, W is the total number of words in that text, F is the total number of words in the training sample, and F(word) is the number of times the word appears in the training sample.
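A minimal Python sketch of the feature value vector of step S13 follows. The expressions for TF and IDF are not shown above, so it assumes TF(word) = w(word)/W and IDF(word) = log(F/F(word)), and it reads F and F(word) as counts of texts (the usual document-frequency reading), which may differ from the patent's exact definition. The function name is hypothetical.

```python
import math
from collections import Counter
from typing import List


def tf_idf_vectors(texts: List[List[str]], feature_words: List[str]) -> List[List[float]]:
    """Represent each tokenized text as a TF-IDF vector over the feature words."""
    n_texts = len(texts)                  # F, read here as the number of texts
    doc_freq: Counter = Counter()         # F(word), read here as texts containing the word
    for tokens in texts:
        doc_freq.update(set(tokens))

    vectors = []
    for tokens in texts:
        counts = Counter(tokens)          # w(word) for this text
        total = len(tokens)               # W for this text
        row = []
        for word in feature_words:
            if total == 0 or doc_freq[word] == 0:
                row.append(0.0)
                continue
            tf = counts[word] / total
            idf = math.log(n_texts / doc_freq[word])   # assumed form of IDF
            row.append(tf * idf)
        vectors.append(row)
    return vectors
```

With this form, a word that appears in every text gets an IDF of zero and contributes nothing to the vector, which matches the intent of down-weighting words that do not distinguish texts.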

A dynamic subway passenger demand acquisition system comprises a data acquisition module, a text preprocessing module, a text filtering module, a text clustering module, a demand extraction module, a new demand evaluation module and a demand lexicon;

the demand lexicon is used for storing demand items related to the passenger demands of the subway vehicles;

the data acquisition module is used for acquiring text data in the social network platform;

the text preprocessing module is used for preprocessing the acquired text;

the text filtering module is used for filtering out texts irrelevant to passenger requirements from the texts;

the text clustering module is used for performing relevance clustering on the filtered text data;

the demand extraction module is used for extracting demand items in each cluster;

and the new demand evaluation module is used for judging whether the demand item is contained in the demand word stock and updating the demand word stock.

The invention has the beneficial effects that:

(1) the invention obtains a large volume of user texts through a web crawler and derives passenger demands from them, which improves the efficiency of demand acquisition and reduces subjectivity;

(2) the invention can analyze the dynamic demands of massive numbers of users in real time and continuously capture passengers' demand preferences, thereby obtaining the effective importance of passenger demands;

(3) the invention can automatically discover emerging and potential user demands in real time.

Drawings

FIG. 1 is a schematic flow chart of the method of the present invention.

FIG. 2 is a diagram illustrating a result of calculating a contour coefficient according to an embodiment of the present invention.

FIG. 3 is a schematic diagram of the system of the present invention.

FIG. 4 is a schematic diagram illustrating a trend of a change in demand of passengers according to an embodiment of the present invention.

Detailed Description

The invention is further described with reference to the following figures and specific embodiments.

As shown in FIG. 1, a dynamic subway passenger demand acquisition method comprises the following steps:

Step 1: constructing a demand word bank, and acquiring user texts from the social network platform based on the demand word bank. The demand word bank is a set of words related to the passenger demands of subway vehicles, and comprises passenger demand items, rail vehicle product names and the like. Words in the demand word bank, such as "subway speed", are used as keywords to retrieve relevant user texts from the social network platform, and the text data are acquired with web crawler technology. In this embodiment, the search uses keywords such as "wifi of subway", "speed of subway" and "stability of subway".

The words such as the passenger requirement item (such as speed), the subway vehicle product name (such as subway) and the like stored in the requirement word library are predefined according to actual expressions, and the contents can be continuously enriched through the subsequent steps of the technical scheme of the invention.

Step 2: preprocessing the data acquired in the step 1;

the preprocessing comprises preliminary filtering, word segmentation, part-of-speech tagging and the like of the acquired text, in the following three steps:

1) Formulating filtering rules according to the posting characteristics of the social platform, and preliminarily filtering the texts according to these rules. The filtering rules are the basis of the preliminary filtering and are written as production rules. Whether a text is filtered is decided by checking whether it contains noise characters (such as, #, [ phi ]).

2) Performing word segmentation and part-of-speech tagging on the preliminarily filtered texts. Word segmentation splits a text into words, and part-of-speech tagging attaches labels such as noun and verb to the segmented words.

3) Filtering out words without substantive meaning, in two parts: stop words such as "the" are filtered using an existing stop word list, and words other than nouns, verbs and adjectives, such as adverbs and pronouns, are filtered by part of speech.
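The three preprocessing steps can be sketched in Python as follows. Word segmentation and part-of-speech tagging are assumed to be done by an external tool that yields (word, tag) pairs; the tag names, the function names and the reduction of the production rules to a simple character test are all assumptions made for illustration.

```python
from typing import Iterable, List, Set, Tuple

# Hypothetical tag names; a real tagger defines its own tag set.
KEPT_POS: Set[str] = {"noun", "verb", "adjective"}


def has_noise(text: str, noise_chars: Iterable[str]) -> bool:
    """Preliminary filter: flag a text that contains any configured noise character."""
    return any(ch in text for ch in noise_chars)


def filter_tokens(tagged: List[Tuple[str, str]],
                  stop_words: Set[str],
                  kept_pos: Set[str] = KEPT_POS) -> List[str]:
    """Keep only nouns, verbs and adjectives that are not in the stop word list."""
    return [word for word, pos in tagged
            if pos in kept_pos and word not in stop_words]
```

For instance, a tagged text such as [("the", "determiner"), ("subway", "noun"), ("is", "verb"), ("very", "adverb"), ("noisy", "adjective")] with stop words {"the", "is"} would be reduced to the words "subway" and "noisy".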

Step 3: filtering out texts irrelevant to subway passenger demands using a support vector machine classifier.

After steps 1 and 2, part of the noise has been filtered out, but a large amount remains. This remaining noise consists of texts whose main subject is not the subway vehicle but which still contain the keywords used for retrieval in step 1. For example, a text remarking on the speed of someone taking a seat on the subway contains the keyword "speed" but does not reflect passengers' requirements for subway vehicles. Filtering such noise texts can be treated as binary classification of the texts, which mainly comprises the following steps:

S11: randomly sampling the preprocessed texts and manually generating training samples and test samples. The sampling must follow two principles: first, the samples must cover the content retrieved by every keyword in step 1; second, the number of samples drawn for each keyword must be proportional to the amount of content that keyword retrieved.
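The two sampling principles can be sketched in Python as below. The function name, the seed parameter and the rule of drawing at least one text per keyword are assumptions made for illustration; because of rounding, the total drawn may differ slightly from the requested sample size.

```python
import random
from typing import Dict, List


def proportional_sample(texts_by_keyword: Dict[str, List[str]],
                        sample_size: int,
                        seed: int = 0) -> Dict[str, List[str]]:
    """Draw texts per keyword in proportion to how many texts each keyword retrieved."""
    rng = random.Random(seed)
    total = sum(len(texts) for texts in texts_by_keyword.values())
    sample: Dict[str, List[str]] = {}
    for keyword, texts in texts_by_keyword.items():
        if not texts:
            sample[keyword] = []
            continue
        # Every keyword with results contributes at least one text (first principle);
        # beyond that, the quota follows the keyword's share of all texts (second principle).
        quota = max(1, round(sample_size * len(texts) / total))
        sample[keyword] = rng.sample(texts, min(quota, len(texts)))
    return sample
```

The drawn texts would then be labeled by hand as relevant or irrelevant and split into training and test samples.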

S12: selecting, based on the training samples, feature words that can distinguish relevant texts from irrelevant texts, such as "mom" and "rob seat". Feature selection by information gain is used, as follows:

information gain is a feature selection method that determines feature words according to the amount of information the words carry; the amount of information is represented by information entropy, which is calculated as follows:

the information gain value IG(word) of each word is calculated as follows:

in the formula: word is a word in the training sample set, A and B are the frequencies with which the word occurs in the relevant and irrelevant texts, respectively, and C and D are the frequencies with which the word does not occur in the relevant and irrelevant texts, respectively.

The words are sorted in descending order of information gain, and the words with larger values are selected as feature words. Part of the results for this embodiment are shown in Table 1:

TABLE 1 Information gain value ranking

Rank	Word	Information gain value
1	Robbing seat	0.9744340029
2	Mother	0.9631205685
3	Punching card	0.8819280948
4	Sprint for acupuncture	0.8529583405
5	Transfer of	0.8329984805
…	…	…
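The ranking by information gain can be sketched in Python as follows. The sketch assumes the standard two-class information gain with base-2 logarithms and reads A, B, C and D as counts of texts that do or do not contain the word; the patent's exact expression is not reproduced in this text, so the values will not necessarily match Table 1. Function names are hypothetical.

```python
import math
from typing import List, Tuple


def entropy(counts: List[int]) -> float:
    """Shannon entropy in bits of a class distribution given as counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)


def information_gain(a: int, b: int, c: int, d: int) -> float:
    """Gain of a word present in a relevant and b irrelevant texts, absent in c and d."""
    n = a + b + c + d
    if n == 0:
        return 0.0
    prior = entropy([a + c, b + d])                 # relevant vs. irrelevant texts
    with_word = (a + b) / n * entropy([a, b])
    without_word = (c + d) / n * entropy([c, d])
    return prior - with_word - without_word


def rank_words(relevant: List[List[str]],
               irrelevant: List[List[str]]) -> List[Tuple[str, float]]:
    """Score every word in the labeled samples and sort by descending gain."""
    rel_sets = [set(tokens) for tokens in relevant]
    irr_sets = [set(tokens) for tokens in irrelevant]
    vocabulary = set().union(*rel_sets, *irr_sets)
    scores = []
    for word in vocabulary:
        a = sum(word in tokens for tokens in rel_sets)
        b = sum(word in tokens for tokens in irr_sets)
        gain = information_gain(a, b, len(rel_sets) - a, len(irr_sets) - b)
        scores.append((word, gain))
    return sorted(scores, key=lambda item: (-item[1], item[0]))
```

Words whose gain exceeds the chosen threshold would then be kept as feature words.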

S13: calculating a characteristic value of the characteristic word, and expressing the text as a characteristic value vector;

the term frequency-inverse document frequency (TF-IDF) is a feature value calculation method that jointly considers the frequency with which a word occurs in a text (TF) and its inverse document frequency across the other texts (IDF). It is calculated as follows:

TF-IDF(word)＝TF(word)×IDF(word)

wherein:

S14: constructing a support vector machine classifier from the training samples, and classifying the test samples with it;

according to the classification result of each test sample, the training samples are expanded to increase their coverage of different types of noise, thereby improving the classifier.

Step 4: performing K-means clustering on the data filtered in step 3, and determining the optimal number of clusters K by the contour coefficient.

The K-means clustering process is as follows:

K-means groups texts according to the distance between them; the distance between texts expresses their degree of correlation and is measured by the Euclidean distance. The sum of squared distances dist(S_k) from each point in a cluster to the cluster center is determined:

wherein u_k is as follows:

the objective of K-means clustering is to minimize the sum of squared distances from all samples in the clustering domain to their cluster centers; this sum dist(S) is:

the contour coefficient measures a clustering result by combining cohesion and separation. The larger the contour coefficient, the better the clustering; the smaller, the worse. The contour coefficient is calculated as follows:

in the formula: a(x_i) is the average distance from text x_i to all other texts in its own cluster, which quantifies cohesion within the cluster; b(x_i) is the average distance from text x_i to all texts in another cluster, taken over all other clusters to find the nearest average distance, which quantifies separation between clusters.

The number of clusters is determined using the average contour coefficient of the whole text set. The average contour coefficient L(x)_k is:

in the formula: n is the number of texts in the whole text set;

FIG. 2 shows partial calculation results of the present invention, and it can be seen from FIG. 2 that when K is 4, the maximum average contour coefficient is obtained, i.e. the best result of K-means clustering is obtained.
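Choosing K by the average contour coefficient can be sketched in plain Python as below. The sketch assumes Euclidean distance and the standard silhouette definition (b − a)/max(a, b); the farthest-point seeding, the treatment of single-text clusters as contributing 0, and all function names are choices made for this illustration rather than details taken from the patent. It requires Python 3.8 or later for `math.dist`.

```python
import math
from typing import Dict, Iterable, List, Sequence, Tuple

Point = Sequence[float]


def k_means(points: List[Point], k: int, iterations: int = 100) -> List[int]:
    """Plain K-means returning one cluster label per point (deterministic seeding)."""
    centers = [list(points[0])]
    while len(centers) < k:
        # Seed the next center at the point farthest from the centers chosen so far.
        farthest = max(points, key=lambda p: min(math.dist(p, c) for c in centers))
        centers.append(list(farthest))
    labels: List[int] = []
    for _ in range(iterations):
        new_labels = [min(range(k), key=lambda j: math.dist(p, centers[j]))
                      for p in points]
        if new_labels == labels:
            break
        labels = new_labels
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def mean_silhouette(points: List[Point], labels: List[int]) -> float:
    """Average of (b - a) / max(a, b) over all points; singletons contribute 0."""
    clusters: Dict[int, List[Point]] = {}
    for p, label in zip(points, labels):
        clusters.setdefault(label, []).append(p)
    if len(clusters) < 2:
        return 0.0
    total = 0.0
    for p, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            continue
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        b = min(sum(math.dist(p, q) for q in other) / len(other)
                for other_label, other in clusters.items() if other_label != label)
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(points)


def best_cluster_count(points: List[Point],
                       candidates: Iterable[int]) -> Tuple[int, Dict[int, float]]:
    """Run K-means for each candidate K and keep the K with the largest average."""
    scores = {k: mean_silhouette(points, k_means(points, k)) for k in candidates}
    return max(scores, key=scores.get), scores
```

In practice the points would be the TF-IDF feature value vectors of the filtered texts, and the candidate range would be chosen by the engineer.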

Step 5: a label is extracted from each cluster as the demand item. The words in a cluster are sorted in descending order of their number of occurrences, the most frequent words are recommended to engineers, and the engineers summarize a label for the cluster, namely the demand item. Partial results for one cluster in this embodiment are shown in Table 2; here "subway noise" can be selected as the demand item.

TABLE 2 Number of occurrences of words

Rank	Word	Number of occurrences
1	Subway	541
2	Ear	426
3	Noise	346
4	Sound equipment	312
…	…	…
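The word ranking that is recommended to engineers can be sketched in a few lines of Python. The function name and the cut-off parameter n are hypothetical; the input is assumed to be the tokenized texts of one cluster after preprocessing.

```python
from collections import Counter
from typing import List, Tuple


def top_words(cluster_texts: List[List[str]], n: int = 10) -> List[Tuple[str, int]]:
    """Return the n most frequent words in one cluster with their occurrence counts."""
    counts = Counter(word for tokens in cluster_texts for word in tokens)
    return counts.most_common(n)
```

The engineer would read the returned list, as in Table 2, and summarize it into a demand item such as "subway noise".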

The importance of a passenger demand item is measured by the relative propagation heat of the demand item. The propagation heat is calculated as follows:

w₁, w₂ and w₃ are the weights of forwarding, likes and comments, respectively, and satisfy w₁+w₂+w₃＝1.

To offset the effect of users posting texts repeatedly, the propagation breadth g_k＝l_s/n_s is used to correct the propagation heat, where l_s is the number of users who post texts in each cluster. The corrected propagation heat is:

r′_k＝r_k×g_k

in the formula: r′_k is the corrected propagation heat, g_k is the propagation breadth, and l_s is the number of users who post texts in each cluster;

the relative propagation heat, namely the importance, is calculated as follows:

Step 6: judging whether the demand item obtained in step 5 already exists in the demand word bank; if so, exiting; otherwise, judging whether the importance and the relative propagation persistence of the demand item both meet preset thresholds; if so, adding the demand item to the demand word bank; if not, exiting.

The obtained demand is evaluated according to the propagation heat and propagation persistence of the demand item to judge whether it is a new demand. Demand items that are not in the demand word bank may appear among the acquired passenger demand items; these are assessed by propagation heat and propagation persistence to decide whether they can be added to the demand word bank as new demand items. This mainly comprises two steps:

1) matching the acquired demand item against the existing demand items in the demand word bank to determine whether it is absent from the word bank;

2) comparing the relative propagation heat and the relative propagation persistence of the demand item with preset thresholds. Propagation persistence measures how well the propagation of a new demand is sustained over time.

The propagation persistence j_k is as follows:

in the formula: r′_k0, r′_k1 and r′_k2 are the propagation heats acquired in three consecutive time periods, wherein r′_k0 is the propagation heat obtained in the current acquisition. The acquisition is dynamic, that is, data are automatically retrieved from the social network platform at intervals. The current acquisition is the acquisition period in which the emerging or potential demand is discovered; r′_k1 and r′_k2 are the propagation heats of the first and second subsequent acquisition periods, respectively.

The relative propagation persistence J_k is:

When the relative propagation heat and the relative propagation persistence of the new demand are simultaneously greater than the set threshold, the new demand can be used as a candidate new demand and then judged manually.
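The decision logic of step 6 can be sketched in Python as follows. The relative propagation heat and relative propagation persistence are taken as already-computed inputs, since the persistence expression is not reproduced in this text; the function names and threshold parameters are hypothetical.

```python
from typing import Set


def is_candidate_new_demand(item: str,
                            word_bank: Set[str],
                            relative_heat: float,
                            relative_persistence: float,
                            heat_threshold: float,
                            persistence_threshold: float) -> bool:
    """True when the item is absent from the word bank and both values exceed their thresholds."""
    if item in word_bank:
        return False
    return (relative_heat > heat_threshold
            and relative_persistence > persistence_threshold)


def update_word_bank(word_bank: Set[str], item: str, confirmed_by_engineer: bool) -> Set[str]:
    """Add a candidate to the word bank only after manual confirmation."""
    return word_bank | {item} if confirmed_by_engineer else set(word_bank)
```

A candidate that passes both thresholds is only a recommendation; the word bank changes after an engineer confirms it.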

According to the method, a dynamic subway passenger demand acquisition system can be constructed, comprising a data acquisition module, a text preprocessing module, a text filtering module, a text clustering module, a demand extraction module, a new demand evaluation module and a demand lexicon. The system further comprises a demand checking module and a demand lexicon management module, through which engineers use and maintain the system.

The requirement word bank is used for storing requirement items related to the requirements of subway vehicle passengers, and specifically words related to the requirements of the subway vehicle passengers.

The data acquisition module is used for acquiring text data from the social network platform: it takes the demand items in the demand word bank as keywords and captures the relevant text data from the social network platform. In addition, data can be acquired in real time by setting the acquisition frequency of the module.

The text preprocessing module is used for preprocessing the acquired text: it preliminarily filters the acquired text according to the filtering method above, performs word segmentation and part-of-speech tagging on the filtered text, and filters out words without substantive meaning based on the stop word list and parts of speech.

The text filtering module is used for filtering out texts irrelevant to passenger demands: it obtains feature words that can identify the text type using information gain feature selection, calculates the feature values of the feature words using TF-IDF, vectorizes the feature values of each text, and feeds the vectors to a support vector machine classifier, which outputs the filtered texts.

The text clustering module is used for performing relevance clustering on the filtered text data: it clusters the filtered texts with the K-means algorithm and determines the number of clusters using the average contour coefficient.

The demand extraction module is used for extracting the demand items, that is, the passenger demands, in each cluster: by calculating the frequency of occurrence of each word in each cluster, the module recommends the more frequent words to the engineer, who gives the name of the demand item. The importance of the demand item is determined by the relative propagation heat.

The new demand evaluation module is used for judging whether a demand item is already contained in the demand word bank and for updating the demand word bank: it compares the relative propagation heat and the relative propagation persistence of a new demand with the set thresholds and, when the thresholds are met, recommends the new demand to an engineer; if the engineer confirms it as a new demand, it is stored in the demand word bank and the word bank is updated.

The passenger demand extraction and evaluation system can further be provided with a demand checking module and a demand lexicon management module. The demand checking module uses a display to provide a visual interface for extracting, evaluating and viewing passenger demands. The demand extraction and evaluation processes are the same as the corresponding steps above and are not described again. In addition, the demand items extracted by the demand extraction module and their calculated importance are displayed in graph form, for example, for a subway vehicle, in the form shown in FIG. 4. In the graph, curve A is the importance change curve of subway wifi, curve B that of subway stability, curve C that of subway speed, and curve D that of subway noise.

And the demand lexicon management module is used for maintaining the demand lexicon, continuously enriching the demand lexicon according to the acquired new demand, and modifying and deleting the demand.

Existing methods for acquiring the passenger demands of subway vehicles consume a large amount of manpower and are highly subjective. The invention therefore provides a dynamic subway passenger demand acquisition method and system based on a social network platform, which applies text mining, a data mining technique, to the texts of social network platform users in order to reflect passengers' requirements for subway vehicles. Compared with traditional methods, it can automatically analyze a large volume of user texts, acquire potential passenger demands, improve the efficiency of user data acquisition and reduce subjective influence. It can analyze the dynamic demands of massive numbers of users in real time, continuously capture passengers' demand preferences and extract the effective importance of passenger demands; in addition, emerging and potential user demands can be discovered automatically in real time and used as drivers for the research and development of subway vehicles.

Claims

1. A dynamic subway passenger demand acquisition method is characterized by comprising the following steps:

step 2: preprocessing the data acquired in the step 1;

step 3: filtering out texts irrelevant to subway passenger demands using a support vector machine classifier, specifically:

the information gain value IG(word) of each word is calculated as follows:

s15: classifying the data by using the support vector classifier obtained in the step S14 into a demand-related text and an irrelevant text, and removing the irrelevant text;

step 6: judging whether the demand item obtained in step 5 already exists in the demand word bank; if so, exiting; otherwise, judging whether the importance and the relative propagation persistence of the demand item both meet preset thresholds; if so, a new demand item has been found and is added to the demand word bank; if not, exiting;

the relative propagation persistence is calculated as follows:

s31: the propagation persistence j_k is as follows:

s32: the relative propagation persistence J_k is:

2. The method for dynamically acquiring the demand of the subway passengers as claimed in claim 1, wherein said step 1 data acquisition process is as follows:

3. The method for dynamically acquiring the demand of the subway passengers as claimed in claim 1, wherein said K-means clustering method for contour coefficient modification in step 4 is to first cluster by K-means and then determine the optimal cluster number K by contour coefficient;

the K-means clustering process is as follows:

wherein u_k is as follows:

the contour coefficient L(x_i) is as follows:

the average contour coefficient L(x)_k is:

in the formula: n is the text number of the whole text set;

4. The method of claim 1, wherein the step 5 importance calculation process comprises the following steps:

s21: the propagation heat r_k is as follows:

s22: the propagation heat is corrected by the propagation breadth:

r_k′＝r_k×g_k

s23: the importance R_k is calculated as follows:

in the formula: S is the total number of text sets, r_i′ is the corrected propagation heat of the i-th demand item, and i is the demand item index.

5. The method of claim 1, wherein the feature values in step S13 are measured by a term frequency-inverse document term frequency, and the term frequency-inverse document term frequency TF-IDF calculation method is as follows:

TF-IDF(word)＝TF(word)×IDF(word)

wherein:

in the formula: w (word) is the number of times a word appears in a text, W is the total number of words in the text where the word is located, F is the total number of words in the training sample, and F (word) is the number of times a word appears in the training sample.

6. An acquisition system for the dynamic subway passenger demand acquisition method according to any one of claims 1 to 5, comprising a data acquisition module, a text preprocessing module, a text filtering module, a text clustering module, a demand extraction module, a new demand evaluation module and a demand lexicon;

the text preprocessing module is used for preprocessing the acquired text;