CN110347828B - Subway passenger demand dynamic acquisition method and acquisition system thereof - Google Patents
Subway passenger demand dynamic acquisition method and acquisition system thereof
- Publication number
- CN110347828B (application number CN201910561357.XA)
- Authority
- CN
- China
- Prior art keywords
- text
- word
- demand
- cluster
- texts
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/951—Indexing; Web crawling techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/953—Querying, e.g. by the use of web search engines
- G06F16/9535—Search customisation based on user profiles and personalisation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/23—Clustering techniques
- G06F18/232—Non-hierarchical techniques
- G06F18/2321—Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
- G06F18/23213—Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/2411—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
Abstract
The invention discloses a method and a system for dynamically acquiring subway passenger demand. The method comprises the following steps: step 1: constructing a demand lexicon, and acquiring user text data from a social network platform; step 2: preprocessing the acquired data; step 3: filtering out texts irrelevant to subway passenger demand using a support vector machine classifier; step 4: performing relevance clustering; step 5: assigning each cluster a label as a demand item, and calculating the importance of the demand item; step 6: judging whether the demand item already exists in the demand lexicon; if so, exiting; otherwise, judging whether the importance and the relative propagation persistence of the demand item both meet preset thresholds; if so, a new demand item has been found and is added to the demand lexicon; if not, exiting. The invention can process a large volume of user texts, improves the efficiency of acquiring user demand, and has low subjectivity; demand preferences and potential user demands can be acquired in real time from massive numbers of user messages.
Description
Technical Field
The invention relates to the acquisition of subway passenger demand, and in particular to a method and a system for dynamically acquiring subway passenger demand.
Background
Over the last 10 years, the transport capacity of railways has gradually increased, and passenger turnover has grown with it. Growth in the passenger volume and turnover of urban and high-speed railways has further increased the density of rail transit networks and the number of orders for subway vehicles. This brings both opportunities and challenges for subway vehicle manufacturers. The customers of rail vehicle manufacturers include operating companies and passengers; at present, however, manufacturers mainly attend to the requirements of operating companies and lack analysis of passenger demand, which lowers end-customer satisfaction with their products and hinders their market competitiveness.
Passenger demand, including the demand items and their importance, changes dynamically over time. Existing acquisition methods, such as questionnaires, consume a large amount of manpower when used to capture dynamic passenger demand and are highly subjective, which constrains the analysis of passenger demand by rail vehicle manufacturers.
Disclosure of Invention
The invention provides a subway passenger demand dynamic acquisition method and system with high data acquisition efficiency and low subjectivity.
The technical scheme adopted by the invention is as follows: a dynamic subway passenger demand acquisition method comprises the following steps:
step 1: constructing a demand lexicon, and acquiring user text data from a social network platform according to the demand lexicon;
step 2: preprocessing the data acquired in step 1;
step 3: filtering out texts irrelevant to subway passenger demand using a support vector machine classifier;
step 4: performing relevance clustering on the texts filtered in step 3, using K-means clustering corrected by the silhouette (contour) coefficient;
step 5: assigning each cluster from step 4 a label as a demand item, and calculating the importance of the demand item;
step 6: judging whether the demand item obtained in step 5 already exists in the demand lexicon; if so, exiting; otherwise, judging whether the importance and the relative propagation persistence of the demand item both meet preset thresholds; if so, a new demand item has been found and is added to the demand lexicon; if not, exiting.
Further, the data acquisition process in step 1 is as follows:
using the words in the demand lexicon as keywords to search the social network platform for user texts, and acquiring the text data with a web crawler.
Further, the specific process of step 3 is as follows:
S11: randomly sampling the texts preprocessed in step 2 to generate training samples and test samples;
S12: labeling the training samples as relevant texts or irrelevant texts, calculating the information entropy of the training samples and the information gain value of each word, and taking the words whose gain value exceeds a set threshold as the feature words that distinguish relevant texts from irrelevant texts;
the information entropy IG(X) of the training samples is calculated as follows:
$$IG(X) = -\frac{N_1}{N_1+N_2}\log_2\frac{N_1}{N_1+N_2} - \frac{N_2}{N_1+N_2}\log_2\frac{N_2}{N_1+N_2}$$
in the formula: X is the training sample set, and $N_1$ and $N_2$ are the numbers of relevant texts and irrelevant texts, respectively;
the information gain value IG(word) of each word is calculated as follows:
$$IG(\text{word}) = IG(X) + \frac{A+B}{N_1+N_2}\left(\frac{A}{A+B}\log_2\frac{A}{A+B} + \frac{B}{A+B}\log_2\frac{B}{A+B}\right) + \frac{C+D}{N_1+N_2}\left(\frac{C}{C+D}\log_2\frac{C}{C+D} + \frac{D}{C+D}\log_2\frac{D}{C+D}\right)$$
in the formula: word is a word in the training sample set; A and B are the numbers of relevant texts and irrelevant texts in which the word occurs, respectively; C and D are the numbers of relevant texts and irrelevant texts in which the word does not occur, respectively;
S13: calculating the feature value of each feature word in each text, and representing the text as a feature value vector;
S14: constructing a support vector machine classifier from the training samples, and refining the classifier using the test samples;
S15: classifying the data into demand-relevant texts and irrelevant texts using the support vector machine classifier obtained in S14, and removing the irrelevant texts.
Further, the K-means clustering method corrected by the silhouette (contour) coefficient in step 4 first performs K-means clustering and then determines the optimal number of clusters k from the silhouette coefficient;
the K-means clustering process is as follows:
determining the sum of squared distances $dist(S_k)$ from each point in a cluster to the cluster center:
$$dist(S_k) = \sum_{i=1}^{n_s} \lVert x_i - u_k \rVert^2$$
in the formula: $S_k$ is the text set of each cluster, $x_i$ is the feature value vector of a text in cluster $S_k$, $n_s$ is the number of texts in cluster $S_k$, $u_k$ is the center of cluster $S_k$, and i is the index of a text within the cluster;
wherein $u_k$ is:
$$u_k = \frac{1}{n_s}\sum_{i=1}^{n_s} x_i$$
the sum of squared distances $dist(S)$ from all samples to their cluster centers is:
$$dist(S) = \sum_{j=1}^{k} dist(S_j)$$
in the formula: k is the number of clusters, S is the whole text set, and j is the index of each cluster in the text set;
the silhouette coefficient $L(x_i)$ is:
$$L(x_i) = \frac{b(x_i) - a(x_i)}{\max\{a(x_i),\, b(x_i)\}}$$
in the formula: $a(x_i)$ is the average distance from text $x_i$ to all other texts in the same cluster, and $b(x_i)$ is the average distance from $x_i$ to all texts in another cluster, taken for the nearest such cluster;
the average silhouette coefficient $L(x)_k$ is:
$$L(x)_k = \frac{1}{n}\sum_{i=1}^{n} L(x_i)$$
in the formula: n is the number of texts in the whole text set;
when the average silhouette coefficient is largest, the corresponding cluster number k is the optimal cluster number.
Further, the importance in step 5 is calculated as follows:
S21: the propagation heat $r_k$ is:
$$r_k = \sum_{i=1}^{n_s}\left(w_1 Z_i + w_2 D_i + w_3 P_i\right)$$
in the formula: $n_s$ is the number of texts in each cluster; $Z_i$, $D_i$ and $P_i$ are the numbers of forwards, likes and comments of the i-th text in the cluster, respectively; $w_1$, $w_2$ and $w_3$ are constants; and k is the index of the cluster;
S22: the propagation heat is corrected by the propagation breadth:
$$r'_k = r_k \times g_k$$
in the formula: $r'_k$ is the corrected propagation heat and $g_k$ is the propagation breadth, $g_k = l_s / n_s$, where $l_s$ is the number of users who posted texts in each cluster;
S23: the importance $R_k$ is calculated as:
$$R_k = \frac{r'_k}{\sum_{i=1}^{S} r'_i}$$
in the formula: S is the total number of text sets, $r'_i$ is the corrected propagation heat of the i-th demand item, and i is the demand item index.
Further, the relative propagation persistence in step 6 is calculated as follows:
s31: propagation persistence jkThe following were used:
in the formula: r'k0、r′k1、r′k2Is propagation heat acquired in three consecutive time periods, wherein r'k0The obtained propagation heat is obtained;
s32: relative propagation persistence JkComprises the following steps:
in the formula: s is the total number of text sets, jiAnd i is the propagation persistence of the ith requirement, and is a requirement item label.
Further, in step S13 the feature value is measured by term frequency-inverse document frequency (TF-IDF), which is calculated as follows:
$$TF\text{-}IDF(\text{word}) = TF(\text{word}) \times IDF(\text{word})$$
in the formula: TF(word) is the frequency with which a word appears in one text, and IDF(word) is the inverse document frequency of the word in the text set;
$$TF(\text{word}) = \frac{W(\text{word})}{W}, \qquad IDF(\text{word}) = \log\frac{F}{F(\text{word})}$$
in the formula: W(word) is the number of times the word appears in a text, W is the total number of words in the text, F is the total number of words in the training sample, and F(word) is the number of times the word appears in the training sample.
A dynamic subway passenger demand acquisition system, characterized by comprising a data acquisition module, a text preprocessing module, a text filtering module, a text clustering module, a demand extraction module, a new demand evaluation module and a demand lexicon;
the demand lexicon is used for storing demand items related to the passenger demands of the subway vehicles;
the data acquisition module is used for acquiring text data in the social network platform;
the text preprocessing module is used for preprocessing the acquired text;
the text filtering module is used for filtering out texts irrelevant to passenger requirements from the texts;
the text clustering module is used for performing relevance clustering on the filtered text data;
the demand extraction module is used for extracting demand items in each cluster;
the new demand evaluation module is used for judging whether a demand item is contained in the demand lexicon and for updating the demand lexicon.
The invention has the beneficial effects that:
(1) according to the invention, a large amount of user texts are obtained through the web crawler, the passenger requirements are obtained, the user requirement obtaining efficiency is improved, and the subjectivity is low;
(2) the invention can analyze the dynamic requirements of mass users in real time and continuously capture the requirement preference of passengers, thereby acquiring the effective requirement importance of the passengers.
(3) The invention can automatically discover the emerging and potential user requirements in real time.
Drawings
FIG. 1 is a schematic flow chart of the method of the present invention.
FIG. 2 is a diagram illustrating a result of calculating a contour coefficient according to an embodiment of the present invention.
FIG. 3 is a schematic diagram of the system of the present invention.
FIG. 4 is a schematic diagram illustrating a trend of a change in demand of passengers according to an embodiment of the present invention.
Detailed Description
The invention is further described with reference to the following figures and specific embodiments.
As shown in fig. 1, a dynamic acquiring method for subway passenger demand includes the following steps:
step 1: constructing a demand word bank, and acquiring user text data from a social network platform according to the demand word bank;
User texts are acquired from the social network platform based on the demand lexicon. The demand lexicon is a set of words related to the demands of subway vehicle passengers, including passenger demand items, rail vehicle product names, and the like. Words in the demand lexicon are used as keywords, such as 'subway speed', to retrieve relevant user texts from the social network platform, and the text data is acquired by web crawler technology. In this embodiment, the search uses keywords such as 'wifi of subway', 'speed of subway', and 'stability of subway'.
The words stored in the demand lexicon, such as passenger demand items (for example, speed) and subway vehicle product names (for example, subway), are predefined according to how they are actually expressed, and the lexicon can be continuously enriched through the subsequent steps of the technical scheme of the invention.
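As an illustration of how lexicon entries might be turned into search keywords, the following is a minimal sketch. It assumes the lexicon is split into product names and demand items and that a keyword is formed by pairing one of each; the function name `build_queries` and the dictionary layout are hypothetical, and the crawler itself is not shown.

```python
from itertools import product


def build_queries(product_names, demand_items):
    """Pair every product name with every demand item to form a search keyword."""
    return [f"{name} {item}" for name, item in product(product_names, demand_items)]


# Hypothetical lexicon contents, modeled on the keywords named in the embodiment.
lexicon = {
    "product_names": ["subway"],
    "demand_items": ["wifi", "speed", "stability"],
}

queries = build_queries(lexicon["product_names"], lexicon["demand_items"])
print(queries)
```

Each resulting keyword would then be submitted to the platform's search, and the returned posts collected for preprocessing.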
Step 2: preprocessing the data acquired in the step 1;
The preprocessing comprises preliminary filtering, word segmentation, part-of-speech tagging, and the like of the acquired texts. It consists of the following three steps:
1) Filtering rules are formulated according to the posting characteristics of the social platform, and the texts are preliminarily filtered by these rules. The filtering rules are the basis of the preliminary filtering and are written as production rules. Whether a text is filtered out is decided by checking whether it contains noise characters (such as #).
2) Word segmentation and part-of-speech tagging are performed on the preliminarily filtered texts. Word segmentation splits a text into words; part-of-speech tagging attaches a label such as noun or verb to each segmented word.
3) Words without substantive meaning are filtered out in two parts: stop words such as 'the' are removed using an existing stop-word list, and words other than nouns, verbs and adjectives (such as adverbs and pronouns) are removed according to their part of speech.
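The three preprocessing steps can be sketched as follows. This is only an illustration under stated assumptions: the noise rule (a single `#` check), the stop-word set and the part-of-speech tag names are placeholders, and word segmentation and tagging are assumed to have been done by an external tool, so the input is already a list of `(word, tag)` pairs.

```python
import re

NOISE_PATTERN = re.compile(r"#")      # assumed noise rule: text contains a hashtag marker
STOP_WORDS = {"the", "a", "of"}       # stand-in for an existing stop-word list
KEEP_POS = {"n", "v", "adj"}          # nouns, verbs, adjectives (hypothetical tag names)


def passes_preliminary_filter(text):
    """Step 1: keep a text only if it contains no noise characters."""
    return NOISE_PATTERN.search(text) is None


def filter_tokens(tagged_tokens):
    """Step 3: drop stop words and words that are not nouns, verbs or adjectives."""
    return [
        word
        for word, pos in tagged_tokens
        if pos in KEEP_POS and word.lower() not in STOP_WORDS
    ]


# Step 2 (segmentation and tagging) is assumed to have produced this list.
tagged = [("the", "det"), ("subway", "n"), ("is", "v"), ("very", "adv"), ("noisy", "adj")]
print(passes_preliminary_filter("the subway is very noisy"))
print(filter_tokens(tagged))
```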
Step 3: filtering out texts irrelevant to subway passenger demand using a support vector machine classifier;
After step 1 and step 2, noise has been preliminarily filtered, but the texts still contain a large amount of it. In these noisy texts the main object described is not the subway vehicle, yet the text contains a keyword used for retrieval in step 1. For example, a text remarking on how fast someone grabbed a seat on the subway contains the keyword 'speed' but does not reflect any passenger requirement for subway vehicles. Filtering such noisy texts can be treated as binary text classification, which mainly includes the following steps:
S11: randomly sampling the texts preprocessed in step 2 to generate training samples and test samples;
The preprocessed texts are randomly sampled, and the training samples and test samples are produced manually. Sampling must follow two principles: first, the samples must cover the content retrieved by every keyword in step 1; second, the number of samples drawn from the content retrieved by each keyword must be proportional to the amount of content retrieved by that keyword.
S12: labeling the training samples as relevant texts or irrelevant texts, calculating the information entropy of the training samples and the information gain value of each word, and taking the words whose gain value exceeds a set threshold as the feature words that distinguish relevant texts from irrelevant texts;
Feature words that can distinguish relevant texts from irrelevant texts, such as 'mom' and 'rob seat', are selected based on the training samples using information gain feature selection:
Information gain is a feature selection method that determines feature words by the amount of information a word carries. The amount of information is expressed by information entropy, calculated as follows:
$$IG(X) = -\frac{N_1}{N_1+N_2}\log_2\frac{N_1}{N_1+N_2} - \frac{N_2}{N_1+N_2}\log_2\frac{N_2}{N_1+N_2}$$
in the formula: X is the training sample set, and $N_1$ and $N_2$ are the numbers of relevant texts and irrelevant texts, respectively;
the information gain value IG(word) of each word is calculated as follows:
$$IG(\text{word}) = IG(X) + \frac{A+B}{N_1+N_2}\left(\frac{A}{A+B}\log_2\frac{A}{A+B} + \frac{B}{A+B}\log_2\frac{B}{A+B}\right) + \frac{C+D}{N_1+N_2}\left(\frac{C}{C+D}\log_2\frac{C}{C+D} + \frac{D}{C+D}\log_2\frac{D}{C+D}\right)$$
in the formula: word is a word in the training sample set; A and B are the numbers of relevant texts and irrelevant texts in which the word occurs, respectively; C and D are the numbers of relevant texts and irrelevant texts in which the word does not occur, respectively.
The words are sorted in descending order of information gain, and the words with larger values are selected as feature words. Partial results for this embodiment are shown in Table 1:
TABLE 1 Ranking by information gain value
Rank | Word | Information gain value |
1 | Robbing seat | 0.9744340029 |
2 | Mother | 0.9631205685 |
3 | Punching card | 0.8819280948 |
4 | Sprint for acupuncture | 0.8529583405 |
5 | Transfer of | 0.8329984805 |
… | … | … |
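The selection of feature words by information gain can be sketched as below. This is an illustrative sketch that assumes the standard two-class information gain with base-2 logarithms, where `a` and `b` count the relevant and irrelevant texts containing a word and `c` and `d` count those not containing it. The function names, the example counts and the threshold are hypothetical and are not taken from the embodiment.

```python
from math import log2


def _plogp(p):
    """p * log2(p), with the usual convention that 0 * log2(0) = 0."""
    return p * log2(p) if p > 0 else 0.0


def entropy(n1, n2):
    """Entropy of a two-class split with n1 and n2 members."""
    total = n1 + n2
    return -(_plogp(n1 / total) + _plogp(n2 / total))


def information_gain(a, b, c, d):
    """Entropy of the whole sample minus the entropy remaining after splitting on the word."""
    n = a + b + c + d
    gain = entropy(a + c, b + d)
    for x, y in ((a, b), (c, d)):
        if x + y:
            gain -= (x + y) / n * entropy(x, y)
    return gain


def select_features(counts, threshold):
    """Return words whose gain exceeds the threshold, highest gain first."""
    scored = {word: information_gain(*abcd) for word, abcd in counts.items()}
    return sorted((w for w, g in scored.items() if g > threshold), key=lambda w: -scored[w])


# Hypothetical (A, B, C, D) counts for two words over 20 labeled texts.
counts = {"rob seat": (1, 9, 9, 1), "subway": (5, 5, 5, 5)}
print(select_features(counts, threshold=0.3))
```

A word that appears equally often in both classes, like "subway" here, carries no gain and is dropped, while a word concentrated in one class is kept.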
S13: calculating the feature value of each feature word, and representing each text as a feature value vector;
Term frequency-inverse document frequency (TF-IDF) is a feature value calculation method that considers both how often a word occurs in one text (TF) and how often it occurs across the other texts (IDF). It is calculated as follows:
$$TF\text{-}IDF(\text{word}) = TF(\text{word}) \times IDF(\text{word})$$
in the formula: TF(word) is the frequency with which a word appears in one text, and IDF(word) is the inverse document frequency of the word in the text set;
$$TF(\text{word}) = \frac{W(\text{word})}{W}, \qquad IDF(\text{word}) = \log\frac{F}{F(\text{word})}$$
in the formula: W(word) is the number of times the word appears in a text, W is the total number of words in the text, F is the total number of words in the training sample, and F(word) is the number of times the word appears in the training sample.
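A minimal sketch of turning one text into a feature value vector follows. It assumes TF = W(word)/W and IDF = log(F/F(word)) with a natural logarithm, following the symbol definitions given above; the exact form and log base used in the patent are not confirmed by the text, and many TF-IDF variants count documents rather than words for the IDF term. The function name and the sample data are hypothetical.

```python
from math import log


def tf_idf_vector(text_tokens, feature_words, corpus_tokens):
    """Feature vector of one text: one TF-IDF value per feature word."""
    words_in_text = len(text_tokens)        # W
    words_in_corpus = len(corpus_tokens)    # F
    vector = []
    for word in feature_words:
        tf = text_tokens.count(word) / words_in_text if words_in_text else 0.0
        corpus_count = corpus_tokens.count(word)   # F(word)
        idf = log(words_in_corpus / corpus_count) if corpus_count else 0.0
        vector.append(tf * idf)
    return vector


corpus = ["subway", "noise", "subway", "seat"]   # all words of the training sample
text = ["subway", "noise"]
vec = tf_idf_vector(text, ["noise", "seat"], corpus)
print(vec)
```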
S14: constructing a support vector machine classifier from the training samples, and evaluating it by classifying the test samples;
Based on the test result for each test sample, the training samples are expanded to cover more types of noise, which improves the classifier.
S15: classifying the data into demand-relevant texts and irrelevant texts using the support vector machine classifier obtained in S14, and removing the irrelevant texts.
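To make steps S14 and S15 concrete, here is a small sketch of a linear support vector machine trained by sub-gradient descent on the regularized hinge loss. The source only states that a support vector machine classifier is used; the linear kernel, the training procedure, the hyperparameters and the toy two-dimensional vectors are all assumptions made for illustration. In practice a library implementation would normally be used instead.

```python
def train_linear_svm(samples, labels, epochs=20, lr=0.1, reg=0.01):
    """Sub-gradient descent on the regularized hinge loss; labels are +1 or -1."""
    w, b = [0.0] * len(samples[0]), 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            w = [wi * (1 - lr * reg) for wi in w]      # shrink weights (regularization)
            if margin < 1:                             # hinge loss is active for this sample
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                b += lr * y
    return w, b


def predict(w, b, x):
    """Return +1 (demand-relevant) or -1 (irrelevant)."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else -1


def keep_relevant(texts, vectors, w, b):
    """Step S15: drop the texts classified as irrelevant."""
    return [t for t, v in zip(texts, vectors) if predict(w, b, v) == 1]


# Toy feature vectors: dimension 0 tracks a "relevant" feature word,
# dimension 1 an "irrelevant" one (hypothetical values).
samples = [[1.0, 0.0], [0.8, 0.1], [0.0, 1.0], [0.1, 0.9]]
labels = [1, 1, -1, -1]
w, b = train_linear_svm(samples, labels)
print(keep_relevant(["text A", "text B"], [[0.9, 0.0], [0.0, 0.8]], w, b))
```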
Step 4: performing relevance clustering on the texts filtered in step 3, using K-means clustering corrected by the silhouette (contour) coefficient;
K-means clustering is performed on the data filtered in step 3, and the optimal number of clusters k is determined from the silhouette coefficient.
The K-means clustering process is as follows:
K-means groups texts according to the distance between them; the distance represents the degree of correlation between texts and is measured by Euclidean distance. The sum of squared distances $dist(S_k)$ from each point in a cluster to the cluster center is:
$$dist(S_k) = \sum_{i=1}^{n_s} \lVert x_i - u_k \rVert^2$$
in the formula: $S_k$ is the text set of each cluster, $x_i$ is the feature value vector of a text in cluster $S_k$, $n_s$ is the number of texts in cluster $S_k$, $u_k$ is the center of cluster $S_k$, and i is the index of a text within the cluster;
wherein $u_k$ is:
$$u_k = \frac{1}{n_s}\sum_{i=1}^{n_s} x_i$$
The objective of K-means clustering is to minimize the sum of squared distances from all samples to their cluster centers; this sum $dist(S)$ is:
$$dist(S) = \sum_{j=1}^{k} dist(S_j)$$
in the formula: k is the number of clusters, S is the whole text set, and j is the index of each cluster in the text set;
The silhouette coefficient evaluates a clustering result by combining cohesion and separation. The larger the silhouette coefficient, the better the clustering; the smaller, the worse. It is calculated as:
$$L(x_i) = \frac{b(x_i) - a(x_i)}{\max\{a(x_i),\, b(x_i)\}}$$
in the formula: $a(x_i)$ is the average distance from text $x_i$ to all other texts in its own cluster, which quantifies cohesion within the cluster; $b(x_i)$ is the average distance from $x_i$ to all texts in another cluster, where all other clusters are traversed and the smallest average distance is taken, which quantifies separation between clusters.
The number of clusters is determined from the average silhouette coefficient of the whole text set; the average silhouette coefficient $L(x)_k$ is:
$$L(x)_k = \frac{1}{n}\sum_{i=1}^{n} L(x_i)$$
in the formula: n is the number of texts in the whole text set;
When the average silhouette coefficient is largest, the corresponding cluster number k is the optimal cluster number.
FIG. 2 shows partial calculation results of the present invention. As FIG. 2 shows, the average silhouette coefficient is largest when k is 4, so K-means clustering gives its best result at that value.
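The procedure of running K-means for several values of k and keeping the one with the largest average silhouette coefficient can be sketched as follows. This is a pure-Python illustration on toy two-dimensional points; the deterministic farthest-point initialization, the candidate range of k and the data are assumptions, not details from the patent. It also assumes the points are distinct and k is at least 2.

```python
from math import dist  # requires Python 3.8+


def kmeans(points, k, iterations=50):
    """Plain K-means with Euclidean distance; returns one cluster label per point."""
    centers = [points[0]]  # farthest-point initialization (illustrative choice)
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(dist(p, c) for c in centers)))
    labels = []
    for _ in range(iterations):
        labels = [min(range(k), key=lambda j: dist(p, centers[j])) for p in points]
        new_centers = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # cluster center = mean of the member vectors
                new_centers.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new_centers.append(centers[j])
        if new_centers == centers:
            break
        centers = new_centers
    return labels


def mean_silhouette(points, labels):
    """Average of (b - a) / max(a, b) over all points; singletons contribute 0."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    total = 0.0
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            continue
        a = sum(dist(p, q) for q in own) / (len(own) - 1)
        b = min(sum(dist(p, q) for q in other) / len(other)
                for m, other in clusters.items() if m != lab)
        total += (b - a) / max(a, b)
    return total / len(points)


def best_k(points, k_values):
    """Return the k with the largest average silhouette, plus all scores."""
    scores = {k: mean_silhouette(points, kmeans(points, k)) for k in k_values}
    return max(scores, key=scores.get), scores


# Three well-separated toy groups of points.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (20, 0), (20, 1), (21, 0)]
best, scores = best_k(points, range(2, 6))
print(best, scores)
```

On real data the points would be the TF-IDF feature value vectors of the filtered texts rather than two-dimensional coordinates.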
Step 5: assigning each cluster from step 4 a label as a demand item, and calculating the importance of the demand item;
A label is extracted from each cluster as its demand item. The words in the cluster are sorted in descending order of their number of occurrences, the most frequent words are recommended to engineers, and the engineers summarize them into a label for the cluster, which is the demand item. Partial results for one cluster in this embodiment are shown in Table 2; here 'subway noise' can be chosen as the demand item.
TABLE 2 Number of occurrences of words
Rank | Word | Number of occurrences |
1 | Subway | 541 |
2 | Ear | 426 |
3 | Noise | 346 |
4 | Sound equipment | 312 |
… | … | … |
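Ranking the words of a cluster by occurrence count, as in Table 2, can be sketched with the standard library. The function name `candidate_labels` and the sample cluster are hypothetical; the final choice of label is left to the engineer, as the text describes.

```python
from collections import Counter


def candidate_labels(cluster_texts, top_n=3):
    """Count word occurrences across a cluster and return the most frequent words."""
    counts = Counter(word for tokens in cluster_texts for word in tokens)
    return counts.most_common(top_n)


# Hypothetical cluster: each inner list is one preprocessed text.
cluster = [["subway", "noise", "ear"], ["subway", "noise"], ["subway"]]
print(candidate_labels(cluster, top_n=2))
```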
The importance of a passenger demand item is measured by the relative propagation heat of the demand item. The propagation heat is calculated as:
$$r_k = \sum_{i=1}^{n_s}\left(w_1 Z_i + w_2 D_i + w_3 P_i\right)$$
in the formula: $n_s$ is the number of texts in each cluster; $Z_i$, $D_i$ and $P_i$ are the numbers of forwards, likes and comments of the i-th text in the cluster, respectively; $w_1$, $w_2$ and $w_3$ are constants; and k is the index of the cluster;
$w_1$, $w_2$ and $w_3$ are the weights of forwards, likes and comments, respectively, and satisfy $w_1+w_2+w_3=1$.
To discount repeated posting by the same user, the propagation heat is corrected by the propagation breadth $g_k = l_s/n_s$, where $l_s$ is the number of users who posted texts in each cluster. The corrected propagation heat is:
$$r'_k = r_k \times g_k$$
in the formula: $r'_k$ is the corrected propagation heat, $g_k$ is the propagation breadth, and $l_s$ is the number of users who posted texts in each cluster;
The relative propagation heat, i.e. the importance, is calculated as:
$$R_k = \frac{r'_k}{\sum_{i=1}^{S} r'_i}$$
in the formula: S is the total number of text sets, $r'_i$ is the corrected propagation heat of the i-th demand item, and i is the demand item index.
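A worked sketch of the importance calculation follows. It assumes the propagation heat is the weighted sum of forwards, likes and comments over the texts of a cluster, corrected by the breadth g_k = l_s / n_s and then normalized across clusters. The weights 0.4, 0.3, 0.3, the field names and the sample data are hypothetical.

```python
def corrected_heat(texts, w1=0.4, w2=0.3, w3=0.3):
    """Propagation heat of one cluster, corrected by the propagation breadth."""
    raw = sum(w1 * t["forwards"] + w2 * t["likes"] + w3 * t["comments"] for t in texts)
    breadth = len({t["user"] for t in texts}) / len(texts)   # l_s / n_s
    return raw * breadth


def importance(clusters):
    """Relative propagation heat of each demand item (the values sum to 1)."""
    heats = {name: corrected_heat(texts) for name, texts in clusters.items()}
    total = sum(heats.values())
    return {name: heat / total for name, heat in heats.items()}


clusters = {
    # Two posts by the same user: breadth 1/2 halves the heat.
    "subway noise": [
        {"user": "u1", "forwards": 10, "likes": 0, "comments": 0},
        {"user": "u1", "forwards": 10, "likes": 0, "comments": 0},
    ],
    # Two posts by different users: breadth 1.
    "subway wifi": [
        {"user": "u2", "forwards": 10, "likes": 0, "comments": 0},
        {"user": "u3", "forwards": 10, "likes": 0, "comments": 0},
    ],
}
result = importance(clusters)
print(result)
```

With identical engagement counts, the cluster whose posts come from a single user ends up with half the corrected heat of the other, which is the effect the breadth correction is meant to have.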
Step 6: judging whether the demand item obtained in step 5 already exists in the demand lexicon; if so, exiting; otherwise, judging whether the importance and the relative propagation persistence of the demand item both meet preset thresholds; if so, adding the demand item to the demand lexicon; if not, exiting.
The acquired demand items are evaluated by their propagation heat and propagation persistence to decide whether they are new demands. Demand items that are not in the demand lexicon may appear among the acquired passenger demand items; these are assessed by propagation heat and propagation persistence to decide whether they can be added to the demand lexicon as new demand items. This mainly involves two steps:
1) matching the acquired demand item against the existing demand items in the demand lexicon to determine whether it is absent from the lexicon;
2) comparing the relative propagation heat and the relative propagation persistence of the demand item with preset thresholds. Propagation persistence measures how long the propagation of a new demand lasts.
The propagation persistence $j_k$ is calculated from $r'_{k0}$, $r'_{k1}$ and $r'_{k2}$, the propagation heat values acquired in three consecutive time periods, where $r'_{k0}$ is the propagation heat obtained in the current acquisition. Acquisition is dynamic, i.e. data is automatically retrieved from the social network platform at intervals. The current acquisition is the period in which the emerging, potential demand is discovered, and $r'_{k1}$ and $r'_{k2}$ are the propagation heat values of the first and second subsequent acquisition periods, respectively.
The relative propagation persistence $J_k$ is:
$$J_k = \frac{j_k}{\sum_{i=1}^{S} j_i}$$
in the formula: S is the total number of text sets, $j_i$ is the propagation persistence of the i-th demand item, and i is the demand item index.
When both the relative propagation heat and the relative propagation persistence of a demand item exceed the set thresholds, the item is treated as a candidate new demand and is then judged manually.
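The decision logic of step 6 can be sketched as below. The relative propagation heat and relative propagation persistence are taken as precomputed inputs, and the threshold values, the function name and the sample lexicon are hypothetical. The manual confirmation by an engineer is outside the sketch.

```python
def is_candidate_new_demand(item, lexicon, relative_heat, relative_persistence,
                            heat_threshold, persistence_threshold):
    """True if the item is absent from the lexicon and exceeds both thresholds."""
    if item in lexicon:
        return False          # already known: nothing to add
    return relative_heat > heat_threshold and relative_persistence > persistence_threshold


lexicon = {"subway speed", "subway wifi", "subway stability"}

# Hypothetical values for a newly observed demand item.
if is_candidate_new_demand("subway noise", lexicon, 0.30, 0.25,
                           heat_threshold=0.2, persistence_threshold=0.2):
    # In the described workflow an engineer would confirm before this update.
    lexicon.add("subway noise")
print(sorted(lexicon))
```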
According to the method, a dynamic subway passenger demand acquisition system can be constructed, comprising a data acquisition module, a text preprocessing module, a text filtering module, a text clustering module, a demand extraction module, a new demand evaluation module and a demand lexicon. The system further comprises a demand checking module and a demand lexicon management module, through which engineers use and maintain the system.
The demand lexicon stores demand items related to subway vehicle passenger demand, specifically words related to that demand.
The data acquisition module acquires text data from the social network platform, using the demand items in the demand lexicon as keywords to capture relevant text data. In addition, data can be acquired in real time by setting the acquisition frequency of the module.
The text preprocessing module preprocesses the acquired texts: it preliminarily filters them according to the filtering method described above, performs word segmentation and part-of-speech tagging on the filtered texts, and filters out words without substantive meaning based on the stop-word list and part of speech.
The text filtering module filters out texts irrelevant to passenger demand: it obtains feature words that can identify the text type by information gain feature selection, computes their feature values by TF-IDF, vectorizes the feature values of each text, and feeds the vectors to the support vector machine classifier, which outputs the filtered texts.
The text clustering module performs relevance clustering on the filtered text data, using the K-means clustering algorithm with the number of clusters determined by the average silhouette (contour) coefficient.
The demand extraction module extracts the demand item of each cluster: it counts the frequency of each word in each cluster and recommends the more frequent words to the engineer, who names the demand item. The importance of the demand item is determined by the relative propagation heat.
The new demand evaluation module judges whether a demand item is contained in the demand lexicon and updates the lexicon: it compares the relative propagation heat and relative propagation persistence of the demand item with the set thresholds, recommends the item to an engineer when the thresholds are met, and, if the engineer confirms it as a new demand, stores it in the demand lexicon.
The passenger demand extraction and evaluation system may further be provided with a demand viewing module and a demand lexicon management module. The demand viewing module uses a display to provide a visual interface for extracting, evaluating and viewing passenger demands. The demand extraction and evaluation process is the same as in the corresponding steps and is not described again. In addition, the demand items extracted by the demand extraction module and their calculated importance are displayed in graph form, for example, for a subway vehicle, in the form shown in Fig. 4. In that graph, curve A is the importance curve of subway Wi-Fi, curve B is the importance curve of subway stability, curve C is the importance curve of subway speed, and curve D is the importance curve of subway noise.
The demand lexicon management module maintains the demand lexicon, continuously enriching it with newly acquired demands and supporting the modification and deletion of demands.
Traditional methods for acquiring the passenger demands of subway vehicles not only consume a large amount of manpower but are also highly subjective. The invention therefore provides a dynamic subway passenger demand acquisition method and system based on a social network platform. Text mining techniques from data mining are applied to the texts of social network platform users that reflect passenger demands on subway vehicles. Compared with traditional methods, the method can automatically analyze a large volume of user texts, acquire potential passenger demands, improve the efficiency of user data acquisition and reduce subjective influence. The dynamic demands of a massive number of users can be analyzed in real time, passenger demand preferences can be captured continuously, and the importance of valid passenger demands can be extracted. In addition, emerging and potential user demands can be discovered automatically in real time and used as driving factors for the research and development of subway vehicles.
Claims (6)
1. A dynamic subway passenger demand acquisition method is characterized by comprising the following steps:
step 1: constructing a demand word bank, and acquiring user text data from a social network platform according to the demand word bank;
step 2: preprocessing the data acquired in the step 1;
step 3: filtering out texts irrelevant to subway passenger demands using a support vector machine classifier, specifically:
S11: randomly sampling the texts preprocessed in step 2 to generate training samples and test samples;
S12: determining relevant texts and irrelevant texts from the training samples, respectively determining the feature words of the relevant texts and the irrelevant texts, calculating the information entropy of the training samples and the information gain value of each word, and taking the words whose gain value is larger than a set threshold as the feature words;
the information entropy IG(X) of the training samples is calculated as follows:
in the formula: $X$ is the training sample set, and $N_1$ and $N_2$ are the number of relevant texts and the number of irrelevant texts, respectively;
the information gain value IG(word) of each word is calculated as follows:
in the formula: word is a word in the training sample set, $A$ and $B$ are the frequencies with which the word occurs in the relevant texts and the irrelevant texts, respectively, and $C$ and $D$ are the frequencies with which the word does not occur in the relevant texts and the irrelevant texts, respectively;
S13: calculating the feature value of each feature word in each text, and representing the text as a feature value vector;
S14: constructing a support vector machine classifier from the training samples, and refining the classifier using the test samples;
S15: classifying the data into demand-relevant texts and irrelevant texts using the support vector machine classifier obtained in step S14, and removing the irrelevant texts;
step 4: performing relevance clustering on the texts filtered in step 3 using a K-means clustering method corrected by the contour coefficient;
step 5: assigning each cluster from step 4 a label as a demand item, and calculating the importance of the demand item;
step 6: judging whether the demand item obtained in step 5 exists in the demand lexicon; if so, exiting; otherwise, judging whether the importance and the relative propagation persistence of the demand item both meet preset thresholds; if so, a new demand item has been found and is added to the demand lexicon; if not, exiting;
the relative propagation persistence is calculated as follows:
S31: the propagation persistence $j_k$ is:
in the formula: $r'_{k0}$, $r'_{k1}$ and $r'_{k2}$ are the propagation heats acquired in three consecutive time periods, where $r'_{k0}$ is the currently obtained propagation heat;
S32: the relative propagation persistence $J_k$ is:
in the formula: $S$ is the total number of text sets, $j_i$ is the propagation persistence of the $i$-th demand item, and $i$ is the demand item label.
2. The method for dynamically acquiring the demand of subway passengers as claimed in claim 1, wherein the data acquisition process of step 1 is as follows:
searching the social network platform using the words in the demand lexicon as keywords to obtain user texts, the text data being acquired by a web crawler.
3. The method for dynamically acquiring the demand of subway passengers as claimed in claim 1, wherein the K-means clustering method corrected by the contour coefficient in step 4 first clusters by K-means and then determines the optimal number of clusters $k$ from the contour coefficient;
the K-means clustering process is as follows:
determining the sum of squared distances $\mathrm{dist}(S_k)$ from each point in a cluster to the cluster center:
$$\mathrm{dist}(S_k) = \sum_{i=1}^{n_s} \lVert x_i - u_k \rVert^2$$
in the formula: $S_k$ is the text set of each cluster, $x_i$ is the feature value vector of a text in cluster $S_k$, $n_s$ is the number of texts in cluster $S_k$, $u_k$ is the cluster center of cluster $S_k$, and $i$ is the text label within the cluster;
wherein $u_k$ is:
$$u_k = \frac{1}{n_s} \sum_{i=1}^{n_s} x_i$$
the sum of squared distances $\mathrm{dist}(S)$ from all samples in the cluster domain to their cluster centers is:
$$\mathrm{dist}(S) = \sum_{j=1}^{k} \mathrm{dist}(S_j)$$
in the formula: $k$ is the number of clusters, $S$ is the total text set, and $j$ is the label of each cluster in the text set;
the contour coefficient $L(x_i)$ is:
$$L(x_i) = \frac{b(x_i) - a(x_i)}{\max\{a(x_i),\, b(x_i)\}}$$
in the formula: $a(x_i)$ is the average distance from text $x_i$ to all other texts in the same cluster, and $b(x_i)$ is the average distance from text $x_i$ to all texts in the other cluster;
the average contour coefficient $L(x)_k$ is:
$$L(x)_k = \frac{1}{n} \sum_{i=1}^{n} L(x_i)$$
in the formula: $n$ is the number of texts in the whole text set;
when the average contour coefficient is maximal, the corresponding cluster number $k$ is the optimal number of clusters.
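Selecting the cluster number by the average contour coefficient can be sketched in plain Python. This is a minimal illustration, not the patent's implementation: it assumes Euclidean distance, a deterministic farthest-first initialization of the cluster centers, and takes the nearest other cluster when computing b; all function names and the sample points are hypothetical.

```python
import math

def distance(p, q):
    """Euclidean distance between two feature value vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def init_centers(points, k):
    """Farthest-first initialization (an illustrative choice)."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(distance(p, c) for c in centers)))
    return centers

def kmeans(points, k, max_iters=100):
    """Lloyd's algorithm; returns one cluster label per point."""
    centers = init_centers(points, k)
    labels = None
    for _ in range(max_iters):
        new_labels = [min(range(k), key=lambda j: distance(p, centers[j])) for p in points]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        for j in range(k):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:  # an empty cluster keeps its previous center
                centers[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels

def average_silhouette(points, labels):
    """Mean of (b - a) / max(a, b) over all points."""
    clusters = {}
    for p, l in zip(points, labels):
        clusters.setdefault(l, []).append(p)
    if len(clusters) < 2:
        return 0.0  # undefined for a single cluster; treated as 0 here
    total = 0.0
    for p, l in zip(points, labels):
        own = clusters[l]
        if len(own) == 1:
            continue  # convention: a singleton cluster contributes 0
        a = sum(distance(p, q) for q in own) / (len(own) - 1)
        b = min(sum(distance(p, q) for q in other) / len(other)
                for m, other in clusters.items() if m != l)
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(points)

def best_k(points, k_values):
    """Return the k with the largest average silhouette, plus all scores."""
    scores = {k: average_silhouette(points, kmeans(points, k)) for k in k_values}
    return max(scores, key=scores.get), scores

# Three well-separated groups of illustrative 2-D feature vectors.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (20, 0), (20, 1), (21, 0)]
k, scores = best_k(points, [2, 3, 4])
print(k)
```

For real TF-IDF vectors a library implementation would normally be preferred; the sketch only shows how the average coefficient is compared across candidate values of k.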
4. The method of claim 1, wherein the importance calculation process of step 5 comprises the following steps:
S21: the propagation heat $r_k$ is:
in the formula: $n_s$ is the number of texts in each cluster, $Z_i$ is the number of forwards of the $i$-th text in each cluster, $D_i$ is the number of likes of the $i$-th text in each cluster, $P_i$ is the number of comments on the $i$-th text in each cluster, $w_1$, $w_2$ and $w_3$ are constants, and $k$ is the cluster number;
S22: the propagation heat is corrected by the propagation extent:
$r'_k = r_k \times g_k$
in the formula: $r'_k$ is the corrected propagation heat, $g_k$ is the propagation extent, $g_k = l_s / n_s$, and $l_s$ is the number of users who posted texts in each cluster;
S23: the importance $R_k$ is calculated as follows:
in the formula: $S$ is the total number of text sets, $r'_i$ is the corrected propagation heat of the $i$-th demand item, and $i$ is the demand item label.
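The importance calculation can be illustrated with a short sketch. Only the correction $r'_k = r_k \times g_k$ with $g_k = l_s / n_s$ is stated explicitly in the text; the sketch additionally assumes that the propagation heat is the weighted sum of forwards, likes and comments over the texts of a cluster, and that the importance is each cluster's corrected heat divided by the sum over all clusters. The field names, default weights and sample data are hypothetical, and clusters are assumed to be non-empty.

```python
def propagation_heat(texts, w1=1.0, w2=1.0, w3=1.0):
    """Assumed form: weighted sum of forwards, likes and comments in a cluster."""
    return sum(w1 * t["forwards"] + w2 * t["likes"] + w3 * t["comments"]
               for t in texts)

def corrected_heat(texts, **weights):
    """r'_k = r_k * g_k, with g_k = l_s / n_s (distinct users / number of texts)."""
    n_s = len(texts)
    l_s = len({t["user"] for t in texts})
    return propagation_heat(texts, **weights) * l_s / n_s

def importance(clusters, **weights):
    """Assumed form: corrected heat of each cluster normalized over all clusters."""
    heats = {k: corrected_heat(texts, **weights) for k, texts in clusters.items()}
    total = sum(heats.values())
    return {k: (h / total if total else 0.0) for k, h in heats.items()}

clusters = {
    "wifi": [
        {"user": "u1", "forwards": 4, "likes": 10, "comments": 6},
        {"user": "u2", "forwards": 0, "likes": 5, "comments": 5},
    ],
    "noise": [
        {"user": "u3", "forwards": 2, "likes": 3, "comments": 5},
        {"user": "u3", "forwards": 1, "likes": 4, "comments": 5},
    ],
}
print(importance(clusters))
```

In this example both "noise" texts come from the same user, so the propagation extent halves that cluster's heat before normalization.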
5. The method of claim 1, wherein the feature values in step S13 are measured by the term frequency-inverse document frequency (TF-IDF), which is calculated as follows:
TF-IDF(word) = TF(word) × IDF(word)
in the formula: TF(word) is the frequency with which a word appears in one text, and IDF(word) is the inverse document frequency of the word in the text set;
in the formula: w(word) is the number of times a word appears in a text, W is the total number of words in the text in which the word is located, F is the total number of words in the training samples, and F(word) is the number of times the word appears in the training samples.
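A feature value vector built from these quantities might look like the sketch below. The TF part follows the definitions above (occurrences in a text divided by the total words of that text); the IDF part assumes the form log(F / F(word)) with a natural logarithm, which the surviving text does not confirm. The function name and sample texts are illustrative.

```python
import math
from collections import Counter

def tf_idf_vectors(texts, feature_words):
    """Represent each tokenized text as a vector of TF-IDF feature values."""
    corpus_counts = Counter(word for text in texts for word in text)
    total_words = sum(corpus_counts.values())  # F: total words in the sample
    vectors = []
    for text in texts:
        counts = Counter(text)
        vector = []
        for word in feature_words:
            tf = counts[word] / len(text) if text else 0.0
            # Assumed IDF form: log(F / F(word)); 0 if the word never occurs.
            idf = (math.log(total_words / corpus_counts[word])
                   if corpus_counts[word] else 0.0)
            vector.append(tf * idf)
        vectors.append(vector)
    return vectors

texts = [["wifi", "slow", "wifi"], ["noise", "loud"], ["wifi", "noise", "seat"]]
vectors = tf_idf_vectors(texts, feature_words=["wifi", "noise"])
print(vectors)
```

Each resulting vector has one entry per feature word and could serve as the input to the classifier of step S14.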
6. An acquisition system for the dynamic subway passenger demand acquisition method according to any one of claims 1 to 5, comprising a data acquisition module, a text preprocessing module, a text filtering module, a text clustering module, a demand extraction module, a new demand evaluation module and a demand lexicon;
the demand lexicon is used for storing demand items related to the passenger demands of the subway vehicles;
the data acquisition module is used for acquiring text data in the social network platform;
the text preprocessing module is used for preprocessing the acquired text;
the text filtering module is used for filtering out texts irrelevant to passenger requirements from the texts;
the text clustering module is used for performing relevance clustering on the filtered text data;
the demand extraction module is used for extracting demand items in each cluster;
and the new demand evaluation module is used for judging whether the demand item is contained in the demand lexicon and updating the demand lexicon.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910561357.XA CN110347828B (en) | 2019-06-26 | 2019-06-26 | Subway passenger demand dynamic acquisition method and acquisition system thereof |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910561357.XA CN110347828B (en) | 2019-06-26 | 2019-06-26 | Subway passenger demand dynamic acquisition method and acquisition system thereof |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110347828A CN110347828A (en) | 2019-10-18 |
CN110347828B true CN110347828B (en) | 2022-03-15 |
Family
ID=68183218
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910561357.XA Active CN110347828B (en) | 2019-06-26 | 2019-06-26 | Subway passenger demand dynamic acquisition method and acquisition system thereof |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110347828B (en) |
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114297401A (en) * | 2021-12-14 | 2022-04-08 | 中航机载系统共性技术有限公司 | System knowledge extraction method based on clustering algorithm |
CN114445141A (en) * | 2022-01-26 | 2022-05-06 | 西南交通大学 | Customer demand obtaining method |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107909478A (en) * | 2017-11-27 | 2018-04-13 | 苏州点对点信息科技有限公司 | FOF mutual fund portfolio system and methods based on social network clustering and information gain entropy index |
CN107908753A (en) * | 2017-11-20 | 2018-04-13 | 合肥工业大学 | Customer demand method for digging and device based on social media comment data |
Family Cites Families (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2011072125A2 (en) * | 2009-12-09 | 2011-06-16 | Zemoga, Inc. | Method and apparatus for real time semantic filtering of posts to an internet social network |
US20130080212A1 (en) * | 2011-09-26 | 2013-03-28 | Xerox Corporation | Methods and systems for measuring engagement effectiveness in electronic social media |
CN103678564B (en) * | 2013-12-09 | 2017-02-15 | 国家计算机网络与信息安全管理中心 | Internet product research system based on data mining |
CN104484343B (en) * | 2014-11-26 | 2017-11-03 | 无锡清华信息科学与技术国家实验室物联网技术中心 | It is a kind of that method of the motif discovery with following the trail of is carried out to microblogging |
CN108388660B (en) * | 2018-03-08 | 2021-10-01 | 中国计量大学 | Improved E-commerce product pain point analysis method |
CN109165996B (en) * | 2018-07-18 | 2022-02-11 | 浙江大学 | Product functional feature importance analysis method based on online user comments |
CN109829166B (en) * | 2019-02-15 | 2022-12-27 | 重庆师范大学 | People and host customer opinion mining method based on character-level convolutional neural network |
Also Published As
Publication number | Publication date |
---|---|
CN110347828A (en) | 2019-10-18 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110837931B (en) | Customer churn prediction method, device and storage medium | |
US10579646B2 (en) | Systems and methods for classifying electronic documents | |
CN103927675B (en) | Judge the method and device of age of user section | |
CN108733816B (en) | Microblog emergency detection method | |
Coussement et al. | Improving customer complaint management by automatic email classification using linguistic style features as predictors | |
EP3279804A1 (en) | Data analysis system, data analysis method, data analysis program, and recording medium | |
CN103309862B (en) | Webpage type recognition method and system | |
CN108038627B (en) | Object evaluation method and device | |
CN104077407B (en) | A kind of intelligent data search system and method | |
CN108550054B (en) | Content quality evaluation method, device, equipment and medium | |
CN110347828B (en) | Subway passenger demand dynamic acquisition method and acquisition system thereof | |
CN105225135B (en) | Potential customer identification method and device | |
CN108363748B (en) | Topic portrait system and topic portrait method based on knowledge | |
CN111079941A (en) | Credit information system combining expert experience model and supervised machine learning algorithm | |
CN110134799A (en) | A kind of text corpus based on BM25 algorithm build and optimization method | |
CN112418695A (en) | Multi-dimensional portrait construction method and recommendation method for scientific researchers in tobacco field | |
CN115099310A (en) | Method and device for training model and classifying enterprises | |
Ge et al. | Measure and Mitigate the Dimensional Bias in Online Reviews and Ratings. | |
CN112632958A (en) | Contract document examination and analysis method based on contract knowledge base | |
CN111625578A (en) | Feature extraction method suitable for time sequence data in cultural science and technology fusion field | |
CN114445141A (en) | Customer demand obtaining method | |
Han et al. | User requirements dynamic elicitation of complex products from social network service | |
Liu et al. | Identification of subway track irregularities based on detection data of portable detector | |
JP6320353B2 (en) | Digital marketing system | |
KR102078541B1 (en) | Issue interest based news value evaluation apparatus and method, storage media storing the same |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||