CN110347828B - Subway passenger demand dynamic acquisition method and acquisition system thereof - Google Patents

Info

Publication number
CN110347828B
Authority
CN
China
Prior art keywords
text
word
demand
cluster
texts
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910561357.XA
Other languages
Chinese (zh)
Other versions
CN110347828A (en
Inventor
黎荣
黎伟洋
王建
丁国富
张义军
韩鑫
郑宇飞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Southwest Jiaotong University
Original Assignee
Southwest Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Southwest Jiaotong University
Priority to CN201910561357.XA
Publication of CN110347828A
Application granted
Publication of CN110347828B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9535 Search customisation based on user profiles and personalisation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G06F18/232 Non-hierarchical techniques
    • G06F18/2321 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2411 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines

Abstract

The invention discloses a dynamic subway passenger demand acquisition method and an acquisition system thereof. The method comprises the following steps: step 1: constructing a demand lexicon and acquiring user text data from a social network platform; step 2: preprocessing the acquired data; step 3: filtering out texts irrelevant to subway passenger demand with a support vector machine classifier; step 4: performing relevance clustering; step 5: for each cluster, assigning a label as a demand item and calculating the importance of the demand item; step 6: judging whether the demand item already exists in the demand lexicon; if so, exiting; otherwise, judging whether the importance and the relative propagation persistence of the demand item both meet preset thresholds; if so, a new demand item has been found and is added to the demand lexicon; if not, exiting. The invention can process a large amount of user text, improves the efficiency of user demand acquisition, and has low subjectivity; demand preferences and potential user demands can be acquired from massive user messages in real time.

Description

Subway passenger demand dynamic acquisition method and acquisition system thereof
Technical Field
The invention relates to the acquisition of subway passenger demand, and in particular to a dynamic subway passenger demand acquisition method and a dynamic subway passenger demand acquisition system.
Background
Over the last 10 years, the transportation capacity of railways has gradually increased, and passenger turnover has grown with it. The growth in passenger capacity and turnover on urban and high-speed railways has further increased the density of the rail transit network and the number of orders for subway vehicles. This brings both opportunities and challenges for subway vehicle manufacturers. The customers of rail vehicle manufacturers include both operating companies and passengers. At present, however, manufacturers mainly attend to the requirements of operating companies and lack analysis of passenger demand, which lowers end customers' satisfaction with their products and hinders the manufacturers' market competitiveness.
Passenger demand, including the demand items and their importance, changes dynamically over time. When existing acquisition methods such as questionnaires are used to acquire dynamic passenger demand, they consume a large amount of manpower and are highly subjective, which restricts rail vehicle manufacturers' analysis of passenger demand.
Disclosure of Invention
The invention provides a dynamic subway passenger demand acquisition method and system with high data acquisition efficiency and low subjectivity.
The technical scheme adopted by the invention is as follows: a dynamic subway passenger demand acquisition method comprising the following steps:
step 1: constructing a demand lexicon, and acquiring user text data from a social network platform according to the demand lexicon;
step 2: preprocessing the data acquired in step 1;
step 3: filtering out texts irrelevant to subway passenger demand with a support vector machine classifier;
step 4: performing relevance clustering on the text filtered in step 3 with a silhouette-coefficient-corrected K-means clustering method;
step 5: for each cluster in step 4, assigning a label as a demand item, and calculating the importance of the demand item;
step 6: judging whether the demand item obtained in step 5 already exists in the demand lexicon; if so, exiting; otherwise, judging whether the importance and the relative propagation persistence of the demand item both meet preset thresholds; if so, a new demand item has been found and is added to the demand lexicon; if not, exiting.
Further, the data acquisition process in step 1 is as follows:
the words in the demand lexicon are used as keywords to search the social network platform for user texts, and the text data is acquired through a web crawler.
Further, the specific process of step 3 is as follows:
S11: randomly sampling the text preprocessed in step 2 to generate training samples and test samples;
S12: determining the relevant texts and irrelevant texts in the training samples, calculating the information entropy of the training samples and the information gain of each word, and taking words whose information gain exceeds a set threshold as the feature words of the relevant and irrelevant texts;
the information entropy IG(X) of the training samples is calculated as follows:
$$IG(X) = -\frac{N_1}{N_1+N_2}\log_2\frac{N_1}{N_1+N_2} - \frac{N_2}{N_1+N_2}\log_2\frac{N_2}{N_1+N_2}$$
in the formula: X is the training sample set, and $N_1$ and $N_2$ are the number of relevant texts and the number of irrelevant texts, respectively;
the information gain IG(word) of each word is calculated as follows:
$$IG(word) = IG(X) + \frac{A+B}{N_1+N_2}\left(\frac{A}{A+B}\log_2\frac{A}{A+B} + \frac{B}{A+B}\log_2\frac{B}{A+B}\right) + \frac{C+D}{N_1+N_2}\left(\frac{C}{C+D}\log_2\frac{C}{C+D} + \frac{D}{C+D}\log_2\frac{D}{C+D}\right)$$
in the formula: word is a word in the training sample set; A and B are the frequencies with which the word occurs in the relevant texts and the irrelevant texts, respectively; C and D are the frequencies with which the word does not occur in the relevant texts and the irrelevant texts, respectively;
S13: calculating the feature value of each feature word in each text, and expressing the text as a feature value vector;
S14: constructing a support vector machine classifier from the training samples, and refining the classifier using the test samples;
S15: classifying the data into demand-relevant texts and irrelevant texts with the support vector machine classifier obtained in step S14, and removing the irrelevant texts.
Further, the silhouette-coefficient-corrected K-means clustering method in step 4 first performs K-means clustering and then determines the optimal number of clusters K through the silhouette coefficient;
the K-means clustering process is as follows:
determining the sum of squared distances $dist(S_k)$ from each point in a cluster to the cluster center:
$$dist(S_k) = \sum_{i=1}^{n_s} \lVert x_i - u_k \rVert^2$$
in the formula: $S_k$ is the text set of each cluster, $x_i$ is the feature value vector of a text in cluster $S_k$, $n_s$ is the number of texts in cluster $S_k$, $u_k$ is the cluster center of cluster $S_k$, and i is the index of a text in the cluster;
wherein $u_k$ is as follows:
$$u_k = \frac{1}{n_s}\sum_{i=1}^{n_s} x_i$$
the sum of squared distances dist(S) from all samples to their cluster centers is:
$$dist(S) = \sum_{j=1}^{K} dist(S_j)$$
in the formula: K is the number of clusters, S is the whole text set, and j is the index of each cluster in the text set;
the silhouette coefficient $L(x_i)$ is as follows:
$$L(x_i) = \frac{b(x_i) - a(x_i)}{\max\{a(x_i),\, b(x_i)\}}$$
in the formula: $a(x_i)$ is the average distance from text $x_i$ to all other texts in the same cluster, and $b(x_i)$ is the average distance from text $x_i$ to all texts in another cluster;
the average silhouette coefficient $L(x)_k$ is:
$$L(x)_k = \frac{1}{N}\sum_{i=1}^{N} L(x_i)$$
in the formula: N is the number of texts in the whole text set;
when the average silhouette coefficient is at its maximum, the corresponding cluster number k is the optimal cluster number.
Further, the importance in step 5 is calculated as follows:
S21: the propagation heat $r_k$ is as follows:
$$r_k = \sum_{i=1}^{n_s}\left(w_1 Z_i + w_2 D_i + w_3 P_i\right)$$
in the formula: $n_s$ is the number of texts in each cluster, $Z_i$ is the number of forwards of the i-th text in the cluster, $D_i$ is the number of likes of the i-th text in the cluster, $P_i$ is the number of comments on the i-th text in the cluster, $w_1$, $w_2$ and $w_3$ are constants, and k is the cluster index;
S22: the propagation heat is corrected by the propagation breadth:
$$r'_k = r_k \times g_k$$
in the formula: $r'_k$ is the corrected propagation heat, $g_k$ is the propagation breadth, $g_k = l_s / n_s$, and $l_s$ is the number of users who posted texts in each cluster;
S23: the importance $R_k$ is calculated as follows:
$$R_k = \frac{r'_k}{\sum_{i=1}^{S} r'_i}$$
in the formula: S is the total number of text sets, $r'_i$ is the corrected propagation heat of the i-th demand item, and i is the demand item index.
Further, the relative propagation persistence in step 6 is calculated as follows:
S31: the propagation persistence $j_k$ is as follows:
Figure BDA0002108367190000035
in the formula: $r'_{k0}$, $r'_{k1}$ and $r'_{k2}$ are the propagation heats acquired in three consecutive time periods, where $r'_{k0}$ is the propagation heat of the current acquisition;
S32: the relative propagation persistence $J_k$ is:
$$J_k = \frac{j_k}{\sum_{i=1}^{S} j_i}$$
in the formula: S is the total number of text sets, $j_i$ is the propagation persistence of the i-th demand item, and i is the demand item index.
Further, in step S13, the feature value is measured by term frequency-inverse document frequency (TF-IDF), which is calculated as follows:
$$TF\text{-}IDF(word) = TF(word) \times IDF(word)$$
in the formula: TF(word) is the frequency with which a word appears in one text, and IDF(word) is the inverse document frequency of the word in the text set;
wherein:
$$TF(word) = \frac{W(word)}{W}, \qquad IDF(word) = \log\frac{F}{F(word)}$$
in the formula: W(word) is the number of times the word appears in a text, W is the total number of words in the text, F is the total number of words in the training sample, and F(word) is the number of times the word appears in the training sample.
A dynamic subway passenger demand acquisition system, characterized by comprising a data acquisition module, a text preprocessing module, a text filtering module, a text clustering module, a demand extraction module, a new demand evaluation module and a demand lexicon;
the demand lexicon is used for storing demand items related to the demands of subway vehicle passengers;
the data acquisition module is used for acquiring text data from the social network platform;
the text preprocessing module is used for preprocessing the acquired text;
the text filtering module is used for filtering out texts irrelevant to passenger demand;
the text clustering module is used for performing relevance clustering on the filtered text data;
the demand extraction module is used for extracting the demand item in each cluster;
the new demand evaluation module is used for judging whether a demand item is already contained in the demand lexicon and for updating the demand lexicon.
The invention has the following beneficial effects:
(1) the invention acquires a large amount of user text through a web crawler and derives passenger demand from it, which improves the efficiency of user demand acquisition and has low subjectivity;
(2) the invention can analyze the dynamic demands of massive numbers of users in real time and continuously capture passengers' demand preferences, thereby obtaining a valid importance for each passenger demand;
(3) the invention can automatically discover emerging and potential user demands in real time.
Drawings
FIG. 1 is a schematic flow chart of the method of the present invention.
FIG. 2 is a diagram illustrating the silhouette coefficient calculation results according to an embodiment of the present invention.
FIG. 3 is a schematic diagram of the system of the present invention.
FIG. 4 is a schematic diagram illustrating a trend of a change in demand of passengers according to an embodiment of the present invention.
Detailed Description
The invention is further described with reference to the following figures and specific embodiments.
As shown in FIG. 1, a dynamic subway passenger demand acquisition method includes the following steps:
Step 1: constructing a demand lexicon, and acquiring user text data from a social network platform according to the demand lexicon;
User texts are acquired from the social network platform based on the demand lexicon. The demand lexicon is a set of words related to the demands of subway vehicle passengers, including passenger demand items, rail vehicle product names, and the like. Words in the demand lexicon, such as 'subway speed', are used as keywords to retrieve relevant user texts from the social network platform, and the text data is acquired with a web crawler. In this embodiment, the search uses keywords such as 'wifi of subway', 'speed of subway' and 'stability of subway'.
The words stored in the demand lexicon, such as passenger demand items (e.g., speed) and subway vehicle product names (e.g., subway), are predefined according to actual usage, and the content can be continuously enriched through the subsequent steps of the technical scheme of the invention.
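As an illustration of how search keywords might be derived from the demand lexicon, the following minimal sketch pairs product names with demand items. The function name `build_search_keywords` and the split of the lexicon into two lists are assumptions made for this example; the text above only states that lexicon words are used as keywords.

```python
from itertools import product

def build_search_keywords(product_names, demand_items):
    """Pair every product name with every demand item, e.g. 'subway speed'."""
    return [f"{name} {item}" for name, item in product(product_names, demand_items)]

# Hypothetical lexicon content, mirroring the keywords named in the embodiment.
keywords = build_search_keywords(["subway"], ["wifi", "speed", "stability"])
print(keywords)  # ['subway wifi', 'subway speed', 'subway stability']
```

Each keyword would then be submitted to the platform search and the result pages collected by the crawler; that part depends on the platform and is not sketched here.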
Step 2: preprocessing the data acquired in step 1;
The preprocessing comprises preliminary filtering, word segmentation, part-of-speech tagging and the like of the acquired text. It consists of the following three steps:
1) Filtering rules are formulated according to the posting characteristics of the social platform, and the text is preliminarily filtered according to these rules. The filtering rules are the basis of the preliminary filtering and are written in the form of production rules. Whether a text is filtered is decided by analyzing whether it contains noise characters (such as #).
2) Word segmentation and part-of-speech tagging are performed on the preliminarily filtered text. Word segmentation splits the text into words, and part-of-speech tagging attaches labels such as noun and verb to the segmented words.
3) Words without substantive meaning are filtered in two parts. First, stop words such as 'the' are filtered using an existing stop word list. Second, words other than nouns, verbs and adjectives, such as adverbs and pronouns, are filtered according to their part of speech.
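The three preprocessing steps can be sketched as follows. This is a minimal illustration, not the patented implementation: the noise pattern, the stop word list and the part-of-speech tag names are all hypothetical, and the input is assumed to be already segmented and tagged by some external tool.

```python
import re

# Hypothetical noise markers, stop words and part-of-speech tags (illustration only).
NOISE_PATTERN = re.compile(r"#")
STOP_WORDS = {"the", "of", "a"}
KEPT_POS = {"noun", "verb", "adj"}

def is_noisy(text):
    """Production-rule style check: IF the text contains a noise character THEN filter it."""
    return NOISE_PATTERN.search(text) is not None

def keep_content_words(tagged_tokens):
    """Drop stop words, then keep only nouns, verbs and adjectives."""
    return [word for word, pos in tagged_tokens
            if word.lower() not in STOP_WORDS and pos in KEPT_POS]

# A text that has already been segmented and tagged (step 2).
tagged = [("the", "det"), ("subway", "noun"), ("is", "verb"),
          ("very", "adv"), ("noisy", "adj")]
print(is_noisy("#ad# cheap tickets"))   # True
print(keep_content_words(tagged))       # ['subway', 'is', 'noisy']
```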
Step 3: filtering out texts irrelevant to subway passenger demand with a support vector machine classifier;
After step 1 and step 2, the noise has been preliminarily filtered, but the text still contains a large amount of noise. Such noisy text contains the keywords used for retrieval in step 1, but its main subject is not the subway vehicle. For example, the text 'the speed at which an auntie grabbed a seat on the subway startled me' does not reflect passengers' demands on subway vehicles. Filtering the noisy text can be treated as a binary text classification, which mainly includes the following steps:
S11: randomly sampling the text preprocessed in step 2 to generate training samples and test samples;
The preprocessed text is randomly sampled, and training samples and test samples are produced manually. Sampling must follow two principles: first, the samples must cover the content retrieved by every keyword in step 1; second, the number of samples drawn from the content retrieved by each keyword must be proportional to the amount of content retrieved by that keyword.
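The two sampling principles amount to proportional stratified sampling by keyword. A minimal sketch, assuming the texts are grouped per keyword in a dictionary; the function name, the `fraction` parameter and the fixed seed are hypothetical choices for this example:

```python
import random

def proportional_sample(texts_by_keyword, fraction, seed=0):
    """Draw a sample from every keyword's texts, sized in proportion to that keyword's volume."""
    rng = random.Random(seed)
    sample = {}
    for keyword, texts in texts_by_keyword.items():
        # At least one text per keyword, so that every keyword is covered.
        n = max(1, round(len(texts) * fraction))
        sample[keyword] = rng.sample(texts, min(n, len(texts)))
    return sample

corpus = {
    "subway speed": [f"speed text {i}" for i in range(20)],
    "subway wifi": [f"wifi text {i}" for i in range(10)],
}
drawn = proportional_sample(corpus, fraction=0.2)
print({k: len(v) for k, v in drawn.items()})  # {'subway speed': 4, 'subway wifi': 2}
```

The drawn texts would then be labeled by hand as relevant or irrelevant and split into training and test samples.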
S12: determining the relevant texts and irrelevant texts in the training samples, calculating the information entropy of the training samples and the information gain of each word, and taking words whose information gain exceeds a set threshold as the feature words of the relevant and irrelevant texts;
Feature words that can distinguish relevant texts from irrelevant texts, such as 'mom' and 'rob seat', are selected based on the training samples. Feature selection by information gain is used, as follows:
Information gain is a feature selection method that determines feature words according to the amount of information carried by each word. The amount of information is expressed by information entropy, which is calculated as follows:
$$IG(X) = -\frac{N_1}{N_1+N_2}\log_2\frac{N_1}{N_1+N_2} - \frac{N_2}{N_1+N_2}\log_2\frac{N_2}{N_1+N_2}$$
in the formula: X is the training sample set, and $N_1$ and $N_2$ are the number of relevant texts and the number of irrelevant texts, respectively;
the information gain IG(word) of each word is calculated as follows:
$$IG(word) = IG(X) + \frac{A+B}{N_1+N_2}\left(\frac{A}{A+B}\log_2\frac{A}{A+B} + \frac{B}{A+B}\log_2\frac{B}{A+B}\right) + \frac{C+D}{N_1+N_2}\left(\frac{C}{C+D}\log_2\frac{C}{C+D} + \frac{D}{C+D}\log_2\frac{D}{C+D}\right)$$
in the formula: word is a word in the training sample set; A and B are the frequencies with which the word occurs in the relevant texts and the irrelevant texts, respectively; C and D are the frequencies with which the word does not occur in the relevant texts and the irrelevant texts, respectively.
The words are sorted in descending order of information gain, and the words with larger values are selected as feature words. Part of the calculation results of this embodiment is shown in Table 1:
TABLE 1 Information gain ranking
Rank    Word                      Information gain
1       Robbing seat              0.9744340029
2       Mother                    0.9631205685
3       Punching card             0.8819280948
4       Sprint for acupuncture    0.8529583405
5       Transfer                  0.8329984805
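The information gain of a word can be computed from the four counts A, B, C and D defined above. The sketch below assumes the standard form of information gain (class entropy minus the entropy remaining after splitting on the presence of the word) with base-2 logarithms, treating A, B, C and D as text counts so that $N_1 = A + C$ and $N_2 = B + D$; the function names are hypothetical.

```python
from math import log2

def entropy(counts):
    """Shannon entropy (base 2) of a list of class counts."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

def information_gain(a, b, c, d):
    """a, b: relevant / irrelevant texts containing the word;
    c, d: relevant / irrelevant texts not containing it."""
    n = a + b + c + d
    h_x = entropy([a + c, b + d])                  # entropy of the class labels
    h_with = entropy([a, b]) if a + b else 0.0     # entropy among texts with the word
    h_without = entropy([c, d]) if c + d else 0.0  # entropy among texts without it
    return h_x - (a + b) / n * h_with - (c + d) / n * h_without

# A word that appears only in irrelevant texts separates the classes perfectly.
print(information_gain(0, 50, 50, 0))    # 1.0
# A word spread evenly over both classes carries no information.
print(information_gain(25, 25, 25, 25))  # 0.0
```

Words would then be ranked by this value and those above the chosen threshold kept as feature words.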
S13: calculating the feature value of each feature word, and expressing the text as a feature value vector;
Term frequency-inverse document frequency (TF-IDF) is a feature value calculation method that jointly considers the frequency of a word within a text (TF) and its frequency across other texts (IDF). It is calculated as follows:
$$TF\text{-}IDF(word) = TF(word) \times IDF(word)$$
in the formula: TF(word) is the frequency with which a word appears in one text, and IDF(word) is the inverse document frequency of the word in the text set;
wherein:
$$TF(word) = \frac{W(word)}{W}, \qquad IDF(word) = \log\frac{F}{F(word)}$$
in the formula: W(word) is the number of times the word appears in a text, W is the total number of words in the text, F is the total number of words in the training sample, and F(word) is the number of times the word appears in the training sample.
S14: constructing a support vector machine classifier from the training samples, and classifying the test samples to test it;
According to the test result of each test sample, the training samples are expanded to increase their coverage of different types of noise and improve the classifier.
S15: classifying the data into demand-relevant texts and irrelevant texts with the support vector machine classifier obtained in step S14, and removing the irrelevant texts.
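To show the shape of steps S14 and S15, the sketch below trains a small linear support vector machine by subgradient descent on the regularised hinge loss. This is a stand-in chosen to keep the example dependency-free; the text does not specify the kernel, the solver or any hyperparameters, so `lam`, `eta`, `epochs` and the toy feature vectors are all assumptions. In practice an established SVM library would normally be used.

```python
def train_linear_svm(vectors, labels, lam=0.01, eta=0.1, epochs=200):
    """Minimise the L2-regularised hinge loss by subgradient descent.

    labels must be +1 (demand-relevant) or -1 (irrelevant)."""
    w = [0.0] * len(vectors[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(vectors, labels):
            margin = y * (sum(wj * xj for wj, xj in zip(w, x)) + b)
            w = [wj * (1 - eta * lam) for wj in w]       # regularisation shrinkage
            if margin < 1:                                # hinge loss is active
                w = [wj + eta * y * xj for wj, xj in zip(w, x)]
                b += eta * y
    return w, b

def predict(w, b, x):
    """Return +1 for demand-relevant, -1 for irrelevant."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else -1

# Toy feature value vectors: [weight of a demand word, weight of a noise word].
train_vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
train_labels = [1, 1, -1, -1]
w, b = train_linear_svm(train_vectors, train_labels)

texts = {"noise is loud": [0.8, 0.0], "auntie grabbed a seat": [0.0, 0.8]}
relevant = [t for t, v in texts.items() if predict(w, b, v) == 1]
print(relevant)
```

Misclassified test samples would be added to the training samples and the classifier retrained, as described in S14.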
Step 4: performing relevance clustering on the text filtered in step 3 with the silhouette-coefficient-corrected K-means clustering method;
K-means clustering is performed on the data filtered in step 3, and the optimal number of clusters K is determined through the silhouette coefficient.
The K-means clustering process is as follows:
K-means groups texts according to the distance between them. The distance between texts represents their degree of correlation and is measured by Euclidean distance. The sum of squared distances $dist(S_k)$ from each point in a cluster to the cluster center is determined:
$$dist(S_k) = \sum_{i=1}^{n_s} \lVert x_i - u_k \rVert^2$$
in the formula: $S_k$ is the text set of each cluster, $x_i$ is the feature value vector of a text in cluster $S_k$, $n_s$ is the number of texts in cluster $S_k$, $u_k$ is the cluster center of cluster $S_k$, and i is the index of a text in the cluster;
wherein $u_k$ is as follows:
$$u_k = \frac{1}{n_s}\sum_{i=1}^{n_s} x_i$$
The objective of K-means clustering is to minimize the sum of squared distances from all samples to their cluster centers; this sum, dist(S), is:
$$dist(S) = \sum_{j=1}^{K} dist(S_j)$$
in the formula: K is the number of clusters, S is the whole text set, and j is the index of each cluster in the text set;
The silhouette coefficient measures the clustering result by combining cohesion and separation. The larger the silhouette coefficient, the better the clustering; the smaller, the worse. It is calculated as follows:
$$L(x_i) = \frac{b(x_i) - a(x_i)}{\max\{a(x_i),\, b(x_i)\}}$$
in the formula: $a(x_i)$ is the average distance from text $x_i$ to all other texts in the same cluster and quantifies the cohesion within the cluster; $b(x_i)$ is the average distance from $x_i$ to all texts in another cluster, taken as the smallest such average over all other clusters, and quantifies the separation between clusters.
The number of clusters is determined using the average silhouette coefficient of the whole text set. The average silhouette coefficient $L(x)_k$ is:
$$L(x)_k = \frac{1}{N}\sum_{i=1}^{N} L(x_i)$$
in the formula: N is the number of texts in the whole text set;
when the average silhouette coefficient is at its maximum, the corresponding cluster number k is the optimal cluster number.
FIG. 2 shows part of the calculation results of this embodiment. As can be seen from FIG. 2, the average silhouette coefficient is largest when K is 4, i.e., K-means clustering gives the best result at that value.
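The selection of K by average silhouette coefficient can be sketched as follows. The sketch runs a plain K-means for each candidate K and keeps the K with the largest average silhouette. The deterministic seeding (evenly spaced points), the convention that a singleton cluster scores 0, and the two-dimensional toy data are assumptions made for this example.

```python
from math import dist

def kmeans(points, k, max_iter=100):
    """Plain K-means; initial centers are evenly spaced points (an assumption)."""
    n = len(points)
    centers = [points[i * n // k] for i in range(k)]
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(k), key=lambda j: dist(p, centers[j])) for p in points]
        if new_labels == labels:
            break
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old center if a cluster becomes empty
                centers[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels

def mean_silhouette(points, labels):
    """Average of (b - a) / max(a, b) over all points."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    total = 0.0
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            continue  # singleton cluster: coefficient taken as 0
        a = sum(dist(p, q) for q in own) / (len(own) - 1)
        b = min(sum(dist(p, q) for q in other) / len(other)
                for other_lab, other in clusters.items() if other_lab != lab)
        total += (b - a) / max(a, b)
    return total / len(points)

def best_k(points, candidates):
    scores = {k: mean_silhouette(points, kmeans(points, k)) for k in candidates}
    return max(scores, key=scores.get), scores

# Toy feature vectors forming three well-separated groups.
points = [(0, 0), (0, 1), (1, 0), (10, 0), (10, 1), (11, 0), (0, 10), (0, 11), (1, 10)]
k, scores = best_k(points, range(2, 5))
print(k)  # 3
```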
Step 5: for each cluster in step 4, assigning a label as a demand item, and calculating the importance of the demand item;
A label is extracted from each cluster as a demand item. The words in the cluster are sorted in descending order of their number of occurrences, the most frequent words are recommended to engineers, and the engineers summarize them into a label for the cluster, i.e., the demand item. Part of the calculation results for one cluster in this embodiment is shown in Table 2, from which 'subway noise' can be selected as the demand item.
TABLE 2 Number of occurrences of words
Rank    Word               Number of occurrences
1       Subway             541
2       Ear                426
3       Noise              346
4       Sound equipment    312
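The word ranking recommended to engineers can be produced with a simple frequency count. A minimal sketch, assuming each text in the cluster is already a list of segmented words; the function name and the `top_n` parameter are hypothetical:

```python
from collections import Counter

def recommend_label_words(cluster_texts, top_n=4):
    """Count word occurrences across all texts of a cluster and return the most frequent."""
    counts = Counter(word for tokens in cluster_texts for word in tokens)
    return counts.most_common(top_n)

cluster = [["subway", "noise", "ear"], ["subway", "noise"], ["subway"]]
print(recommend_label_words(cluster))  # [('subway', 3), ('noise', 2), ('ear', 1)]
```

The engineer then summarizes the top words into a label such as 'subway noise'.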
The importance of a passenger demand item is measured by the relative propagation heat of the demand item. The propagation heat is calculated as follows:
$$r_k = \sum_{i=1}^{n_s}\left(w_1 Z_i + w_2 D_i + w_3 P_i\right)$$
in the formula: $n_s$ is the number of texts in each cluster, $Z_i$ is the number of forwards of the i-th text in the cluster, $D_i$ is the number of likes of the i-th text in the cluster, $P_i$ is the number of comments on the i-th text in the cluster, $w_1$, $w_2$ and $w_3$ are constants, and k is the cluster index;
$w_1$, $w_2$ and $w_3$ are the weights of forwards, likes and comments, respectively, and satisfy $w_1 + w_2 + w_3 = 1$.
To offset the effect of a user posting texts repeatedly, the propagation heat is corrected by the propagation breadth $g_k = l_s / n_s$, where $l_s$ is the number of users who posted texts in each cluster. The corrected propagation heat is:
$$r'_k = r_k \times g_k$$
in the formula: $r'_k$ is the corrected propagation heat, $g_k$ is the propagation breadth, and $l_s$ is the number of users who posted texts in each cluster;
The relative propagation heat, i.e., the importance, is calculated as follows:
$$R_k = \frac{r'_k}{\sum_{i=1}^{S} r'_i}$$
in the formula: S is the total number of text sets, $r'_i$ is the corrected propagation heat of the i-th demand item, and i is the demand item index.
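The importance calculation can be sketched as below. The sketch assumes that the propagation heat is the weighted sum of forwards, likes and comments over the texts of a cluster, that the correction multiplies by distinct users divided by texts, and that the importance is each cluster's share of the total corrected heat. The weights, field names and sample data are hypothetical.

```python
def corrected_heat(texts, weights):
    """Propagation heat r_k multiplied by the propagation breadth g_k = l_s / n_s."""
    w1, w2, w3 = weights
    r_k = sum(w1 * t["forwards"] + w2 * t["likes"] + w3 * t["comments"] for t in texts)
    g_k = len({t["user"] for t in texts}) / len(texts)
    return r_k * g_k

def importance(clusters, weights=(0.5, 0.25, 0.25)):
    """Relative propagation heat of each demand item (values sum to 1)."""
    heats = {name: corrected_heat(texts, weights) for name, texts in clusters.items()}
    total = sum(heats.values())
    return {name: heat / total for name, heat in heats.items()}

clusters = {
    # The same user posted twice: the breadth correction halves the heat.
    "subway noise": [
        {"user": "u1", "forwards": 10, "likes": 4, "comments": 4},
        {"user": "u1", "forwards": 10, "likes": 4, "comments": 4},
    ],
    "subway wifi": [
        {"user": "u2", "forwards": 10, "likes": 4, "comments": 4},
    ],
}
print(importance(clusters))  # {'subway noise': 0.5, 'subway wifi': 0.5}
```

In this toy data the repeated posts by one user do not raise the importance of 'subway noise' above that of 'subway wifi', which is the purpose of the breadth correction.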
Step 6: judging whether the demand item obtained in step 5 already exists in the demand lexicon; if so, exiting; otherwise, judging whether the importance and the relative propagation persistence of the demand item both meet preset thresholds; if so, adding the demand item to the demand lexicon; if not, exiting.
The obtained demand is evaluated according to the propagation heat and propagation persistence of the demand item to decide whether it is a new demand. Demand items that are not in the demand lexicon may appear among the acquired passenger demand items; they are assessed by propagation heat and propagation persistence to decide whether they can be added to the demand lexicon as new demand items. The evaluation mainly comprises two steps:
1) matching the acquired demand item against the existing demand items in the demand lexicon to determine whether it is absent from the lexicon;
2) comparing the relative propagation heat and the relative propagation persistence of the demand item with preset thresholds. The propagation persistence measures how long the propagation of a new demand lasts.
The propagation persistence $j_k$ is as follows:
Figure BDA0002108367190000092
in the formula: $r'_{k0}$, $r'_{k1}$ and $r'_{k2}$ are the propagation heats acquired in three consecutive time periods, where $r'_{k0}$ is the propagation heat of the current acquisition. Acquisition is dynamic, i.e., data is automatically retrieved from the social network platform at intervals. The current acquisition is the acquisition period in which the emerging or potential demand is discovered, and $r'_{k1}$ and $r'_{k2}$ are the propagation heats of the first and second subsequent acquisition periods, respectively.
The relative propagation persistence $J_k$ is:
$$J_k = \frac{j_k}{\sum_{i=1}^{S} j_i}$$
in the formula: S is the total number of text sets, $j_i$ is the propagation persistence of the i-th demand item, and i is the demand item index.
When the relative propagation heat and the relative propagation persistence of a new demand are both greater than the set thresholds, it is treated as a candidate new demand and then judged manually.
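The decision logic of step 6 can be sketched as follows. The propagation persistence values are taken as given inputs, and the relative persistence is assumed to be each item's share of the total persistence. The function names, thresholds and sample values are hypothetical.

```python
def relative(values):
    """Normalise a mapping of demand item -> value so that the values sum to 1."""
    total = sum(values.values())
    return {item: value / total for item, value in values.items()}

def evaluate_demand_item(item, lexicon, importance, relative_persistence,
                         importance_threshold, persistence_threshold):
    """Return 'known', 'candidate' (to be confirmed by an engineer) or 'rejected'."""
    if item in lexicon:
        return "known"
    if (importance > importance_threshold
            and relative_persistence > persistence_threshold):
        return "candidate"
    return "rejected"

lexicon = {"subway speed"}
persistence = relative({"subway noise": 3.0, "subway speed": 1.0})

print(evaluate_demand_item("subway noise", lexicon, 0.4,
                           persistence["subway noise"], 0.2, 0.5))  # candidate
print(evaluate_demand_item("subway speed", lexicon, 0.6,
                           persistence["subway speed"], 0.2, 0.5))  # known
```

A 'candidate' result only recommends the item; it is added to the demand lexicon after manual confirmation.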
Based on the method, a dynamic subway passenger demand acquisition system can be constructed, comprising a data acquisition module, a text preprocessing module, a text filtering module, a text clustering module, a demand extraction module, a new demand evaluation module and a demand lexicon. The system further comprises a demand checking module and a demand lexicon management module, through which engineers use and maintain the system.
The demand lexicon stores demand items related to the needs of subway vehicle passengers, specifically the words associated with those needs.
The data acquisition module acquires text data from the social network platform, using the demand items in the demand lexicon as keywords to capture relevant text. In addition, data can be acquired in real time by setting the acquisition frequency of the module.
The text preprocessing module preprocesses the acquired text: it applies the filtering method to perform preliminary filtering, performs word segmentation and part-of-speech tagging on the filtered text, and removes words without substantive meaning based on the stop word list and the part of speech.
The text filtering module filters out texts irrelevant to passenger demand. It obtains feature words that can identify the text type by using an information gain feature selection method, obtains the feature values of the feature words by using the term frequency-inverse document frequency (TF-IDF) calculation method, vectorizes the feature values of each text, and feeds the vectors to a support vector machine classifier, which outputs the filtered text.
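The information gain feature selection mentioned above can be sketched as follows. Because the formulas in this document survive only as figure references, the sketch uses the standard textbook definition of information gain for a two-class problem, expressed with the counts A, B, C and D that the claims define; it may differ in detail from the formula in the figures. The function names and the example counts are hypothetical.

```python
import math
from typing import Dict, List, Tuple


def entropy(n_relevant: int, n_irrelevant: int) -> float:
    """Binary entropy (in bits) of a relevant/irrelevant split."""
    total = n_relevant + n_irrelevant
    result = 0.0
    for count in (n_relevant, n_irrelevant):
        if count > 0:
            p = count / total
            result -= p * math.log2(p)
    return result


def information_gain(a: int, b: int, c: int, d: int) -> float:
    """Information gain of one word (standard definition, an assumption here).

    a, b: relevant / irrelevant texts that contain the word
    c, d: relevant / irrelevant texts that do not contain the word
    """
    n = a + b + c + d
    if n == 0:
        return 0.0
    h_all = entropy(a + c, b + d)   # entropy before seeing the word
    h_present = entropy(a, b)       # entropy among texts with the word
    h_absent = entropy(c, d)        # entropy among texts without it
    return h_all - (a + b) / n * h_present - (c + d) / n * h_absent


def select_feature_words(
    counts: Dict[str, Tuple[int, int, int, int]], threshold: float
) -> List[str]:
    """Keep the words whose information gain exceeds the threshold."""
    return sorted(
        word for word, abcd in counts.items() if information_gain(*abcd) > threshold
    )


# Hypothetical (A, B, C, D) counts from 50 relevant and 50 irrelevant texts.
counts = {"crowded": (40, 5, 10, 45), "today": (25, 25, 25, 25)}
print(select_feature_words(counts, threshold=0.1))
```

A word such as "today", which is spread evenly over both classes, has zero gain and is dropped, while "crowded" separates the classes well and is kept.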
The text clustering module performs relevance clustering on the filtered text data. It clusters the filtered texts with the K-means algorithm and determines the number of clusters using the average contour (silhouette) coefficient.
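A minimal sketch of choosing the cluster number by the average silhouette coefficient is given below, in plain Python. Several points are assumptions rather than details from the source: the silhouette uses the standard definition (b is the smallest average distance to any other cluster), the initial centers are chosen with a deterministic farthest-first rule so that the example is reproducible, and the 2-D points stand in for the TF-IDF feature vectors. All function names are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple

Vector = Sequence[float]


def distance(a: Vector, b: Vector) -> float:
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def k_means(points: List[Vector], k: int, iterations: int = 100) -> List[int]:
    """Lloyd's algorithm with farthest-first initialization (an assumption)."""
    centers = [list(points[0])]
    while len(centers) < k:
        farthest = max(points, key=lambda p: min(distance(p, c) for c in centers))
        centers.append(list(farthest))
    labels = [-1] * len(points)
    for _ in range(iterations):
        new_labels = [
            min(range(k), key=lambda j: distance(p, centers[j])) for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the algorithm has converged
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old center if a cluster becomes empty
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def mean_silhouette(points: List[Vector], labels: List[int]) -> float:
    """Average silhouette coefficient over all points."""
    clusters = sorted(set(labels))
    if len(clusters) < 2:
        return 0.0
    total = 0.0
    for i, p in enumerate(points):
        same = [
            distance(p, q)
            for j, q in enumerate(points)
            if labels[j] == labels[i] and j != i
        ]
        if not same:
            continue  # a singleton cluster contributes 0
        a = sum(same) / len(same)
        b = min(
            sum(distance(p, q) for q, lab in zip(points, labels) if lab == c)
            / labels.count(c)
            for c in clusters
            if c != labels[i]
        )
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(points)


def best_cluster_count(
    points: List[Vector], k_values: List[int]
) -> Tuple[int, Dict[int, float]]:
    """Run K-means for each k and return the k with the highest silhouette."""
    scores = {k: mean_silhouette(points, k_means(points, k)) for k in k_values}
    return max(scores, key=scores.get), scores


# Three well-separated groups of hypothetical 2-D feature vectors.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (20, 0), (20, 1), (21, 0), (21, 1)]
best_k, scores = best_cluster_count(points, [2, 3, 4])
print(best_k)
```

For this toy data the score peaks at three clusters, since two clusters split one group across the others and four clusters break up a compact group.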
The demand extraction module extracts the passenger demand items in each cluster. By calculating the frequency of occurrence of each term in each cluster, the module recommends the more frequent terms to the engineer, who then names the demand item. The importance of the demand item is determined by the relative propagation heat.
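The term recommendation step can be illustrated with a short sketch. It assumes each text has already been segmented into a list of terms; the function name, the `top_n` parameter and the sample tokens are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Tuple


def recommend_terms(
    clusters: Dict[int, List[List[str]]], top_n: int = 3
) -> Dict[int, List[Tuple[str, int]]]:
    """For each cluster, count how often each term occurs across its texts
    and return the top_n most frequent terms with their counts."""
    recommendations = {}
    for cluster_id, texts in clusters.items():
        counts = Counter(term for tokens in texts for term in tokens)
        recommendations[cluster_id] = counts.most_common(top_n)
    return recommendations


# Hypothetical segmented texts belonging to one cluster.
clusters = {
    0: [["wifi", "signal", "weak"], ["wifi", "slow"], ["wifi", "signal"]],
}
print(recommend_terms(clusters, top_n=2))  # prints {0: [('wifi', 3), ('signal', 2)]}
```

The engineer would see "wifi" and "signal" as the leading terms and could name the demand item accordingly.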
The new demand evaluation module judges whether a demand item is already contained in the demand lexicon and updates the demand lexicon. It compares the relative propagation heat and the relative propagation persistence of a new demand with the set thresholds and, when the thresholds are met, recommends the demand to an engineer. The engineer judges whether it is a genuinely new demand; if so, it is stored in the demand lexicon, which is thereby updated.
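The decision flow of this module can be sketched as below. The relative heat and persistence are taken as already computed inputs, and the engineer's judgment is modeled as a callback; the function name, the callback and the threshold values are all hypothetical.

```python
from typing import Callable, Set


def evaluate_demand(
    item: str,
    rel_heat: float,
    rel_persistence: float,
    lexicon: Set[str],
    heat_threshold: float,
    persistence_threshold: float,
    engineer_confirms: Callable[[str], bool],
) -> bool:
    """Return True if `item` was added to the lexicon as a new demand."""
    if item in lexicon:
        return False  # already known, nothing to update
    if rel_heat <= heat_threshold or rel_persistence <= persistence_threshold:
        return False  # not prominent or persistent enough to recommend
    if not engineer_confirms(item):
        return False  # recommended, but rejected on manual review
    lexicon.add(item)
    return True


lexicon = {"wifi", "noise"}
added = evaluate_demand(
    "usb charging", 0.25, 0.40, lexicon,
    heat_threshold=0.2, persistence_threshold=0.3,
    engineer_confirms=lambda name: True,  # stand-in for the manual decision
)
print(added, sorted(lexicon))
```

Keeping the manual confirmation as the last gate mirrors the text: thresholds only produce a recommendation, and the lexicon changes only after the engineer agrees.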
The system can further be provided with a demand checking module and a demand lexicon management module. The demand checking module uses a display to provide a visual interface for extracting, evaluating and checking passenger demands. Demand extraction and evaluation are implemented in the same way as the corresponding method steps and are not described again. In addition, the demand items extracted by the demand extraction module and their calculated importance are displayed as a graph; for a subway vehicle, for example, in the form shown in Fig. 4. In the graph, curve A is the importance curve for subway Wi-Fi, curve B for subway stability, curve C for subway speed, and curve D for subway noise.
The demand lexicon management module maintains the demand lexicon, continuously enriching it with newly acquired demands and supporting the modification and deletion of demand items.
Traditional methods of acquiring subway vehicle passenger demand not only consume a large amount of manpower but are also highly subjective. The invention therefore provides a dynamic subway passenger demand acquisition method and system based on a social network platform. Text mining techniques from data mining are applied to the texts of social network platform users to reveal passengers' demands on subway vehicles. Compared with traditional methods, the method can automatically analyze a large volume of user texts, uncover potential passenger demands, improve the efficiency of user data acquisition and reduce subjective influence. It can analyze the dynamic demands of massive numbers of users in real time, continuously capture passenger demand preferences and extract the importance of valid passenger demands. In addition, emerging and potential user demands can be discovered automatically in real time and used as drivers for subway vehicle research and development.

Claims (6)

1. A dynamic subway passenger demand acquisition method is characterized by comprising the following steps:
step 1: constructing a demand word bank, and acquiring user text data from a social network platform according to the demand word bank;
step 2: preprocessing the data acquired in the step 1;
step 3: filtering out texts irrelevant to subway passenger demand by using a support vector machine classifier, specifically:
S11: randomly sampling the text preprocessed in step 2 to generate training samples and test samples;
S12: determining relevant texts and irrelevant texts from the training samples, and determining the feature words of the relevant texts and the irrelevant texts respectively, by calculating the information entropy of the training samples and the information gain value of each word and taking the words whose gain values are greater than a set threshold as the feature words;
the information entropy IG(X) of the training samples is calculated as follows:
Figure FDA0003297110840000011
in the formula: X is the training sample set, and $N_1$ and $N_2$ are the number of relevant texts and the number of irrelevant texts, respectively;
the information gain value IG(word) of each word is calculated as follows:
Figure FDA0003297110840000012
in the formula: word is a word in the training sample set, A and B are the frequencies with which the word occurs in relevant texts and irrelevant texts, respectively, and C and D are the frequencies with which the word does not occur in relevant texts and irrelevant texts, respectively;
S13: calculating the feature value of each feature word in each text, and representing the text as a feature value vector;
S14: constructing a support vector machine classifier from the training samples, and refining the classifier with the test samples;
S15: classifying the data into demand-relevant texts and irrelevant texts by using the support vector machine classifier obtained in step S14, and removing the irrelevant texts;
step 4: performing relevance clustering on the texts filtered in step 3 by using a K-means clustering method corrected by the contour coefficient;
step 5: giving each cluster from step 4 a label as a demand item, and calculating the importance of the demand item;
step 6: judging whether the demand item obtained in step 5 exists in the demand word bank; if so, exiting; otherwise, judging whether the importance and the relative propagation persistence of the demand item both meet preset thresholds; if so, a new demand item has been found and is added to the demand word bank; if not, exiting;
the relative propagation persistence is calculated as follows:
S31: the propagation persistence $j_k$ is:
Figure FDA0003297110840000013
in the formula: $r'_{k0}$, $r'_{k1}$ and $r'_{k2}$ are the propagation heat values acquired in three consecutive time periods, where $r'_{k0}$ is the propagation heat from the current acquisition;
S32: the relative propagation persistence $J_k$ is:
Figure FDA0003297110840000021
in the formula: S is the total number of text sets, $j_i$ is the propagation persistence of the i-th demand item, and i is the demand item label.
2. The method for dynamically acquiring the demand of subway passengers as claimed in claim 1, wherein the data acquisition process of step 1 is as follows:
searching the social network platform with the words in the demand word bank as keywords to obtain user texts, and acquiring the text data through a web crawler.
3. The method for dynamically acquiring the demand of subway passengers as claimed in claim 1, wherein the K-means clustering method corrected by the contour coefficient in step 4 first clusters by K-means and then determines the optimal cluster number k by the contour coefficient;
the K-means clustering process is as follows:
determining the sum of squared distances $dist(S_k)$ from each point in a cluster to the cluster center:
Figure FDA0003297110840000022
in the formula: $S_k$ is the text set of each cluster, $x_i$ is the feature value vector of a text in cluster $S_k$, $n_s$ is the number of texts in cluster $S_k$, $u_k$ is the cluster center of cluster $S_k$, and i is the text label within the cluster;
wherein $u_k$ is as follows:
Figure FDA0003297110840000023
the sum of squared distances $dist(S)$ from all samples in the clustering domain to their cluster centers is:
Figure FDA0003297110840000024
in the formula: k is the number of clusters, S is the total text set number, and j is the label of each cluster in the text set;
the contour coefficient $L(x_i)$ is as follows:
Figure FDA0003297110840000025
in the formula: $a(x_i)$ is the average distance from text $x_i$ to all other texts in the same cluster, and $b(x_i)$ is the average distance from text $x_i$ to all texts in a cluster other than that of $x_i$;
the average contour coefficient $L(x)_k$ is:
Figure FDA0003297110840000031
in the formula: n is the number of texts in the whole text set;
when the average contour coefficient is at its maximum, the corresponding cluster number k is the optimal cluster number.
4. The method of claim 1, wherein the importance in step 5 is calculated as follows:
S21: the propagation heat $r_k$ is:
Figure FDA0003297110840000032
in the formula: $n_s$ is the number of texts in each cluster, $Z_i$ is the number of forwards of the i-th text in each cluster, $D_i$ is the number of likes of the i-th text in each cluster, $P_i$ is the number of comments on the i-th text in each cluster, $w_1$, $w_2$ and $w_3$ are constants, and k is the cluster number;
S22: the propagation heat is corrected by the propagation breadth:
$r'_k = r_k \times g_k$
in the formula: $r'_k$ is the corrected propagation heat, $g_k$ is the propagation breadth, $g_k = l_s / n_s$, and $l_s$ is the number of users who posted texts in each cluster;
S23: the importance $R_k$ is calculated as follows:
Figure FDA0003297110840000033
in the formula: S is the total text set number, $r'_i$ is the corrected propagation heat of the i-th demand item, and i is the demand item label.
5. The method of claim 1, wherein the feature values in step S13 are measured by term frequency-inverse document frequency (TF-IDF), which is calculated as follows:
TF-IDF(word)=TF(word)×IDF(word)
in the formula: TF(word) is the frequency with which a word appears in one text, and IDF(word) is the inverse document frequency of the word in the text set;
wherein:
Figure FDA0003297110840000034
in the formula: w (word) is the number of times a word appears in a text, W is the total number of words in the text where the word is located, F is the total number of words in the training sample, and F (word) is the number of times a word appears in the training sample.
6. An acquisition system for the dynamic subway passenger demand acquisition method according to any one of claims 1 to 5, comprising a data acquisition module, a text preprocessing module, a text filtering module, a text clustering module, a demand extraction module, a new demand evaluation module and a demand lexicon;
the demand lexicon is used for storing demand items related to the passenger demands of the subway vehicles;
the data acquisition module is used for acquiring text data in the social network platform;
the text preprocessing module is used for preprocessing the acquired text;
the text filtering module is used for filtering out texts irrelevant to passenger requirements from the texts;
the text clustering module is used for performing relevance clustering on the filtered text data;
the demand extraction module is used for extracting demand items in each cluster;
and the new demand evaluation module is used for judging whether a demand item is contained in the demand lexicon and for updating the demand lexicon.
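The importance calculation recited in claim 4 (steps S21 to S23) can be sketched as follows. Only $r'_k = r_k \times g_k$ and $g_k = l_s / n_s$ are stated in the text; the sketch assumes that the propagation heat is a weighted sum of forwards, likes and comments over the texts in a cluster, and that the importance is the corrected heat divided by the total over all demand items, since those formulas survive only as figure references. The `Post` class, the function names and the weights are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Post:
    user: str
    forwards: int  # Z_i
    likes: int     # D_i
    comments: int  # P_i


def corrected_heat(
    posts: List[Post], w1: float = 1.0, w2: float = 1.0, w3: float = 1.0
) -> float:
    """Propagation heat of one cluster, corrected by propagation breadth."""
    if not posts:
        return 0.0
    # Assumed form of r_k: weighted sum over all texts in the cluster.
    r_k = sum(w1 * p.forwards + w2 * p.likes + w3 * p.comments for p in posts)
    # g_k = l_s / n_s: distinct posting users divided by number of texts.
    g_k = len({p.user for p in posts}) / len(posts)
    return r_k * g_k


def importance(clusters: Dict[str, List[Post]]) -> Dict[str, float]:
    """Assumed form of R_k: corrected heat as a share of the total."""
    heats = {name: corrected_heat(posts) for name, posts in clusters.items()}
    total = sum(heats.values())
    return {name: (h / total if total else 0.0) for name, h in heats.items()}


# Hypothetical clusters: "noise" is posted twice by the same user.
clusters = {
    "wifi": [Post("u1", 10, 20, 5), Post("u2", 0, 10, 5)],
    "noise": [Post("u3", 5, 5, 0), Post("u3", 20, 10, 10)],
}
print(importance(clusters))
```

Both clusters have the same raw heat here, but the breadth correction halves the heat of "noise" because all of its texts come from a single user.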
CN201910561357.XA 2019-06-26 2019-06-26 Subway passenger demand dynamic acquisition method and acquisition system thereof Active CN110347828B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910561357.XA CN110347828B (en) 2019-06-26 2019-06-26 Subway passenger demand dynamic acquisition method and acquisition system thereof

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910561357.XA CN110347828B (en) 2019-06-26 2019-06-26 Subway passenger demand dynamic acquisition method and acquisition system thereof

Publications (2)

Publication Number Publication Date
CN110347828A CN110347828A (en) 2019-10-18
CN110347828B true CN110347828B (en) 2022-03-15

Family

ID=68183218

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910561357.XA Active CN110347828B (en) 2019-06-26 2019-06-26 Subway passenger demand dynamic acquisition method and acquisition system thereof

Country Status (1)

Country Link
CN (1) CN110347828B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114297401A (en) * 2021-12-14 2022-04-08 中航机载系统共性技术有限公司 System knowledge extraction method based on clustering algorithm
CN114445141A (en) * 2022-01-26 2022-05-06 西南交通大学 Customer demand obtaining method

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107909478A (en) * 2017-11-27 2018-04-13 苏州点对点信息科技有限公司 FOF mutual fund portfolio system and methods based on social network clustering and information gain entropy index
CN107908753A (en) * 2017-11-20 2018-04-13 合肥工业大学 Customer demand method for digging and device based on social media comment data

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2011072125A2 (en) * 2009-12-09 2011-06-16 Zemoga, Inc. Method and apparatus for real time semantic filtering of posts to an internet social network
US20130080212A1 (en) * 2011-09-26 2013-03-28 Xerox Corporation Methods and systems for measuring engagement effectiveness in electronic social media
CN103678564B (en) * 2013-12-09 2017-02-15 国家计算机网络与信息安全管理中心 Internet product research system based on data mining
CN104484343B (en) * 2014-11-26 2017-11-03 无锡清华信息科学与技术国家实验室物联网技术中心 It is a kind of that method of the motif discovery with following the trail of is carried out to microblogging
CN108388660B (en) * 2018-03-08 2021-10-01 中国计量大学 Improved E-commerce product pain point analysis method
CN109165996B (en) * 2018-07-18 2022-02-11 浙江大学 Product functional feature importance analysis method based on online user comments
CN109829166B (en) * 2019-02-15 2022-12-27 重庆师范大学 People and host customer opinion mining method based on character-level convolutional neural network

Also Published As

Publication number Publication date
CN110347828A (en) 2019-10-18

Similar Documents

Publication Publication Date Title
CN110837931B (en) Customer churn prediction method, device and storage medium
US10579646B2 (en) Systems and methods for classifying electronic documents
CN103927675B (en) Judge the method and device of age of user section
CN108733816B (en) Microblog emergency detection method
Coussement et al. Improving customer complaint management by automatic email classification using linguistic style features as predictors
EP3279804A1 (en) Data analysis system, data analysis method, data analysis program, and recording medium
CN103309862B (en) Webpage type recognition method and system
CN108038627B (en) Object evaluation method and device
CN104077407B (en) A kind of intelligent data search system and method
CN108550054B (en) Content quality evaluation method, device, equipment and medium
CN110347828B (en) Subway passenger demand dynamic acquisition method and acquisition system thereof
CN105225135B (en) Potential customer identification method and device
CN108363748B (en) Topic portrait system and topic portrait method based on knowledge
CN111079941A (en) Credit information system combining expert experience model and supervised machine learning algorithm
CN110134799A (en) A kind of text corpus based on BM25 algorithm build and optimization method
CN112418695A (en) Multi-dimensional portrait construction method and recommendation method for scientific researchers in tobacco field
CN115099310A (en) Method and device for training model and classifying enterprises
Ge et al. Measure and Mitigate the Dimensional Bias in Online Reviews and Ratings.
CN112632958A (en) Contract document examination and analysis method based on contract knowledge base
CN111625578A (en) Feature extraction method suitable for time sequence data in cultural science and technology fusion field
CN114445141A (en) Customer demand obtaining method
Han et al. User requirements dynamic elicitation of complex products from social network service
Liu et al. Identification of subway track irregularities based on detection data of portable detector
JP6320353B2 (en) Digital marketing system
KR102078541B1 (en) Issue interest based news value evaluation apparatus and method, storage media storing the same

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant