CN104462544A - Passengers' demand oriented metro/high-speed rail vehicle-mounted server video updating method - Google Patents

Passengers' demand oriented metro/high-speed rail vehicle-mounted server video updating method

Info

Publication number
CN104462544A
CN104462544A CN201410820591.7A CN201410820591A CN104462544A CN 104462544 A CN104462544 A CN 104462544A CN 201410820591 A CN201410820591 A CN 201410820591A CN 104462544 A CN104462544 A CN 104462544A
Authority
CN
China
Prior art keywords
entry
classification
search
high frequency
passengers
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201410820591.7A
Other languages
Chinese (zh)
Inventor
王勇
康强
王志刚
赵晓光
张元庆
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
DALIAN SEASKY AUTOMATION Co Ltd
Original Assignee
DALIAN SEASKY AUTOMATION Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by DALIAN SEASKY AUTOMATION Co Ltd
Priority to CN201410820591.7A
Publication of CN104462544A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70 Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Library & Information Science (AREA)
  • Multimedia (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention belongs to the field of information technology, involves text statistics, text classification and distributed computing, and provides a passenger-demand-oriented method for updating the videos on metro/high-speed rail onboard servers. The method comprises two stages. The first stage counts high-frequency text: a parallel, distributed counting method is applied to passengers' daily search records to find the search terms that occur most often, and these terms are taken to represent popular content that passengers currently want. The second stage classifies the high-frequency search terms: a large number of commonly used search keywords are first labelled to build a sample library, a text classification model is then built from that library, and finally the high-frequency terms obtained in the first stage are passed to the model for classification. The purpose of the classification is to make it easier to upload videos and for passengers to browse and download them.

Description

A passenger-demand-oriented video update method for metro/high-speed rail onboard servers
Technical field
The invention belongs to the field of information technology, involves text statistics, text classification and distributed computing, and is a passenger-demand-oriented method for updating the videos on metro/high-speed rail onboard servers. The method comprises two stages. The first stage counts high-frequency text: a parallel, distributed counting method is applied to passengers' daily search records to find the search terms that occur most often, and these terms are taken to represent content of some popularity that passengers currently want. The second stage classifies the high-frequency search terms: a large number of commonly used search keywords are first labelled to build a sample library, a text classification model is then built from that library, and finally the high-frequency terms obtained from the counting stage are passed to the model for classification. The purpose of the classification is to make it easier to upload videos and for passengers to browse and download them. The method can improve the service quality of rail transit by ensuring, as far as possible, that passengers can obtain the videos they want from the onboard server. It can also be used to update other data on the onboard server.
Background art
In recent years, rail transit has become one of the main means of transport in modern cities, and it plays an increasingly important role in urban modernization and the move toward intelligent cities. A passenger-friendly rail transit system depends on a complete service system, and offering passengers onboard video that they can watch or download free of charge inside the train is an important way to improve service quality. Video cannot be offered blindly, however: what is uploaded must reflect both passengers' needs and the popularity of the content itself.
How can the popularity and timeliness of the video, audio and other data on offer be guaranteed so that passengers are satisfied? Surveying passengers is not feasible, because the workload is enormous and the coverage of a survey is hard to guarantee. The invention instead treats the problem as one of text mining: relevant information is obtained from passengers' historical search records, and mining those records reveals passengers' actual needs. No research at home or abroad has yet addressed updating the video content of onboard servers, but text mining itself is studied by many scholars, and the most typical work concerns text mining on the internet. The information on the internet is vast and changes rapidly. People urgently need to find resources and knowledge on the web quickly and effectively, to make information search and use on the web more efficient, to improve and organize web retrieval results, and to provide personalized services; all of these depend on mining the text content of the web (R. Etemadi, N. Moghaddam. An approach in web content mining for clustering web pages. Proceedings of the 5th International Conference on Digital Information Management. Thunder Bay, ON: IEEE, 2010: 279-284). Beyond the internet, text mining is also widely used in stock and securities markets: by summarizing political and economic news, extracting the keywords in that news that relate to market rises and falls, and tracking how the market moves when those keywords appear, changes in stock prices can be predicted (B. Wuthrich, D. Permunetilleke, S. Leung, et al. Daily prediction of major stock indices from textual WWW data. New York: Proceedings of the 4th International Conference on Knowledge Discovery, 1998). In recent years, Chinese scholars have also applied text mining to traditional Chinese medicine, mining the dosages and dosing strategies of Chinese medicines (Ji Hangyu, Jiao Yongzheng, Lian Fengmei, et al. Text mining research on the dosage strategies of "Treatise on Febrile Diseases" and "Synopsis Golden Chamber". Chinese traditional Chinese medicine magazine, 27(1): 16-19, 2012). Text mining is therefore an effective way to mine Chinese text and unstructured data.
One very important issue in text mining is the volume of text: when the volume is huge, high-precision mining is difficult. Supervised learning on such text is rarely practical, so semi-supervised mining is used in most cases, and semi-supervised methods inevitably lower the precision that can be expected. The passenger search records studied in this invention are certainly huge in volume, and existing methods cannot mine them directly while maintaining high precision.
Summary of the invention
The technical problem addressed by the invention is how to update the videos on metro/high-speed rail onboard servers. Solving it requires analysing passenger demand. Because passengers are highly mobile and very numerous, their needs are hard to learn directly; the invention therefore mines the text of passengers' online search and browsing records to determine demand. Because the volume of search-record text is huge, the invention proposes a two-stage text mining approach: the first stage counts how often each distinct search term occurs in the text to obtain the high-frequency search terms; the second stage classifies and organizes those high-frequency terms.
The overall flow of the technical solution is shown in Fig. 1. The specific steps are as follows:
1. Obtain the required text, which can be drawn from users' search history on search engines and video websites;
2. On the Hadoop platform, use a distributed counting method to count the terms that occur most often in the text, yielding the high-frequency terms;
3. Classify and organize the screened high-frequency terms with a Naive Bayes classifier;
4. Review the classification results from step 3, manually correct the very few misclassified terms, and upload the video files corresponding to all high-frequency terms to the onboard server.
Effects and benefits of the invention:
First, the method updates the videos on the onboard server in a targeted way, so that passenger demand is met as far as possible. Second, the method is computationally efficient, chiefly because it uses a distributed counting algorithm on the Hadoop platform to find the terms users search for most often; Hadoop's parallel, distributed processing improves the efficiency of the algorithm. Finally, the method is more reasonable than the conventional approach of classifying the vocabulary of the text library directly. Online search history is so large that directly classifying the full vocabulary is very difficult: supervised classification is almost impossible, and semi-supervised classification inevitably degrades the classification results considerably. The invention first counts the high-frequency terms and ignores terms that are neither popular nor of interest, reducing the vocabulary to a few hundred terms, so that supervised classification of the high-frequency terms becomes feasible. The design is sound and the computational cost is low.
Brief description of the drawings
Fig. 1 is the overall flowchart of the invention.
Fig. 2 is a schematic diagram of the carriage structure of a rail train.
Fig. 3 shows the result of running the Map function used when counting high-frequency terms.
Fig. 4 shows the result of running the Reduce function used when counting high-frequency terms.
Fig. 5 is the flowchart of term classification based on the Naive Bayes classifier.
Detailed description
For a better understanding of the technical solution of the invention, it is described in detail below with reference to the drawings. Fig. 2 is a schematic diagram of one carriage of a train. The carriage is fitted with two servers, each with a built-in 500 GB hard disk, and the capacity of built-in disks is likely to grow as technology advances. The two servers therefore have a large storage capacity and can hold a large amount of video for passengers to browse and download free of charge while travelling. Understanding passengers' real needs and ensuring the popularity and timeliness of the video content are very important for improving the service quality of rail transit. To this end, the invention provides a passenger-demand-oriented video update method for metro/high-speed rail onboard servers. With reference to Fig. 1, the specific steps are as follows:
Step 1: Obtain the required search-record text. The text can be drawn from users' search history on search engines and video websites, and it consists of a number of search terms.
Step 2: Count high-frequency terms with the MapReduce framework on the Hadoop platform
In Hadoop, a MapReduce job is divided into two stages, the Map stage and the Reduce stage. Each stage is represented by a function: the Map function and the Reduce function.
Map stage: The search-record text is split into several parts, each of which is the input to a Map function, as shown in Fig. 3. Each Map function traverses all the terms in the text it receives and marks every occurrence of a term once. The output of a Map function takes the form of key-value pairs such as <temptation, 1>, where "temptation" is the key and 1 is the value. Each Map function thus outputs one group of key-value pairs.
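The Map stage described above can be sketched in plain Python. This is only an illustration of the idea, not Hadoop's API: the function name `map_terms` is hypothetical, and the sketch assumes each line of a split holds exactly one search term.

```python
from typing import Iterable, Iterator, Tuple


def map_terms(split: Iterable[str]) -> Iterator[Tuple[str, int]]:
    """Emit a (term, 1) pair for every search term in one split of the log.

    Assumption: each line of the split contains exactly one search term.
    """
    for line in split:
        term = line.strip()
        if term:  # skip blank lines
            yield (term, 1)


# One small split of a hypothetical search log.
split = ["temptation", "red sorghum", "temptation", ""]
pairs = list(map_terms(split))
print(pairs)
```

Every occurrence yields its own pair, so a term that appears twice in a split produces two pairs; adding them up is left to the Reduce stage.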
Reduce stage: Hadoop gathers the values output by the Map stage that share the same key and passes them to a Reduce function together. The Reduce function receives input of the form <key, value set>, for example the key-value set <red sorghum → 1, 1, 1, 1> shown in Fig. 4. It processes the value set and again outputs a key-value pair, such as <red sorghum, 4> in Fig. 4, which means that four passengers searched for "red sorghum".
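The grouping and Reduce steps can be sketched the same way. The names `shuffle` and `reduce_counts` are hypothetical; in a real Hadoop job the grouping by key is done by the framework, and the `shuffle` function here only stands in for it.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group values by key, standing in for what the framework does between Map and Reduce."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)


def reduce_counts(key: str, values: List[int]) -> Tuple[str, int]:
    """Collapse <key, value set> into <key, total count>."""
    return (key, sum(values))


# Map output for a tiny example: four searches for one term, three for another.
map_output = [("red sorghum", 1)] * 4 + [("transformer", 1)] * 3
grouped = shuffle(map_output)
counts = dict(reduce_counts(k, v) for k, v in grouped.items())
print(counts)
```

Here `grouped["red sorghum"]` is the value set `[1, 1, 1, 1]`, and the Reduce step turns it into the count 4.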
In this way the number of times each term occurs in the search history is counted fairly easily. In Fig. 4, for example, "red sorghum" and "griggles" each occur 4 times, while "transformer" and "acquired immune deficiency syndrome (AIDS) day" each occur 3 times. A sorting algorithm can then rank the search terms by count, so that the popular terms with the highest counts can be selected. How many terms to keep depends on the storage capacity of the onboard server: if a two-hour film takes about 1.5 GB, a 500 GB hard disk can hold more than 320 films. This is only a simple example; in practice both the text and the set of terms are very large.
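The ranking and the capacity estimate can be written out as a short sketch. The helper names `film_capacity` and `top_terms`, the extra "rare term" entry, and the alphabetical tie-break are assumptions made for illustration; the text does not say which sorting algorithm or tie-break rule is used.

```python
from typing import Dict, List, Tuple


def film_capacity(disk_gb: float, film_gb: float) -> int:
    """Number of whole films of a given size that fit on a disk."""
    return int(disk_gb // film_gb)


def top_terms(counts: Dict[str, int], n: int) -> List[Tuple[str, int]]:
    """Return the n most frequent terms; ties are broken alphabetically (an assumption)."""
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:n]


counts = {
    "red sorghum": 4,
    "griggles": 4,
    "transformer": 3,
    "acquired immune deficiency syndrome (AIDS) day": 3,
    "rare term": 1,  # hypothetical low-frequency term that gets filtered out
}
capacity = film_capacity(500, 1.5)
popular = top_terms(counts, 4)
print(capacity, popular)
```

Simple division gives 333 films for a 500 GB disk at 1.5 GB per film, which agrees with the "more than 320" figure; in practice `n` would be chosen from that capacity rather than fixed at 4.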
Step 3: Classify the high-frequency terms obtained in step 2 with a Naive Bayes classifier
Naive Bayes classification is a simple probabilistic classification method that is often used in text mining, but it requires an additional assumption: the probability of a lexical item is independent of its position in the text and of its context. For the high-frequency term classification in this invention, the probability that a feature word occurs in a term has no necessary connection with the word's position or context, so the independence assumption holds. Classifying the high-frequency terms involves the following steps:
(1) Label terms and build the training sample set:
Terms are labelled in order to build training samples; labelling a term means marking the category of a term used for training. The invention uses the historical content stored on the server as the training samples. This has two advantages: the share of storage space occupied by each category can be determined directly, and the category of each term is already known, so no separate labelling is needed. For example, the proportions of the total storage space given to political, sports, entertainment and educational videos are set in advance.
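As a rough illustration of how the stored content yields category proportions, the sketch below counts stored titles per category. The function name `category_shares` and the sample titles are hypothetical, and the sketch counts titles rather than disk space, which is an assumption; shares by storage size would weight each title by its file size instead.

```python
from collections import Counter
from typing import Dict, List, Tuple


def category_shares(stored: List[Tuple[str, str]]) -> Dict[str, float]:
    """Fraction of stored titles that fall in each category."""
    counts = Counter(category for _, category in stored)
    total = sum(counts.values())
    return {category: n / total for category, n in counts.items()}


# Hypothetical server content: (title, category) pairs that are already labelled.
stored = [
    ("title A", "entertainment"),
    ("title B", "entertainment"),
    ("title C", "sports"),
    ("title D", "education"),
]
shares = category_shares(stored)
print(shares)
```

Because every stored title already carries a category, the same pairs can serve directly as labelled training samples.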
(2) Build the Bayesian classifier model:
On the precondition that the features in a term are distributed independently of one another, the Naive Bayes classifier can be written in the following mathematical form:
$$P(c_i \mid d, \theta) = \frac{P(c_i \mid \theta)\, P(d \mid c_i, \theta)}{P(d \mid \theta)} = \frac{P(c_i \mid \theta) \prod_{j=1}^{m} P(f_j \mid c_i, \theta)^{n_j}}{P(d \mid \theta)} \tag{1}$$
where $c_i$ is the class label, $d$ is the search term, and $\theta$ denotes the parameters associated with the prior probability. The term $d$ consists of several lexical items; $f_j$ denotes a lexical item and $m$ is the number of lexical items. For example, the term "Venezuela intend to China payment of a debt island exposure" can be divided into the lexical items "Venezuela, intend to, China, payment of a debt, island, exposure", so that $m = 6$. Some terms for films and television series may have only one lexical item, such as "fighting secretly".
The denominator $P(d \mid \theta)$ in the formula above is the same constant for every category. Writing $C$ for the total number of categories, $P(d \mid \theta)$ can be expressed as
$$P(d \mid \theta) = \sum_{i=1}^{C} \left( P(c_i \mid \theta) \prod_{j=1}^{m} P(f_j \mid c_i, \theta)^{n_j} \right) \tag{2}$$
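A small numerical sketch may help show how formulas (1) and (2) fit together. The priors and likelihoods below are made-up numbers, the function name `posteriors` is hypothetical, and the exponent $n_j$ is read as the number of times lexical item $f_j$ occurs in the term $d$, which is an assumption because the text does not define it.

```python
from math import prod
from typing import Dict


def posteriors(
    priors: Dict[str, float],
    likelihoods: Dict[str, Dict[str, float]],
    item_counts: Dict[str, int],
) -> Dict[str, float]:
    """Evaluate formula (1) for every class, using formula (2) as the denominator."""
    numerators = {
        c: priors[c] * prod(likelihoods[c][f] ** n for f, n in item_counts.items())
        for c in priors
    }
    evidence = sum(numerators.values())  # formula (2): sum of the numerators over all classes
    return {c: value / evidence for c, value in numerators.items()}


# Made-up probabilities for two classes and two lexical items.
priors = {"sports": 0.5, "politics": 0.5}
likelihoods = {
    "sports": {"match": 0.4, "China": 0.1},
    "politics": {"match": 0.1, "China": 0.4},
}
result = posteriors(priors, likelihoods, {"match": 2, "China": 1})
print(result)
```

With these numbers the numerators are 0.008 for sports and 0.002 for politics, so the posteriors come out at roughly 0.8 and 0.2. Since the denominator is shared by every class, comparing the numerators alone gives the same ranking.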
If the class label for which $P(c_i \mid d, \theta)$ is largest is taken as the target category of term $d$, it suffices to select the category that maximizes the numerator of formula (1), that is, $\arg\max_{c_i} P(c_i \mid \theta) \prod_{j=1}^{m} P(f_j \mid c_i, \theta)^{n_j}$. This requires estimates of the prior probability $P(c_i \mid \theta)$ and of the likelihood terms $P(f_j \mid c_i, \theta)$. In the present case the training samples are known, so the prior probability and the likelihood can be determined from the training samples alone:
$$P(c_i \mid \theta) = \frac{N_i}{N} \tag{3}$$
$$P(f_j \mid c_i, \theta) = \frac{1 + n_{i,j}}{m + \sum_{k=1}^{m} n_{i,k}} \tag{4}$$
where $N_i$ is the number of terms in the training sample set that belong to category $c_i$, $N$ is the total number of terms in the training sample set, $n_{i,j}$ is the number of times feature $j$ occurs in category $c_i$ in the training sample set, and $m$ is the number of lexical items in term $d$.
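The estimates can be sketched in Python as follows. The prior follows formula (3) directly. For the likelihood, the sketch uses the usual add-one (Laplace) smoothing over the training vocabulary, which is a substitution: formula (4) as written takes $m$ to be the number of lexical items in $d$, whereas the code below uses the number of distinct lexical items seen in training. The function names and the tiny sample set are hypothetical.

```python
from collections import Counter, defaultdict
from typing import DefaultDict, Dict, List, Set, Tuple

Sample = Tuple[List[str], str]  # (lexical items of a labelled term, category)


def train(samples: List[Sample]) -> Tuple[Dict[str, float], DefaultDict[str, Counter], Set[str]]:
    """Estimate priors (formula (3)) and collect per-class feature counts n_{i,j}."""
    class_counts = Counter(label for _, label in samples)            # N_i
    priors = {c: n / len(samples) for c, n in class_counts.items()}  # N_i / N
    feature_counts: DefaultDict[str, Counter] = defaultdict(Counter)
    for items, label in samples:
        feature_counts[label].update(items)
    vocabulary = {f for items, _ in samples for f in items}
    return priors, feature_counts, vocabulary


def likelihood(feature: str, label: str,
               feature_counts: DefaultDict[str, Counter], vocabulary: Set[str]) -> float:
    """Add-one smoothed estimate of P(f_j | c_i); assumption: smoothing uses vocabulary size."""
    counts = feature_counts[label]
    return (1 + counts[feature]) / (len(vocabulary) + sum(counts.values()))


samples: List[Sample] = [
    (["football", "match"], "sports"),
    (["football", "final"], "sports"),
    (["election", "news"], "politics"),
]
priors, feature_counts, vocabulary = train(samples)
p_football = likelihood("football", "sports", feature_counts, vocabulary)
p_news = likelihood("news", "sports", feature_counts, vocabulary)
print(priors, p_football, p_news)
```

The added 1 in the numerator keeps the estimate above zero for a lexical item that never occurred in a class, so a single unseen word cannot force the whole product in formula (1) to zero.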
(3) Compare the probabilities that the new term belongs to each class, and assign the term to the class with the highest probability.
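The comparison step amounts to an argmax over the classes. The sketch below is self-contained and hypothetical: the function name `classify` and the sample data are invented for illustration, it sums logarithms instead of multiplying probabilities to avoid underflow, and it uses add-one smoothing over the training vocabulary, which is an assumption rather than the exact estimate given in the text.

```python
from collections import Counter, defaultdict
from math import log
from typing import DefaultDict, Dict, List, Tuple

Sample = Tuple[List[str], str]  # (lexical items of a labelled term, category)


def classify(items: List[str], samples: List[Sample]) -> str:
    """Return the class with the largest prior-times-likelihood score for a new term."""
    class_counts = Counter(label for _, label in samples)
    feature_counts: DefaultDict[str, Counter] = defaultdict(Counter)
    for sample_items, label in samples:
        feature_counts[label].update(sample_items)
    vocab_size = len({f for sample_items, _ in samples for f in sample_items})

    scores: Dict[str, float] = {}
    for label, n_i in class_counts.items():
        counts = feature_counts[label]
        denom = vocab_size + sum(counts.values())
        score = log(n_i / len(samples))  # log prior
        for f in items:  # a repeated item contributes once per occurrence
            score += log((1 + counts[f]) / denom)
        scores[label] = score
    return max(scores, key=scores.get)


samples: List[Sample] = [
    (["football", "match"], "sports"),
    (["football", "final"], "sports"),
    (["election", "news"], "politics"),
]
label_a = classify(["football", "news"], samples)
label_b = classify(["election"], samples)
print(label_a, label_b)
```

Because the logarithm is monotonic, the class with the largest log score is also the class with the largest numerator in formula (1), so the shared denominator never needs to be computed.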
Step 4: Review and organize
Review the classification results from step 3, manually correct the very few misclassified terms, and upload the video files corresponding to all high-frequency terms to the onboard server by category. This completes the video update process for the whole onboard server.

Claims (1)

1. A passenger-demand-oriented video update method for metro/high-speed rail onboard servers, characterized by comprising the following steps:
Step 1: obtain the required search-record text, the text being drawn from users' search history on search engines and video websites and consisting of a number of search terms;
Step 2: on the Hadoop platform, count high-frequency terms with the MapReduce framework, in two stages:
Map stage: the search-record text is split into several parts, each of which is the input to a Map function; each Map function traverses all the terms in the text it receives and marks every occurrence of a term once; the output of a Map function is one group of key-value pairs;
Reduce stage: Hadoop gathers the values output by the Map stage that share the same key and passes them to a Reduce function together; the Reduce function receives input of the form <key, value set>, processes the value set, and again outputs a key-value pair;
the number of times each term occurs in the search history is thereby counted, a sorting algorithm is then used to rank the search terms by count, and the popular terms with the highest counts are selected;
Step 3: classify the high-frequency terms obtained in step 2 with a Naive Bayes classifier;
first, label terms and build the training sample set;
second, build the Bayesian classifier model:
on the precondition that the features in a term are distributed independently of one another, the Naive Bayes classifier is written in the following form:
$$P(c_i \mid d, \theta) = \frac{P(c_i \mid \theta) \prod_{j=1}^{m} P(f_j \mid c_i, \theta)^{n_j}}{P(d \mid \theta)} \tag{1}$$
where $c_i$ is the class label, $d$ is the search term, and $\theta$ denotes the parameters associated with the prior probability; the term $d$ consists of several lexical items, $f_j$ denotes a lexical item, and $m$ is the number of lexical items;
the denominator $P(d \mid \theta)$ in the formula above is the same constant for every category, so it suffices to select as the target category the category that maximizes the numerator of formula (1); in the present case the training samples are known, and the prior probability $P(c_i \mid \theta)$ and the likelihood terms $P(f_j \mid c_i, \theta)$ are determined from the training samples, namely
$$P(c_i \mid \theta) = \frac{N_i}{N} \tag{2}$$
$$P(f_j \mid c_i, \theta) = \frac{1 + n_{i,j}}{m + \sum_{k=1}^{m} n_{i,k}} \tag{3}$$
where $N_i$ is the number of terms in the training sample set that belong to category $c_i$, $N$ is the total number of terms in the training sample set, $n_{i,j}$ is the number of times feature $j$ occurs in category $c_i$ in the training sample set, and $m$ is the number of lexical items in term $d$;
finally, compare the probabilities that the new term belongs to each class, and assign the term to the class with the highest probability;
Step 4: review the classification results from step 3, manually correct the very few misclassified terms, and upload the video files corresponding to all high-frequency terms to the onboard server by category, which completes the video update process for the whole onboard server.
CN201410820591.7A 2014-12-24 2014-12-24 Passengers' demand oriented metro/high-speed rail vehicle-mounted server video updating method Pending CN104462544A (en)

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN201410820591.7A CN104462544A (en) | 2014-12-24 | 2014-12-24 | Passengers' demand oriented metro/high-speed rail vehicle-mounted server video updating method

Applications Claiming Priority (1)

Application Number | Priority Date | Filing Date | Title
CN201410820591.7A CN104462544A (en) | 2014-12-24 | 2014-12-24 | Passengers' demand oriented metro/high-speed rail vehicle-mounted server video updating method

Publications (1)

Publication Number | Publication Date
CN104462544A (en) | 2015-03-25

Family

ID=52908579

Family Applications (1)

Application Number | Status | Priority Date | Filing Date | Title
CN201410820591.7A CN104462544A (en) | Pending | 2014-12-24 | 2014-12-24 | Passengers' demand oriented metro/high-speed rail vehicle-mounted server video updating method

Country Status (1)

Country | Link
CN (1) | CN104462544A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN104765875A (en) * | 2015-04-24 | 2015-07-08 | 海南易建科技股份有限公司 | Distributed processing method and system for passenger behavior data

Citations (3)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
US20120188365A1 (en) * | 2009-07-20 | 2012-07-26 | Precitec Kg | Laser processing head and method for compensating for the change in focus position in a laser processing head
CN102752663A (en) * | 2012-07-18 | 2012-10-24 | 青岛海信信芯科技有限公司 | Program searching device, television and program searching method
CN103279478A (en) * | 2013-04-19 | 2013-09-04 | 国家电网公司 | Method for extracting features based on distributed mutual information documents

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
吕雪骥: "基于云计算平台的智能推荐系统研究", 《中国优秀硕士学位论文全文数据库信息科技辑》 *
徐军等: "使用机器学习方法进行新闻的情感自动分类", 《中文信息学报》 *
阚庭明: "城市轨道交通乘客信息系统关键技术研究", 《中国博士学位论文全文数据库工程科技II辑》 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN104765875A (en) * | 2015-04-24 | 2015-07-08 | 海南易建科技股份有限公司 | Distributed processing method and system for passenger behavior data
CN104765875B (en) * | 2015-04-24 | 2016-09-28 | 海南易建科技股份有限公司 | A kind of passenger's behavior data distributed approach and system

Similar Documents

Publication Publication Date Title
US9449271B2 (en) Classifying resources using a deep network
Jiang et al. Fast and accurate content-based semantic search in 100m internet videos
CN103020159A (en) Method and device for news presentation facing events
CN102708096A (en) Network intelligence public sentiment monitoring system based on semantics and work method thereof
CN104504150A (en) News public opinion monitoring system
US20160080476A1 (en) Meme discovery system
Yi et al. Flight delay classification prediction based on stacking algorithm
Psomakelis et al. Big IoT and social networking data for smart cities: Algorithmic improvements on Big Data Analysis in the context of RADICAL city applications
Alhumoud Twitter analysis for intelligent transportation
CN104536830A (en) KNN text classification method based on MapReduce
Xia et al. A parallel grid-search-based SVM optimization algorithm on Spark for passenger hotspot prediction
CN111414471B (en) Method and device for outputting information
CN106204103A (en) The method of similar users found by a kind of moving advertising platform
Ahmed et al. Real-time traffic congestion information from tweets using supervised and unsupervised machine learning techniques
CN113742496B (en) Electric power knowledge learning system and method based on heterogeneous resource fusion
CN103268346A (en) Semi-supervised classification method and semi-supervised classification system
CN111428502A (en) Named entity labeling method for military corpus
Chen et al. Two-stage solar flare forecasting based on convolutional neural networks
CN105117466A (en) Internet information screening system and method
CN113222109A (en) Internet of things edge algorithm based on multi-source heterogeneous data aggregation technology
Yang et al. Microblog sentiment analysis algorithm research and implementation based on classification
CN104462544A (en) Passengers' demand oriented metro/high-speed rail vehicle-mounted server video updating method
CN111026940A (en) Network public opinion and risk information monitoring system and electronic equipment for power grid electromagnetic environment
CN105608106A (en) Intelligent terminal-oriented public opinion analysis method
CN115759253A (en) Power grid operation and maintenance knowledge map construction method and system

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20150325