CN104462544A

CN104462544A - Passengers' demand oriented metro/high-speed rail vehicle-mounted server video updating method

Info

Publication number: CN104462544A
Application number: CN201410820591.7A
Authority: CN
Inventors: 王勇; 康强; 王志刚; 赵晓光; 张元庆
Original assignee: DALIAN SEASKY AUTOMATION Co Ltd
Current assignee: DALIAN SEASKY AUTOMATION Co Ltd
Priority date: 2014-12-24
Filing date: 2014-12-24
Publication date: 2015-03-25

Abstract

The invention belongs to the field of information technology, relates to text statistics, text classification and distributed computing, and provides a passenger-demand-oriented video updating method for metro/high-speed rail on-board servers. The method comprises two stages. The first stage counts high-frequency text: a parallel distributed statistical method is applied to passengers' daily search records to identify the search entries that occur most frequently, and these entries are regarded as popular information that passengers currently want. The second stage classifies the high-frequency search entries: a large number of commonly used search keywords are first labeled to build a sample bank, a text classification model is then built on that sample bank, and finally the high-frequency search entries obtained by the statistics are fed to the classification model for classification. The purpose of the classification is to facilitate the uploading of videos and their browsing and downloading by passengers.

Description

A passenger-demand-oriented video updating method for metro/high-speed rail on-board servers

Technical field

The invention belongs to the field of information technology, relates to text statistics, text classification and distributed computing, and is a passenger-demand-oriented video updating method for metro/high-speed rail on-board servers. The method is implemented in two stages. The first stage counts high-frequency text: a parallel distributed statistical method is applied to passengers' daily search records to count the search entries that occur most frequently, and these entries are regarded as information of a certain popularity that passengers currently want. The second stage classifies the high-frequency search entries: a large number of commonly used search keywords are first labeled to build a sample bank, a text classification model is then built on this sample bank, and finally the high-frequency search entries obtained by the statistics are fed to the classification model for classification. The purpose of the classification is to facilitate the uploading of videos and their browsing and downloading by passengers. The method can improve the service quality of rail transit and ensure, as far as possible, that passengers can obtain the videos they want from the on-board server. It can also be used to update other data on the on-board server.

Background art

In recent years rail transit has become one of the main means of transport in modern cities, and its role in urban modernization and intelligent development is increasingly important. A user-friendly rail transit system depends on a complete service system, and providing passengers with on-board video information that can be viewed and downloaded free of charge inside the train is an important means of improving the service quality of rail transit. Video information cannot, however, be provided blindly: what is uploaded must take into account the needs of passengers and the popularity of the information itself.

How can the popularity and timeliness of the video, audio and data information provided be ensured so that passengers are satisfied? Surveys are clearly infeasible, because the workload is enormous and the coverage of a survey is hard to guarantee. The present invention holds that the problem can be reduced to one of text mining: relevant information is obtained from passengers' historical search records, and text mining is performed on this search record information to find passengers' actual demands. At present there is no research, at home or abroad, on updating the video information of on-board servers, but many scholars study text mining, and the most typical text mining research is that on the Internet. Information on the Internet is vast and changes rapidly, and people urgently need to find resources and knowledge on the web quickly and efficiently, to improve the efficiency of finding and using information on the web, to improve and organize web retrieval results, to provide personalized services, and so on. All of these depend on mining the text content of the web (R. Etemadi, N. Moghaddam. An approach in web content mining for clustering web pages. Proceedings of the 5th International Conference on Digital Information Management. Thunder Bay, ON: IEEE, 2010: 279-284). Beyond the Internet, text mining is also widely used in stock and securities markets: by summarizing relevant political and economic news, mining the keywords in such news that relate to rises and falls of the stock market, and counting how the stock market changes when such keywords appear, changes in stock market quotations can be predicted (B. Wuthrich, D. Permunetilleke, S. Leung, et al. Daily prediction of major stock indices from textual WWW data. New York: Proceedings of the 4th International Conference on Knowledge Discovery, 1998). In recent years, domestic scholars have also applied text mining to the field of traditional Chinese medicine, mining the dosages of Chinese medicines and the associated dosage strategies (Ji Hangyu, Jiao Yongzheng, Lian Fengmei, et al. Text mining research on the dosage strategies of "Treatise on Febrile Diseases" and "Synopsis of the Golden Chamber". Chinese traditional Chinese medicine magazine, 27(1): 16-19, 2012). It can be seen that text mining is an effective method for mining Chinese text and unstructured data.

One very important problem in text mining is the volume of text: when the volume is huge, high-precision text mining is difficult. Because supervised learning and mining are hard to apply to such text, semi-supervised mining methods are used in most cases, and these inevitably lower the achievable mining precision. The passenger search record text studied in the present invention is certainly huge in volume, and existing methods can hardly mine it directly while guaranteeing high precision.

Summary of the invention

The technical problem to be solved by the present invention is the updating of videos on metro/high-speed rail on-board servers. To solve this problem, passenger demand must be analyzed. Because passengers are highly mobile and very numerous, their demand is hard to learn directly. The present invention holds that passenger demand can be determined by text mining of passengers' search and browsing records on the network. Since the volume of passenger search record text is huge, the present invention proposes a two-stage text mining approach: the first stage counts how often the different search entries occur in the text, so as to obtain the high-frequency search entries, and the second stage classifies the high-frequency entries.

The overall implementation flow of the technical solution of the present invention is shown in Fig. 1. The specific steps are as follows:

1. Obtain the required text, which can be obtained from users' search history records on search engines and video websites;

2. On the Hadoop platform, use a distributed statistical method to count the words that occur most frequently in the text, so as to obtain the high-frequency vocabulary;

3. Classify the screened high-frequency vocabulary with a naive Bayes classifier;

4. Review the classification results of the classifier in step 3, manually correct the very few entries that were misclassified, and upload the video files corresponding to all high-frequency entries to the on-board server.

The effects and benefits of the present invention are as follows:

First, the method updates the videos on the on-board server in a targeted way, so as to meet passenger demand as far as possible. Second, the method has high computational efficiency, mainly because a distributed statistical algorithm on the Hadoop platform is used to count the words that users search for most often; the Hadoop platform is capable of parallel and distributed processing, which improves the efficiency of the algorithm. Finally, the method is more reasonable than the traditional approach of directly classifying the vocabulary in the text library. The search history on the network is so large that classifying the mass of text vocabulary directly is very difficult: a supervised classification method is hardly feasible, and a semi-supervised classification method inevitably has a large adverse effect on classification quality. The present invention instead first counts the high-frequency vocabulary and ignores words that have no popularity and attract no attention, which reduces the vocabulary to a few hundred items. Supervised classification can then be carried out on the high-frequency vocabulary. The method is reasonably designed and its computational cost is low.

Brief description of the drawings

Fig. 1 is the overall flow chart for implementing the present invention.

Fig. 2 is a schematic diagram of the carriage structure of a rail train.

Fig. 3 shows the execution result of the Map function used when counting high-frequency vocabulary.

Fig. 4 shows the execution result of the Reduce function used when counting high-frequency vocabulary.

Fig. 5 is the flow chart of entry classification based on the naive Bayes classifier.

Detailed description of the embodiments

For a better understanding of the technical solution of the present invention, it is described in detail below with reference to the accompanying drawings. Fig. 2 is a schematic diagram of one carriage of a train. Two servers are installed in the carriage, each with one built-in 500 GB hard disk, and as technology advances the capacity of the built-in hard disk may grow further. The two servers therefore have a large storage capacity and can store a large amount of video information for passengers to browse and download free of charge while on board. Understanding the real demand of passengers and ensuring the popularity and timeliness of the video information are very important for improving the service quality of rail transit. To this end, the invention provides a passenger-demand-oriented video updating method for metro/high-speed rail on-board servers. With reference to Fig. 1, the specific steps of the present invention are as follows:

Step 1: Obtain the required search record text. This text can be obtained from users' search history records on search engines and video websites, and its content consists of search entries.

Step 2: High-frequency term statistics based on the MapReduce framework on the Hadoop platform

In Hadoop, a MapReduce job is divided into two stages: the Map stage and the Reduce stage. The two stages are represented by two functions, the Map function and the Reduce function.

Map stage: The search record text is split into several parts, each of which is the input of one Map function. As shown in Fig. 3, each Map function traverses all entries in the text it receives and marks each occurrence of an entry once. The final output of a Map function takes the form of key-value pairs such as <temptation, 1>, where "temptation" is the key and 1 is the value. Each Map function thus outputs one group of key-value pairs.

Reduce stage: Hadoop passes the set of values output by the Map stage that share the same key to a Reduce function together. The Reduce function receives an input of the form <key, value set>, i.e., the key/value-set form <red sorghum → 1, 1, 1, 1> shown in Fig. 4. The Reduce function processes the value set and again outputs a key-value pair, such as <red sorghum, 4> in Fig. 4, which means that four passengers searched for "red sorghum".
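The Map and Reduce stages described above can be illustrated with a minimal single-process sketch. It does not use Hadoop itself; the function names `map_entries`, `shuffle` and `reduce_entry` and the sample search log are assumptions made for illustration, and only the emit-then-sum logic follows the text.

```python
from collections import defaultdict
from itertools import chain

def map_entries(part):
    """Map: emit an (entry, 1) pair for every search entry in one text part."""
    return [(entry, 1) for entry in part]

def shuffle(pairs):
    """Group values by key, as the framework does between Map and Reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_entry(key, values):
    """Reduce: collapse the value set of one key into a single count."""
    return key, sum(values)

# Hypothetical search log, already split into two parts (one per Map task).
parts = [
    ["red sorghum", "temptation", "red sorghum", "transformer"],
    ["red sorghum", "transformer", "red sorghum", "transformer"],
]

mapped = chain.from_iterable(map_entries(p) for p in parts)
counts = dict(reduce_entry(k, v) for k, v in shuffle(mapped).items())
print(counts)
```

In a real deployment the grouping step is performed by the framework across machines; here it is a plain dictionary so that the data flow from <entry, 1> pairs to <entry, count> pairs is visible.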

In this way the number of times each entry occurs in the search history is counted with relatively little effort. In Fig. 4, for example, "red sorghum" and "griggles" each occur 4 times, and "transformer" and "acquired immune deficiency syndrome (AIDS) day" each occur 3 times. A sorting algorithm can then be applied to the occurrence counts so that the popular entries with the highest frequencies are selected. How many entries to retain is determined by the storage capacity of the on-board server: if each two-hour film takes about 1.5 GB, a 500 GB hard disk can store more than 320 films (500 / 1.5 ≈ 333). The above is only a simple example; in practice both the text and the number of entries are very large.
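The ranking and capacity reasoning above can be sketched as follows. The helper names `max_videos` and `top_entries` are hypothetical, the count for "temptation" is an illustrative value not given in the text, and the average film size of 1.5 GB is the figure quoted in the description.

```python
import heapq

def max_videos(disk_gb, avg_video_gb):
    """Number of whole videos of average size that fit on one disk."""
    return int(disk_gb // avg_video_gb)

def top_entries(counts, k):
    """Return the k most frequent entries, most frequent first."""
    return heapq.nlargest(k, counts.items(), key=lambda kv: kv[1])

# Counts as in the Fig. 4 example, plus one illustrative low-frequency entry.
counts = {
    "red sorghum": 4,
    "griggles": 4,
    "transformer": 3,
    "AIDS day": 3,
    "temptation": 1,
}

capacity = max_videos(500, 1.5)    # 500 GB disk, about 1.5 GB per film
popular = top_entries(counts, 3)   # keep only the three most searched entries
print(capacity, popular)
```

In practice `k` would be derived from `capacity` rather than fixed at 3, so that the number of retained entries matches what the server can store.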

Step 3: Classify the high-frequency entries obtained in step 2 with a naive Bayes classifier

Naive Bayes classification is a simple probabilistic classification method that is often applied to text mining. Its use requires the independence assumption to hold, i.e., the probability of a lexical item must be independent of its position in the text and of its context. In the high-frequency entry classification involved in the present invention, the probability that a given feature word occurs in an entry has no necessary connection with the position or context of that feature word, so the independence assumption holds. Classifying the high-frequency entries mainly comprises the following steps:

(1) Label entries and build the training sample set:

Entries are labeled in order to build the training samples; labeling an entry means marking the category of an entry used for training. In the present invention the content historically stored on the server is used as the training samples. This has the advantage that the proportion of storage space occupied by each category can be determined directly, and the category of each entry is already known, so no separate labeling is needed. For example, the proportions of the total storage space taken by political videos, sports videos, entertainment videos and educational videos are set in advance.
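As a rough illustration of using the stored content as labeled training data, the sketch below derives both the training pairs and the per-category storage share from a catalogue. The catalogue layout (title, category, size in GB), the titles and the sizes are all assumptions; the description does not specify how the server records its stored videos.

```python
from collections import defaultdict

# Hypothetical catalogue of videos already on the server:
# (title, category, size in GB).
catalogue = [
    ("title A", "entertainment", 1.5),
    ("title B", "entertainment", 1.5),
    ("title C", "sports", 1.0),
    ("title D", "education", 1.0),
]

# Training samples: each stored title is an entry labeled with its category.
training_set = [(title, category) for title, category, _ in catalogue]

def storage_share(items):
    """Fraction of the used storage space occupied by each category."""
    used = defaultdict(float)
    for _, category, size_gb in items:
        used[category] += size_gb
    total = sum(used.values())
    return {category: size / total for category, size in used.items()}

shares = storage_share(catalogue)
print(training_set)
print(shares)
```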

(2) Build the Bayesian classifier model:

Under the precondition that the features in an entry are distributed independently of one another, the naive Bayes classifier can be expressed in the following mathematical form:

$$P(c_i \mid d, \theta) = \frac{P(c_i \mid \theta)\, P(d \mid c_i, \theta)}{P(d \mid \theta)} = \frac{P(c_i \mid \theta) \prod_{j=1}^{m} P(f_j \mid c_i, \theta)^{n_j}}{P(d \mid \theta)} \qquad (1)$$

Here $c_i$ is the class label, $d$ is the search entry, and $\theta$ denotes the parameters related to the prior probability. An entry $d$ consists of several lexical items; $f_j$ denotes a lexical item, $n_j$ is the number of times $f_j$ occurs in $d$, and $m$ is the number of lexical items. For example, the entry "Venezuela intend to China payment of a debt island exposure" can be split into the lexical items "Venezuela, intend to, China, payment of a debt, island, exposure", so that $m = 6$. Some entries, such as film and television titles, may have only one lexical item, for example "fighting secretly".

The denominator $P(d \mid \theta)$ in the above formula is the same constant for every class. Denoting the total number of classes by $C$, $P(d \mid \theta)$ can be expressed as

$$P(d \mid \theta) = \sum_{i=1}^{C} \left( P(c_i \mid \theta) \prod_{j=1}^{m} P(f_j \mid c_i, \theta)^{n_j} \right) \qquad (2)$$

If the class label at which $P(c_i \mid d, \theta)$ reaches its maximum is taken as the target class of entry $d$, it suffices to select the class that maximizes the numerator of formula (1), that is, $c^{*} = \arg\max_{c_i} P(c_i \mid \theta) \prod_{j=1}^{m} P(f_j \mid c_i, \theta)^{n_j}$. This requires estimates of the prior probability $P(c_i \mid \theta)$ and of the likelihood terms $P(f_j \mid c_i, \theta)$. In the present case the training samples are known, so the prior probability and the likelihood function can be determined from the training samples alone, namely

$$P(c_i \mid \theta) = \frac{N_i}{N} \qquad (3)$$

$$P(f_j \mid c_i, \theta) = \frac{1 + n_{i,j}}{m + \sum_{k=1}^{m} n_{i,k}} \qquad (4)$$

Here $N_i$ is the number of entries in the training sample set that belong to class $c_i$, $N$ is the total number of entries in the training sample set, $n_{i,j}$ is the number of times feature $j$ occurs in class $c_i$ in the training sample set, and $m$ is the number of lexical items in entry $d$.
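The two estimates in formulas (3) and (4) can be sketched as below. The training entries, their segmentation into lexical items and the function names are hypothetical; the smoothing follows the description literally, with $m$ taken as the number of lexical items in the entry being classified and the sum in the denominator running over those same items.

```python
from collections import Counter, defaultdict

# Hypothetical labeled training entries: (lexical items, category).
training_set = [
    (["football", "final"], "sports"),
    (["football", "league"], "sports"),
    (["physics", "lecture"], "education"),
]

def estimate_priors(samples):
    """Formula (3): P(c_i) = N_i / N."""
    class_counts = Counter(category for _, category in samples)
    total = len(samples)
    return {category: n / total for category, n in class_counts.items()}

def count_items(samples):
    """n[c][f]: number of times lexical item f occurs in entries of class c."""
    n = defaultdict(Counter)
    for items, category in samples:
        n[category].update(items)
    return n

def likelihood(item, category, entry_items, n):
    """Formula (4): add-one smoothing over the m lexical items of the entry."""
    m = len(entry_items)
    denom = m + sum(n[category][f] for f in entry_items)
    return (1 + n[category][item]) / denom

priors = estimate_priors(training_set)
n = count_items(training_set)
entry = ["football", "lecture"]   # new entry to be scored, m = 2
p = likelihood("football", "sports", entry, n)
print(priors, p)
```

The added 1 in the numerator keeps the likelihood of an unseen lexical item above zero, so a single unfamiliar word cannot drive the product in formula (1) to zero.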

(3) Compare the probabilities that the new entry belongs to each class, and assign the entry to the class with the largest probability.
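A minimal sketch of this decision step is given below. It compares the numerators of formula (1) across classes; working with logarithms is a numerical convenience added here, not something stated in the description, and it leaves the chosen class unchanged because the logarithm is monotonic. The prior and likelihood values are hypothetical parameters of the kind formulas (3) and (4) would produce.

```python
import math
from collections import Counter

def classify(entry_items, priors, likelihood):
    """Return the class maximizing the numerator of formula (1), in log space."""
    n_j = Counter(entry_items)   # occurrences of each lexical item in the entry
    scores = {}
    for category, prior in priors.items():
        score = math.log(prior)
        for item, count in n_j.items():
            score += count * math.log(likelihood[category][item])
        scores[category] = score
    best = max(scores, key=scores.get)
    return best, scores

# Hypothetical parameters for two classes and two lexical items.
priors = {"sports": 2 / 3, "education": 1 / 3}
likelihood = {
    "sports": {"football": 0.75, "lecture": 0.25},
    "education": {"football": 1 / 3, "lecture": 2 / 3},
}

label, scores = classify(["football", "lecture"], priors, likelihood)
print(label, {c: math.exp(s) for c, s in scores.items()})
```

Dividing each exponentiated score by their sum would give the normalized posterior of formula (2), but the comparison itself does not need it.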

Step 4: Review and organize

Review the classification results of the classifier in step 3, manually correct the very few entries that were misclassified, and upload the video files corresponding to all high-frequency entries to the on-board server by category. This completes the video update process for the on-board server.

Claims

1. A passenger-demand-oriented video updating method for metro/high-speed rail on-board servers, characterized by comprising the following steps:

Step 1: obtain the required search record text, which is obtained from users' search history records on search engines and video websites, the content of the text consisting of search entries;

Step 2: on the Hadoop platform, count high-frequency entries based on the MapReduce framework, in two stages:

Map stage: the search record text is split into several parts, each serving as the input of a Map function; each Map function traverses all entries in the text it receives and marks each occurrence of an entry once; the final output of the Map function is a group of key-value pairs;

Reduce stage: Hadoop passes the set of values output by the Map stage that share the same key to a Reduce function together; the Reduce function receives an input of the form <key, value set>, processes the value set, and outputs a key-value pair;

the number of times each entry occurs in the search history records is thereby counted; a sorting algorithm is then applied to the occurrence counts of the search entries, and the popular entries with the highest frequencies are selected;

Step 3: classify the high-frequency entries obtained in step 2 with a naive Bayes classifier;

first, label entries to build the training sample set;

second, build the Bayesian classifier model:

under the precondition that the features in an entry are distributed independently of one another, the naive Bayes classifier is expressed in the following form:

$$P(c_i \mid d, \theta) = \frac{P(c_i \mid \theta) \prod_{j=1}^{m} P(f_j \mid c_i, \theta)^{n_j}}{P(d \mid \theta)} \qquad (1)$$

where $c_i$ is the class label, $d$ is the search entry, and $\theta$ denotes the parameters related to the prior probability; an entry $d$ consists of several lexical items, $f_j$ denotes a lexical item, and $m$ is the number of lexical items;

the denominator $P(d \mid \theta)$ in the above formula is the same constant for every class, so it suffices to select the class that maximizes the numerator of formula (1) as the target class; in the present case the training samples are known, and the prior probability $P(c_i \mid \theta)$ and the likelihood terms $P(f_j \mid c_i, \theta)$ are determined from the training samples, namely

$$P(c_i \mid \theta) = \frac{N_i}{N} \qquad (2)$$

$$P(f_j \mid c_i, \theta) = \frac{1 + n_{i,j}}{m + \sum_{k=1}^{m} n_{i,k}} \qquad (3)$$

where $N_i$ is the number of entries in the training sample set that belong to class $c_i$, $N$ is the total number of entries in the training sample set, $n_{i,j}$ is the number of times feature $j$ occurs in class $c_i$ in the training sample set, and $m$ is the number of lexical items in entry $d$;

finally, compare the probabilities that the new entry belongs to each class, and assign the entry to the class with the largest probability;

Step 4: review the classification results of the classifier in step 3, manually correct the very few entries that were misclassified, and upload the video files corresponding to all high-frequency entries to the on-board server by category, which completes the video update process for the on-board server.