WO2015030637A1 - Apparatus and method for processing data streams in a communication network - Google Patents
- Publication number
- WO2015030637A1 (PCT/SE2013/050994)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- data
- graphical representation
- value
- relating
- vertices
- Prior art date
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L43/00—Arrangements for monitoring or testing data switching networks
- H04L43/04—Processing captured monitoring data, e.g. for logfile generation
- H04L43/045—Processing captured monitoring data, e.g. for logfile generation for graphical visualisation of monitoring data
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L43/00—Arrangements for monitoring or testing data switching networks
- H04L43/16—Threshold monitoring
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04W—WIRELESS COMMUNICATION NETWORKS
- H04W28/00—Network traffic management; Network resource management
- H04W28/02—Traffic management, e.g. flow control or congestion control
- H04W28/0284—Traffic management, e.g. flow control or congestion control detecting congestion or overload during communication
Definitions
- the present invention relates to an apparatus and method for processing data streams in a communication network, for example a telecommunications network.
- QoS Quality of Service
- VIP valued
- KPIs Key Performance Indicators
- the method comprises the steps of extracting data from the data stream, wherein the data is extracted for a particular time window of a sliding time window.
- the extracted data is converted into a format suitable for graphical representation, and then a graphical representation of the converted extracted data is generated.
- An estimated value is determined of at least one structural property of the graphical representation of the data.
- the estimated value of the at least one structural property is compared with a threshold value, and a change condition is reported based on the outcome of the comparison step.
- a distributed processing architecture for processing a data stream of a
- the distributed processing architecture comprises a first processing unit adapted to extract data from the data stream, wherein the data is extracted for a particular time window of a sliding time window.
- a second processing unit is adapted to receive the extracted data from the first processing unit and convert the extracted data into a format suitable for graphical representation.
- a third processing unit is adapted to generate a graphical representation of the converted extracted data.
- a fourth processing unit is adapted to determine an estimated value of at least one structural property of the graphical representation of the data, and further adapted to compare the estimated value of the at least one structural property with a threshold value, and report a change condition based on the outcome of the comparison.
- Figure 1 shows a method according to an embodiment of the invention
- Figure 2 shows an apparatus according to an embodiment of the invention
- Figure 3 shows a method according to another embodiment of the invention.
- Figure 4 shows a method according to another embodiment of the invention.
- Figure 5 shows an apparatus according to an embodiment of the invention
- Figure 6 shows a typical application of an embodiment of the invention
- Figures 7a to 7j show a further example of a typical application of an
- the embodiments of the invention enable data streams to be processed in a real time environment to enable a network node or network operator to obtain detailed network information dynamically, or in real time, such that the detailed network information can be used for various tasks, for example catering for the needs of customers (such as valued customers), or for upgrading their loyalty offerings, or other location based quality of service improvements. It is noted that the results of the data processing can be used for
- FIG. 1 shows a method according to an embodiment of the invention, for processing a data stream of a communication network in a distributed
- the method comprises the steps of extracting data from the data stream, wherein the data is extracted for a particular time window of a sliding time window, step 101.
- the extracted data is converted into a format suitable for graphical representation, step 103. It is noted that the exact type of conversion will depend on a particular application (and hence the type of data extracted), and also the type of subsequent processing being performed.
- the extracted data stream may be converted into a suitable format for generating a graphical representation of the data, or for making a readable file for a specific
- a graphical representation of the converted extracted data is then generated, step 105, and an estimated value of at least one structural property of the graphical representation of the data is determined, step 107.
- the estimated value of the at least one structural property is compared with a threshold value, step 109, and a change condition is reported, step 111, based on the outcome of the comparison.
- a disorientation can comprise, for example, any form of abnormal condition or situation, or the presence of loyal customers near to a highly transacted cell tower.
- Disorientation in a social network environment can comprise, for example, an abnormal condition such as the spread of some unwanted news very quickly, which could affect the integrity or security of a country or society, or some individual or company's reputation. It is noted that other forms of disorientation are intended to be embraced by the invention, as defined by the appended claims.
- the method steps 101 to 111 described above may be performed in a plurality of different processing units of the distributed processing architecture. For example, according to one embodiment each of the steps is performed in a separate processing unit. According to another example, steps 107 to 111 are performed in the same processing unit, while the other steps are performed in separate processing units.
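- As a minimal sketch of how steps 101 to 111 chain together, the following example models each step as a plain function. The record fields (`ts`, `src`, `dst`), the choice of average degree as the structural property, and the threshold of 1.5 are all illustrative assumptions, not details taken from the application.

```python
from collections import Counter

def extract_window(stream, start, end):
    """Step 101: keep records whose timestamp falls in [start, end)."""
    return [r for r in stream if start <= r["ts"] < end]

def convert(records):
    """Step 103: turn records into weighted edges, (src, dst) -> count."""
    return Counter((r["src"], r["dst"]) for r in records)

def build_graph(edges):
    """Step 105: build an undirected adjacency map from the edge keys."""
    graph = {}
    for (u, v) in edges:
        graph.setdefault(u, set()).add(v)
        graph.setdefault(v, set()).add(u)
    return graph

def average_degree(graph):
    """Step 107: one possible structural property (assumed for this sketch)."""
    if not graph:
        return 0.0
    return sum(len(nbrs) for nbrs in graph.values()) / len(graph)

def check(value, threshold):
    """Steps 109 and 111: compare with the threshold and report the outcome."""
    return "CHANGE" if value > threshold else "OK"

# Hypothetical records; the last one falls outside the window [0, 5).
stream = [
    {"ts": 1, "src": "a", "dst": "b"},
    {"ts": 2, "src": "a", "dst": "c"},
    {"ts": 3, "src": "b", "dst": "c"},
    {"ts": 9, "src": "c", "dst": "d"},
]
window = extract_window(stream, 0, 5)
graph = build_graph(convert(window))
status = check(average_degree(graph), threshold=1.5)
```

- In a distributed deployment each function would run in its own processing unit, or several (for example steps 107 to 111) would share one; here they simply run in sequence.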
- the embodiments of the invention enable large data sets to be handled dynamically in real time by the manner in which different processing units act on the data extracted from the sliding time window in a sequential manner, but also due to the manner in which the extracted data is represented graphically, such that at least one structural property can be determined and then compared with a threshold value in order to trigger a change condition (for example by generating an alarm condition), which can be used to automatically alert a network operator that action may be needed, and/or automatically change one or more parameters of the communication network.
- conversations (such as voice calls or SMS messages) in a mobile telecommunications network can be represented as a large stream of edges in a social graph representation.
- Such streams are typically very large, because of the large amount of underlying activity in such networks.
- the embodiments of the invention provide a graphical representation or visualization of data streams in a distributed environment, such that processing units and/or network operators can process the large data streams in a real-time manner.
- FIG. 2 shows a distributed processing architecture 200 for processing a data stream 201 of a communications network, according to an embodiment of the invention.
- the distributed processing architecture 200 comprises a first processing unit 203 adapted to extract data from the data stream 201.
- the data is extracted for a particular time window of a sliding time window.
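- One way to picture the sliding time window is a buffer that evicts records once they are older than the window width. The sketch below is an assumption about how such a mechanism could look; the class name, the one-hour width and the record labels are hypothetical.

```python
from collections import deque

class SlidingWindow:
    """Keeps only the records whose timestamp lies in (now - width, now]."""

    def __init__(self, width):
        self.width = width
        self.items = deque()

    def add(self, timestamp, record):
        """Append a record (timestamps must be non-decreasing) and return the window."""
        self.items.append((timestamp, record))
        # Evict everything that has slid out of the window.
        while self.items and self.items[0][0] <= timestamp - self.width:
            self.items.popleft()
        return [rec for _, rec in self.items]

# Hypothetical one-hour window over timestamps given in seconds.
window = SlidingWindow(width=3600)
window.add(0, "call-1")
window.add(1800, "call-2")
current = window.add(3600, "call-3")
```

- After the third record arrives, `call-1` has left the window, so only the two most recent records are passed on to the next processing unit.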
- a second processing unit 205 is adapted to receive the extracted data from the first processing unit, and convert the extracted data into a format suitable for graphical representation.
- a third processing unit 207 is adapted to generate a graphical representation of the converted extracted data. As will be explained later in the application, this may involve processing by the third processing unit 207 alone, and/or processing by another processing entity (not shown).
- a fourth processing unit 209 is adapted to determine an estimated value of at least one structural property of the graphical representation of the data.
- the fourth processing unit 209 is further adapted to compare the estimated value of the at least one structural property with a threshold value, and report a change condition 211 based on the outcome of the comparison.
- processing tasks described above may be combined for processing by another processing unit of the distributed processing architecture, and/or separated for processing by separate processing units.
- steps performed by the fourth processing unit 209 may be separated and processed by different processing units.
- Each of the plurality of processing units may process data in parallel based on load, and each processing unit may be split into a plurality of different processing units to execute a task.
- the data streams being processed may comprise customer or cell tower data, for example.
- the embodiments enable the data streams to be processed such that the method and apparatus can allow change conditions (or alarm conditions) to be detected and reported automatically, while also providing the data in a manner such that network operators can visualize the findings from real-time streams, thereby providing an opportunity for the operators to provide better services to their valued customers.
- this may be provided using a topology that comprises a distributed Storm framework that consists of various built-in components that are configured to accept the data stream, process it and visualize the same in the form of graphs.
- the data from the data stream is extracted by means of a sliding time window mechanism and passed on to the subsequent components present in the topology. The data thus moves through
- a Storm framework provides a set of general primitives for performing distributed real-time computation, and can be used for "stream processing", by processing regular messages and updating databases on a real-time basis.
- a distributed processing architecture that is used for continuous computation, whereby a continuous query is performed on the data streams, with the results being streamed out to users as they are computed.
- Storm terminology will be familiar to a person skilled in the art, and includes terminology such as Streams, Spouts, Bolts, Tasks, Workers, Stream Groupings, and Topologies.
- embodiments of the invention may process the data by managing clusters of queues and workers. Such an example involves sequential processing which processes data through the techniques of managing a plurality of clusters of systems with queues and workers as a processing node.
- the step of generating a graphical representation of the extracted data may comprise the step of interfacing with a graphical visualization unit to generate the graphical representation of the data, for example a graphical visualization unit that is adapted to process a Gephi ® application, as will be described later in the application.
- Gephi is an open-source network analysis and visualization software package written in Java. The goal of this tool is to help data analysts to make hypotheses, intuitively discover patterns, and isolate structure singularities or faults during data sourcing. It is a complementary tool to traditional statistics, as visual thinking with interactive interfaces is now recognized to facilitate reasoning.
- the main profit from this fast graph visualization engine is to speed up understanding and pattern discovery in large graphs.
- a graphical representation of the data comprises a set of vertices (V) and a set of edges (E) between the set of vertices, wherein an edge (E) connects a first vertex (V_i) with a second vertex (V_j).
- V vertices
- E edges
- a first example of a structural property comprises an average path length value, l_G, relating to the average number of steps along the shortest paths for all possible pairs of first and second vertices.
- the average path length value l_G provides a measure of the efficiency of information or mass transport on a network.
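- A minimal sketch of how l_G could be computed for an unweighted, undirected graph follows, using a breadth-first search from every vertex. Skipping unreachable pairs is an assumption of this sketch; the application does not say how disconnected graphs are handled.

```python
from collections import deque

def bfs_distances(graph, source):
    """Shortest hop counts from source to every reachable vertex."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in graph[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def average_path_length(graph):
    """Mean shortest-path length over all ordered pairs of distinct, connected vertices."""
    total, pairs = 0, 0
    for source in graph:
        for target, d in bfs_distances(graph, source).items():
            if target != source:
                total += d
                pairs += 1
    return total / pairs if pairs else 0.0

# Hypothetical three-vertex path a - b - c.
path = {"a": {"b"}, "b": {"a", "c"}, "c": {"b"}}
l_g = average_path_length(path)
```

- For the path a - b - c the pairwise distances are 1, 2 and 1, so l_G is 4/3.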
- a second example of a structural property comprises a connected component count value, relating to a sub-graphical representation of the graphical representation of the data, in which any two vertices are connected to each other by paths, and which is connected to no additional vertices in the main graphical representation.
- a connected component of an undirected graph is a sub-graph in which any two vertices are connected to each other by paths, and which is connected to no additional vertices in the main graph.
- the number of connected components is an important topological invariant of a graph.
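- The connected component count can be sketched as a repeated graph traversal: each traversal started from a vertex not yet seen discovers exactly one component. The adjacency-map representation and the example graph are assumptions made for illustration.

```python
def connected_component_count(graph):
    """Count connected components of an undirected graph given as {vertex: set(neighbours)}."""
    seen = set()
    count = 0
    for start in graph:
        if start in seen:
            continue
        # A new, unseen vertex starts a new component.
        count += 1
        seen.add(start)
        stack = [start]
        while stack:
            u = stack.pop()
            for v in graph[u]:
                if v not in seen:
                    seen.add(v)
                    stack.append(v)
    return count

# Hypothetical graph: two pairs and one isolated vertex.
g = {"a": {"b"}, "b": {"a"}, "c": {"d"}, "d": {"c"}, "e": set()}
components = connected_component_count(g)
```

- Here the pairs {a, b} and {c, d} and the isolated vertex e give three components.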
- a third example of a structural property comprises an average clustering coefficient value, each clustering coefficient value providing an indication regarding the degree to which vertices (or nodes) in a graph tend to cluster together.
- Each clustering coefficient is therefore a measure of the degree to which nodes in a graph tend to cluster together.
- nodes tend to create tightly knit groups characterised by a relatively high density of ties; this likelihood tends to be greater than the average probability of a tie randomly established between two nodes.
- This property is calculated for the generated graphical representation of the data, for example using a Gephi toolkit.
- the neighbourhood N_i for a vertex v_i is defined as its immediately connected neighbours as follows: $N_i = \{ v_j : e_{ij} \in E \lor e_{ji} \in E \}$, where k_i is defined as the number of vertices, $|N_i|$, in the neighbourhood N_i.
- the clustering coefficient for the whole network, AC, is given as the average of the clustering coefficients C_i of all the n vertices: $AC = \frac{1}{n} \sum_{i=1}^{n} C_i$
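- A minimal sketch of the average clustering coefficient for an undirected graph is given below. It assumes the commonly used local coefficient, namely twice the number of links among a vertex's neighbours divided by k(k - 1), and assigns 0 to vertices with fewer than two neighbours; both conventions are assumptions rather than details recovered from the application.

```python
from itertools import combinations

def local_clustering(graph, v):
    """Fraction of pairs of neighbours of v that are themselves linked."""
    nbrs = graph[v]
    k = len(nbrs)
    if k < 2:
        return 0.0  # assumed convention for vertices with fewer than two neighbours
    links = sum(1 for a, b in combinations(nbrs, 2) if b in graph[a])
    return 2.0 * links / (k * (k - 1))

def average_clustering(graph):
    """Average of the local coefficients over all vertices."""
    if not graph:
        return 0.0
    return sum(local_clustering(graph, v) for v in graph) / len(graph)

# Hypothetical graph: triangle a-b-c with a pendant vertex d attached to c.
g = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"}, "d": {"c"}}
ac = average_clustering(g)
```

- In this example the local coefficients are 1, 1, 1/3 and 0, so the average is 7/12.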
- a fourth example of a structural property comprises an average degree value, relating to the number of edges in a set of edges E in comparison to the number of vertices in the set of vertices V.
- a fifth example of a structural property comprises a graph density value, relating to a measure of how many edges are in a set of edges E compared to the maximum possible number of edges between vertices in the set of vertices V.
- G (V, E) measures how many edges are in set E compared to the maximum possible number of edges between vertices in set V.
- a directed graph can have at most
- a sixth example of a structural property is a modularity value, relating to a measure of the strength of division of a graph into modules.
- Modularity is one measure of the structure of graphs. It is used to measure the strength of division of a graph into modules (also called groups, clusters or communities).
- Graphs with high modularity have dense connections between the nodes within
- a formulation of the modularity is as follows. Define S_ir to be 1 if vertex i belongs to group r and zero otherwise. Then
- a seventh example of a structural property is an average weighted degree value, relating to an average of the sum of weights of the edges of the nodes. This structural property possesses a higher estimated value obtained through regression analysis, and hence serves as a discriminant to detect the possible occurrence of disorientations.
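- The degree-based properties among the examples above (average degree, graph density and average weighted degree) can be sketched together. The formulas assume a simple graph without self-loops; for the directed case the sketch averages out-degree, which is an assumption, and the tower identifiers and call counts are hypothetical.

```python
def degree_properties(vertices, weighted_edges, directed=False):
    """Return average degree, average weighted degree and density.

    weighted_edges maps (u, v) -> weight, with one entry per edge.
    """
    n = len(vertices)
    m = len(weighted_edges)
    total_weight = sum(weighted_edges.values())
    # An undirected edge contributes to the degree of both of its end points.
    ends = 1 if directed else 2
    # Maximum number of edges: n(n-1) if directed, n(n-1)/2 if undirected.
    max_edges = n * (n - 1) if directed else n * (n - 1) / 2
    return {
        "average_degree": ends * m / n if n else 0.0,
        "average_weighted_degree": ends * total_weight / n if n else 0.0,
        "density": m / max_edges if max_edges else 0.0,
    }

# Hypothetical cell-tower graph; weights are calls shared between towers.
props = degree_properties(
    {"t1", "t2", "t3", "t4"},
    {("t1", "t2"): 5, ("t2", "t3"): 1, ("t3", "t4"): 2},
)
```

- With four vertices and three edges of total weight 8, the sketch yields an average degree of 1.5, an average weighted degree of 4.0 and a density of 0.5.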
- an estimated value of a structural property of the graphical representation of the data may comprise any one or more of the examples described above.
- estimated values relating to two or more structural properties are combined to provide a single aggregated estimated value, the single aggregated estimated value being compared with the threshold value.
- the system can be configured to select which combination of structural properties would be best suited to monitor a particular aspect in the communications network, with the estimated values for the selected structural properties then being aggregated or combined into one single estimated value.
- the single estimated value (representing a plurality of separate estimated values for the various structural properties) is then compared during use with a single threshold value, in order to detect a change condition or alarm condition relating to the communications network being monitored.
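- The aggregation of several structural properties into a single estimated value could, for instance, take the form of a linear combination whose coefficients come from a regression fitted on historical data. The sketch below assumes that form; the coefficient values, property values and threshold are hypothetical.

```python
def aggregate(properties, coefficients, intercept=0.0):
    """Combine the selected structural-property estimates into one value."""
    return intercept + sum(
        coefficients[name] * properties[name] for name in coefficients
    )

def change_condition(properties, coefficients, threshold, intercept=0.0):
    """Return the aggregated value and whether it exceeds the single threshold."""
    value = aggregate(properties, coefficients, intercept)
    return value, value > threshold

# Hypothetical regression coefficients; only the selected properties are listed.
coefficients = {"average_weighted_degree": 0.5, "density": 2.0}
observed = {"average_weighted_degree": 4.0, "density": 0.5, "modularity": 0.3}
value, alarm = change_condition(observed, coefficients, threshold=2.5)
```

- Properties without a coefficient (modularity in this example) are ignored, which mirrors the idea of selecting the combination best suited to the aspect being monitored.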
- the threshold value may itself have been formed using historical data (either from an initialization phase or on-the-fly during use), whereby estimated values for similar or the same structural properties are aggregated to form the threshold value.
- a single estimated value representing at least one structural property is compared with a threshold value, which itself represents a corresponding at least one structural property.
- an estimated value of a particular structural property may be compared with a respective threshold value for that respective structural property.
- each estimated value relating to a structural property is compared with its own threshold value, rather than aggregating them first as described in the section above.
- an estimated value of a single structural property alone can be sufficient to indicate a disorientation for generating a change condition (or alarm).
- the threshold value may be determined during an initialization phase of operation.
- the initialization phase of operation may comprise the steps of:
- the method further comprises the step of updating the threshold value during use, by periodically performing the steps outlined above using more recent historical data, thereby adjusting the threshold value dynamically during use.
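- The application does not state how the threshold is derived from historical data. Purely as an assumed illustration, the sketch below sets the threshold to the mean of recent estimated values plus k standard deviations, and keeps a bounded history so that the threshold adjusts during use.

```python
from collections import deque
from statistics import mean, pstdev

class DynamicThreshold:
    """Threshold = mean + k * standard deviation of a bounded, non-empty history."""

    def __init__(self, history, k=2.0, maxlen=24):
        self.values = deque(history, maxlen=maxlen)
        self.k = k

    @property
    def value(self):
        return mean(self.values) + self.k * pstdev(self.values)

    def exceeded(self, estimate):
        """Compare with the current threshold, then fold the estimate into the history."""
        alarm = estimate > self.value
        self.values.append(estimate)
        return alarm

# Hypothetical estimated values from an initialization phase.
threshold = DynamicThreshold([10, 12, 11, 9, 13], k=2.0)
initial = threshold.value
normal = threshold.exceeded(12)
abnormal = threshold.exceeded(20)
```

- Whether an abnormal estimate should be folded back into the history is a design choice; excluding it would stop a single spike from raising later thresholds.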
- an initialization phase or offline phase
- an online phase
- a method according to an embodiment of the invention describes the steps performed during an offline or initialization stage, whereby historical data or training data is used to determine or fix an initial threshold value (or values).
- the method comprises the step of fixing the threshold value, step 301.
- This may comprise, during a first phase of study, processing for fixed time slots location specific data that has been stored offline in a database.
- data is retrieved through a sliding time window.
- the data stored offline may comprise test data or historical data from the past.
- the time slots used during this offline mode may correspond to similar time slots as those used during online operation, although it is noted that data from subsequent time slots may also be used for the calculations performed during the offline or initialization phase.
- the training data set or historical data may contain, for example, details of the earlier movements of customers and the related transactions (for example call detail records such as SMS, voice calls, data usage, etc. ).
- This data is processed in step 305, for example converted into a format suitable for graphical representation, and represented in the form of a graph, step 307.
- One or more structural properties are chosen to detect changes, and the at least one structural property is obtained in step 309. Therefore, during the set-up phase, the system is effectively being configured to determine which one or more structural properties are going to be used to provide the comparisons which will later be made during the online or real time analysis, for example based on which one or more structural properties have previously led to a disorientation at a specific location. It is noted that which structural properties to select will depend upon a particular application, based on which structural properties are more useful than others once the graphical representation is analysed.
- the obtained structural property (or properties) are compared with the threshold value (or values) in order to detect possible occurrences of changes in the system, and any changes are reported in step 311.
- a threshold value may be determined on-the-fly during online or real-time implementation.
- the threshold values may be continuously updated based on the results of offline processing, thereby providing updated knowledge, as will be explained in further detail below.
- in step 401, data is retrieved or extracted from a data stream, for example using a sliding time window such that data from the stream is passed in real-time through a sliding time window mechanism.
- the retrieved data is then processed in step 403, for example converted into a different format, and a location-specific graph is generated in step 405. Then, one or more structural properties of the graph are obtained and calculated, step 407.
- the one or more structural properties are compared with the threshold value obtained by performing an analysis on past values, step 409, such that the threshold value can be fixed on-the-fly to make the system as dynamic as possible, to accommodate the evolving nature of the data stream.
- This process therefore checks, for example, the QoS of valued customers present in the specific location, with changes detected and reported in step 411; a change may be recorded whenever the current value exceeds the threshold value.
- a threshold value (or values) during an initialization stage (as shown in Figure 3) and the determination of a threshold value (or values) on-the-fly (as shown in Figure 4) may be associated or used in the same system.
- FIG. 5 shows a distributed processing architecture 500 according to another embodiment of the present invention.
- a first processing unit 203 retrieves or extracts data from a data stream 201 using a sliding time window, which is passed to a second processing unit 205 of the distributed framework.
- the second processing unit 205 is adapted to convert the extracted data, such that it is transformed into a suitable file format that is best suited for use by a subsequent processing node, i.e. processing unit 207 in this example.
- the second processing unit 205 is configured to receive the extracted data from the first processing unit 203, and convert or transform the extracted data into a format suitable for graphical representation.
- the third processing unit 207 is adapted to generate a graphical representation of the converted extracted data.
- the third processing unit 207 is configured in this embodiment to interface with a graphical visualization unit 513 when generating the graphical representation of the extracted data.
- the second processing unit 205 will have converted the data into a format which is best suited for importation into the graphical visualization unit 513 for visualization purposes.
- the extracted streaming data may be converted into a suitable format for generating a graphical representation of the data, or for making a readable file for a specific visualization software, for example Gephi software for generating a graph, and can involve, for example, converting the data into ".net" type data files.
- a fourth processing node 209 is adapted to determine at least one structural property for the graphical representation that has been generated by the third processing unit 207 (the third processing unit possibly having the assistance of the visualization unit 513), and further adapted to aggregate the at least one structural property into a single estimated value based on an analysis (for example regression or prediction analysis), in order to be compared with a dynamic threshold value obtained through an analysis on the past data (which is stored in an updated knowledge base 515) to detect the possible presence of changes 211 in a system.
- the steps performed by a particular processing node, for example the steps performed by node 209, may be split for processing by separate nodes, if desired.
- the at least one structural property may comprise one or more of the structural properties described earlier in the application, or another structural property of the graphical representation.
- each data stream generated from one single location is processed on a sliding time window and then sent to a graphical visualization unit 513 that is interfaced with the distributed framework to visualize the data in the form of a graph.
- the same process can be applied for multiple locations to track and provide intended QoS of other users (such as
- the subsequent processing units obtain the structural properties of the generated graph to arrive at a single aggregated value based on a prediction or regression analysis. This may be performed by analyzing test data that assigns a specific estimate for each of the types of structural property, so as to be compared with a threshold value to detect the possible occurrence of changes in the system and activate alarms.
- the apparatus can be adapted to periodically perform an analysis on the past data, for example more historic data, to fix the threshold values on-the-fly.
- the system takes an input data stream, for example from charging/billing nodes, and feeds the data to a distributed processing architecture or framework.
- the stream data is read (or extracted or retrieved) through a first processing node and passed on to subsequent processing nodes in the framework.
- a subsequent processing node may be a node which processes the obtained data to transform the data into a suitable file format that can be imported into a graphical visualization unit 513 (for example GEPHI ®) for visualization purposes.
- the structural properties obtained for the generated graph by a fourth processing unit 209 are aggregated into a single estimated value based on predictive analysis in order to be compared with a threshold value (for example a dynamic threshold value) obtained through an analysis on past or historical data (which is stored in an updated knowledge base 515) to detect the possible occurrence of changes in the system.
- the occurrence of changes in the system is reported as a change condition or alarm condition 211 to a stakeholder.
- an alarm may be reported through an alarm agent used in the OSS/BSS system. This alarm indicates the changes
- FIG. 6 shows an example of how an embodiment of the invention may be used to generate a change condition or an alarm condition.
- the section labeled 600 relates to elements of a conventional telecommunications system, while the section labeled 200/500 relates to a distributed processing architecture according to embodiments of the invention.
- Charging or billing nodes 601 generate a data stream, which is retrieved or extracted by a node 603 of the distributed processing architecture 200/500.
- the module 605 is adapted to process the extracted data stream, for example using the techniques described in the embodiments above, and report changes that are detected via this monitoring, in order to generate an alarm signal 607 and its corresponding details.
- An alarm signal can be used, for example, to prompt end users, and to indicate a disorientation at a particular location, such that further action can be taken.
- a network operator usually knows about the occurrence of huge events at a specific location at a particular time of day. Network operators need to react if their loyal customers are present at that location at that time (for example to provide better quality of service to their most loyal or influential customers).
- the embodiments of the invention therefore enable a change condition or alarm signal to be generated, to indicate the presence of loyal customers at a specific location to the network operators, such that the network operators can take immediate action to provide a better quality of service to their customers.
- the alarm signal may be used by operational support systems and/or business support systems (OSS/BSS) 609, to generate a report of an alarm 611, which can aid faster decision making (or consequential action).
- OSS operational support systems
- BSS business support systems
- call detail records of mobile phone customers and tower data streams.
- the input consists of the corresponding call detail records for a time window of a sliding time window, for example the current hour, with the call detail records comprising fields such as: locality, sub-locality, timestamp, originating antenna id, terminating antenna id, total number of calls shared between the antennae during the current hour and valued customer call and mobility details
- the data stream, for example the current hour's CDR data stream, is processed to generate a corresponding ".net" file that labels all the nodes and marks edges with the labelled nodes, with the weight of each edge being equated to the total number of calls shared between the antennae.
- the generated ".net" file is imported into a graphical visualization unit so that a graphical representation of the data for the current hour can be generated.
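- A ".net" file of this kind is typically in the Pajek format, which lists numbered, labelled vertices followed by weighted edges. The sketch below assumes that format and an undirected graph (`*Edges`; `*Arcs` would be used for directed links); the function name and antenna identifiers are hypothetical.

```python
def to_pajek(call_counts):
    """Render {(antenna_a, antenna_b): number_of_calls} as Pajek .net text."""
    labels = sorted({antenna for pair in call_counts for antenna in pair})
    # Pajek vertex numbers start at 1.
    index = {label: i for i, label in enumerate(labels, start=1)}
    lines = [f"*Vertices {len(labels)}"]
    lines += [f'{index[label]} "{label}"' for label in labels]
    lines.append("*Edges")
    for (a, b), calls in sorted(call_counts.items()):
        # The edge weight is the total number of calls shared between the antennae.
        lines.append(f"{index[a]} {index[b]} {calls}")
    return "\n".join(lines) + "\n"

# Hypothetical call counts for the current hour.
net_text = to_pajek({("A17", "B02"): 12, ("B02", "C33"): 4})
```

- Writing `net_text` to a file with a ".net" extension would give a file that a graph visualization tool supporting the Pajek format can import.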
- the generated graphical representation is then analysed to obtain its structural properties (for example one or more structural properties) and hence form a single aggregated value by combining the values of the different one or more structural properties, with prediction analysis being used to check whether the single aggregated value is within a threshold value. If the current value exceeds the threshold, then the changes are reported. The changes may be reported along with the unique identifier (ID) of a sub-set of customers, for example the loyal customers (most influential users in the network), who get affected due to the available QoS at that current timing.
- ID unique identifier
- This report can be used by the service providers to perform efficient load balancing in a specific location in order to deal with the changes, and thus provide better services to the valued customers. These steps are executed in a
- the embodiments of the present invention may be used in an application that provides sentiment analysis for social network data, for example when analyzing data streams from a social network such as Twitter.
- Sentiment analysis or opinion mining refers to the application of natural language processing, computational linguistics and text analytics to identify and extract subjective information in source materials.
- a model such as an AFINN model (AFINN being an affective lexicon by Finn Årup Nielsen) can be used that lists various words and phrases rated for valence with an integer between -5 and +5, where -5 denotes the most negative sentiment and +5 the most positive sentiment.
- this process can be used by embodiments of the invention in the analysis of the sentiment of a tweet in a Twitter data stream. The steps involved in a Twitter data stream analysis are explained further below. The location-specific tweets are retrieved through the Twitter Search API
- the retrieved tweets may be processed to remove certain words, for example stop words such as "a, an, the, those" etc., and then categorized based on the keywords present.
- the categories may comprise, for example, categories relating to a number of topics such as "Music, Sports, Politics and Others". Then, further analysis is performed on the sentiment of the keywords by grouping them as "happy or sad" by the use of a sentiment analysis model.
- the categorized location specific tweets can be transformed into a ".net" file for import into the graphical visualization unit for graph generation purposes.
- the nodes represent the users and the edges, the total number of tweets shared between the users.
- the generated user graph can then be analyzed to obtain its structural properties, and a regression analysis thereby performed to check for the possible presence of a disorientation.
- this process is run in a parallelized real-time distributed framework interfaced with a visualizer by visualizing the twitter data stream.
- the estimated value represents an example of a disorientation that has happened in one situation, where coefficients denote the constants calculated and the estimated value refers to the abnormal value that is compared to the regular threshold value.
- in Figures 7a to 7i there can be seen the evolving nature of the data streams in real time.
- the threshold value was fixed dynamically on-the-fly.
- the evolution of CDR data stream is traced by visualizing the same for each hour by means of a graph.
- Figures 7a to 7i show the graphs at time windows corresponding to 12am, 01am, 04am, 06am, 10am, 12pm, 1pm, 2pm and 3pm, respectively.
- the sizes of the various nodes are used to rank "centrality" (for example, in Figure 7a the node 701 is more central than node 703), while different shading is used to show "modularity".
- the thickness of the edge determines how strong the link between the nodes is (for example, in Figure 7a the link 705 between nodes 707 and 709 is stronger than the link 711 between nodes 707 and 701).
- disorientations were found at 1pm (Figure 7g), 2pm (Figure 7h) and 3pm (Figure 7i) for the given CDR data stream, and these disorientations would have been reported as change conditions or alarms, such that necessary actions can be taken, for example load balancing.
- the disorientations may be detected at these times by comparing the estimated value for the structural properties of these graphs with the threshold value (i.e. an aggregated estimated value based on the structural properties noted in an equation as shown on the previous page), with the aggregated estimated value exceeding the threshold value in the graphs for 1pm, 2pm and 3pm.
- a disorientation is therefore detected as a change in the estimated value, when the estimated value goes above the threshold value. Calculation of the estimated value is based on the changes in each graphical (structural) property.
- the overloaded antennae involved can be identified and a list of users (for example loyal customers) who are using those particular antennae can also be generated, as illustrated in Figure 7j, to aid a service provider in taking the necessary actions.
- the data stream is location specific, and the step of reporting a change condition can further comprise the steps of
- the step of changing a service parameter may comprise the step of performing a location-based quality of service change.
- the embodiments of the invention provide an approach that integrates the distributed framework and a visualizer to visualize large streams of data by constructing graphs.
- the embodiments have been implemented with the capability to visualize the location specific CDR data stream in the form of graphs in a real-time distributed environment, for example to automatically or dynamically detect changes without affecting the loyal customers through load balancing.
- the subsequent component (processing nodes) obtain the structural properties of the generated graph to arrive at a single aggregated value based on the prediction method. This may be performed by analyzing the test data that assigns a specific coefficient for each structural property, so as to be compared with a threshold value to detect the possible occurrence of changes in the system and activate alarms. To make the threshold value more dynamic, the system may be configured to periodically perform an analysis on the past data to fix the threshold values on-the-fly (which may involve some offline processing).
- the embodiments of the invention described above provide a distributed framework having an integrated visualization unit, for visualizing large streams of data by constructing social graphs on different time windows.
- the embodiments of the invention provide the capability to visualize the location specific data stream in the form of graphs in a real-time distributed environment.
- changes in trends can be reported by means of graphical representation and by analysing the various structural properties of social graphs to track the changes. Structural properties of social graphs are analysed to determine changes in trends, which can be tracked over time, i.e. over subsequent sliding windows.
Landscapes
- Engineering & Computer Science (AREA)
- Computer Networks & Wireless Communication (AREA)
- Signal Processing (AREA)
- Data Mining & Analysis (AREA)
- Data Exchanges In Wide-Area Networks (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
A method of processing a data stream of a communication network in a distributed processing architecture comprising a plurality of processing units comprises the steps of extracting data from the data stream, wherein the data is extracted for a particular time window of a sliding time window. The extracted data is converted into a format suitable for graphical representation. A graphical representation of the converted extracted data is generated, and an estimated value of at least one structural property of the graphical representation of the data determined. The estimated value of the at least one structural property is compared with a threshold value, and a change condition reported based on the outcome of the comparison step. The change condition may be used, for example, to change a location based quality of service parameter.
Description
Apparatus and method for processing data streams in a communication network
Technical Field
The present invention relates to an apparatus and method for processing data streams in a communication network, for example a telecommunications network.
Background
In a communication network such as a telecommunications network, it is desirable to be able to understand the Quality of Service (QoS) being provided to customers. In particular, there is a requirement to determine the QoS being provided to valued (VIP) customers at a specific location, and based on Key Performance Indicators (KPIs) endorse better services and promote new recommendations to such valued customers. This type of information is required by telecommunication network operators, for example in order to provide information for operational support systems (OSS) and business support systems (BSS), so that the telecommunication network operators can serve their loyal customers well.
In this regard, understanding and visualizing the movements of customers to new locations is an interesting analytics operation which the mobile phone operators desire to execute. Traditional approaches do not provide solutions which are fast enough to enable real time operation to be performed, for example for checking the QoS related to valued customers.
Existing solutions are unable to handle large quantities of online transaction data (for example customer call data or cell tower data), and as such are unable to extract any patterns or trends from the data in a meaningful way.
Summary
It is an aim of the present invention to provide a method and apparatus which obviate or reduce at least one or more of the disadvantages mentioned above.
According to a first aspect of the present invention there is provided a method of processing a data stream of a communication network in a distributed processing architecture comprising a plurality of processing units. The method comprises the steps of extracting data from the data stream, wherein the data is extracted for a particular time window of a sliding time window. The extracted data is converted into a format suitable for graphical representation, and then a graphical representation of the converted extracted data is generated. An estimated value is determined of at least one structural property of the graphical representation of the data. The estimated value of the at least one structural property is compared with a threshold value, and a change condition is reported based on the outcome of the comparison step.
According to another aspect of the present invention there is provided a distributed processing architecture for processing a data stream of a communications network. The distributed processing architecture comprises a first processing unit adapted to extract data from the data stream, wherein the data is extracted for a particular time window of a sliding time window. A second processing unit is adapted to receive the extracted data from the first processing unit, and convert the extracted data into a format suitable for graphical representation. A third processing unit is adapted to generate a graphical representation of the converted extracted data. A fourth processing unit is adapted to determine an estimated value of at least one structural property of the graphical representation of the data, and further adapted to compare the estimated value of the at least one structural property with a threshold value, and report a change condition based on the outcome of the comparison.
Brief description of the drawings
For a better understanding of examples of the present invention, and to show more clearly how the examples may be carried into effect, reference will now be made, by way of example only, to the following drawings in which:
Figure 1 shows a method according to an embodiment of the invention;
Figure 2 shows an apparatus according to an embodiment of the invention;
Figure 3 shows a method according to another embodiment of the invention;
Figure 4 shows a method according to another embodiment of the invention;
Figure 5 shows an apparatus according to an embodiment of the invention;
Figure 6 shows a typical application of an embodiment of the invention; and
Figures 7a to 7j show a further example of a typical application of an embodiment of the invention.
Detailed description
The embodiments of the invention, as will be described below, enable data streams to be processed in a real time environment to enable a network node or network operator to obtain detailed network information dynamically, or in real time, such that the detailed network information can be used for various tasks, for example catering for the needs of customers (such as valued customers), or for upgrading their loyalty offerings, or other location based quality of service improvements. It is noted that the results of the data processing can be used for
other applications, without departing from the scope of the invention as defined in the appended claims.
In the examples described below the data streams will be described in the context of an application relating to a location specific data stream such as call detail records of a specific place, for example received from a mobile communications operator, or a location specific data stream such as Twitter® data streams related to a specific place, for example from a social network. It is noted, however, that the embodiments of the invention can be used with other types of data streams and other data feeds, or other social networking sites. It is also noted that in the context of social networks, a specific place (location) can be extended from a small geographic area to a specific country, or even the entire world in terms of social networks.
Figure 1 shows a method according to an embodiment of the invention, for processing a data stream of a communication network in a distributed processing architecture comprising a plurality of processing units. The method comprises the steps of extracting data from the data stream, wherein the data is extracted for a particular time window of a sliding time window, step 101. The extracted data is converted into a format suitable for graphical representation, step 103. It is noted that the exact type of conversion will depend on a particular application (and hence the type of data extracted), and also the type of subsequent processing being performed. For example, the extracted data stream may be converted into a suitable format for generating a graphical representation of the data, or for making a readable file for a specific
visualization software, for example Gephi software for generating a graph. A graphical representation of the converted extracted data is then generated, step 105, and an estimated value of at least one structural property of the graphical representation of the data determined, step 107. The estimated value of the at least one structural property is compared with a threshold value, step 109, and a change condition reported, step 111, based on the outcome of the comparison step.
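As an illustration of steps 101 and 103 only, the following minimal sketch extracts records through a sliding time window and converts each window into weighted edges ready for graph generation. The record layout `(timestamp, caller, callee)` and the function names are assumptions made for this example and are not taken from the application.

```python
from collections import Counter

def extract_window(records, start, length):
    """Step 101 (sketch): keep records whose timestamp lies in [start, start + length)."""
    return [r for r in records if start <= r[0] < start + length]

def to_edge_weights(window):
    """Step 103 (sketch): convert records into weighted undirected edges.

    The weight of an edge is the number of records exchanged between its two users.
    """
    weights = Counter()
    for _, a, b in window:
        weights[tuple(sorted((a, b)))] += 1
    return dict(weights)

def sliding_windows(records, length, step):
    """Yield (window_start, edge_weights) for successive window positions."""
    if not records:
        return
    t = min(r[0] for r in records)
    end = max(r[0] for r in records)
    while t <= end:
        yield t, to_edge_weights(extract_window(records, t, length))
        t += step

# Hypothetical call records: (timestamp in seconds, caller, callee)
records = [(0, "A", "B"), (10, "B", "A"), (70, "B", "C"), (130, "C", "D")]
windows = list(sliding_windows(records, length=60, step=60))
```

Each yielded edge dictionary would then be passed on to the graph generation and analysis steps 105 to 111.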
The calculation of structural properties for the graphical representation of the data enables, for example, a disorientation at a particular location to be analysed and understood, based upon the change conditions that are reported. A disorientation can comprise, for example, any form of abnormal condition or situation, or the presence of loyal customers near to a highly transacted cell tower. Disorientation in a social network environment can comprise, for example, an abnormal condition such as the spread of some unwanted news very quickly, which could affect the integrity or security of a country or society, or some individual or company's reputation. It is noted that other forms of disorientation are intended to be embraced by the invention, as defined by the appended claims.
The method steps 101 to 111 described above may be performed in a plurality of different processing units of the distributed processing architecture. For example, according to one embodiment each of the steps is performed in a separate processing unit. According to another example steps 107 to 111 are performed in the same processing unit, while the other steps are performed in separate processing units.
The embodiments of the invention enable large data sets to be handled dynamically in real time by the manner in which different processing units act on the data extracted from the sliding time window in a sequential manner, but also due to the manner in which the extracted data is represented graphically, such that at least one structural property can be determined and then compared with a threshold value in order to trigger a change condition (for example by generating an alarm condition), which can be used to automatically alert a network operator that action may be needed, and/or automatically change one or more parameters of the communication network.
Furthermore, generating a graphical representation of the extracted data enables a visualization of the data streams to be provided, thus aiding network operators visually, in addition to the automatic monitoring of change conditions noted above. The generation of graphs is sometimes referred to as a "Graph Stream". Due to the large volume of data that is available in a data stream of a telecommunications network, the representation of such data using graphs can help one to visualize and understand the evolving nature of the data streams over a period of time. Many large web and communication network applications create data streams which can be represented as a sequential stream of edges in a social graph. For example, conversations (such as Voice Calls or SMS messages) in a telecommunication network, or messages in a mobile telecommunications network, can be represented as a large stream of edges in a social graph representation. Such streams are typically very large, because of the large amount of underlying activity in such networks.
The embodiments of the invention provide a graphical representation or visualization of data streams in a distributed environment, such that processing units and/or network operators can process the large data streams in a real-time manner.
Figure 2 shows a distributed processing architecture 200 for processing a data stream 201 of a communications network, according to an embodiment of the invention. The distributed processing architecture 200 comprises a first processing unit 203 adapted to extract data from the data stream 201. The data is extracted for a particular time window of a sliding time window. A second processing unit 205 is adapted to receive the extracted data from the first processing unit, and convert the extracted data into a format suitable for graphical representation. A third processing unit 207 is adapted to generate a graphical representation of the converted extracted data. As will be explained later in the application, this may involve processing by the third processing unit 207 alone, and/or processing by another processing entity (not shown). A fourth processing unit 209 is adapted to determine an estimated value of at least one structural property of the graphical representation of the data. The fourth processing unit 209 is further adapted to compare the estimated value of the at least one structural property with a threshold value, and report a change condition 211 based on the outcome of the comparison.
It is noted that one or more of the processing tasks described above may be combined for processing by another processing unit of the distributed processing architecture, and/or separated for processing by separate processing units. For example, the steps performed by the fourth processing unit 209 may be separated and processed by different processing units. Each of the plurality of processing units may process data in parallel based on load, and each processing unit may be split into a plurality of different processing units to execute a task.
The data streams being processed may comprise customer or cell tower data, for example. The embodiments enable the data streams to be processed such that the method and apparatus can allow change conditions (or alarm conditions) to be detected and reported automatically, while also providing the data in a manner such that network operators can visualize the findings from real-time streams, thereby providing an opportunity for the operators to provide better services to their valued customers.
According to one embodiment this may be provided using a topology that comprises a distributed storm framework that consists of various built-in components that are configured to accept the data stream, process it and visualize the same in the form of graphs. The data from the data stream is extracted by means of a sliding time window mechanism and passed on to the subsequent components present in the topology. The data thus moves through
the topology in a sequential manner from one component or processing unit to another.
A storm framework provides a set of general primitives for performing distributed real-time computation, and can be used for "stream processing", by processing regular messages and updating databases on a real-time basis. Thus, in an embodiment using storm processing there is provided a distributed processing architecture that is used for continuous computation, whereby a continuous query is performed on the data streams, with the results being streamed out to users as they are computed. Storm terminology will be familiar to a person skilled in the art, and includes terminology such as Streams, Spouts, Bolts, Tasks, Workers, Stream Groupings, and Topologies. As an alternative to using storm, embodiments of the invention may process the data by managing clusters of queues and workers. Such an example involves sequential processing which processes data through the techniques of managing a plurality of clusters of systems with queues and workers as a processing node.
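The alternative of queues and workers mentioned above can be pictured with the following minimal single-machine sketch, in which each stage is a worker thread that reads from an inbox queue and writes to an outbox queue, so that data moves through the stages sequentially. It uses only the Python standard library and does not use the Storm API; the stage functions in the usage lines are hypothetical placeholders for conversion, property calculation and threshold comparison.

```python
import queue
import threading

SENTINEL = object()  # marks the end of the stream

def start_stage(func, inbox, outbox):
    """Start one worker that applies func to every item arriving on inbox."""
    def run():
        while True:
            item = inbox.get()
            if item is SENTINEL:
                outbox.put(SENTINEL)  # propagate shutdown to the next stage
                break
            outbox.put(func(item))
    worker = threading.Thread(target=run, daemon=True)
    worker.start()
    return worker

def run_pipeline(items, funcs):
    """Pass items through the stage functions in order, one worker per stage."""
    queues = [queue.Queue() for _ in range(len(funcs) + 1)]
    workers = [start_stage(f, queues[i], queues[i + 1]) for i, f in enumerate(funcs)]
    for item in items:
        queues[0].put(item)
    queues[0].put(SENTINEL)
    results = []
    while True:
        out = queues[-1].get()
        if out is SENTINEL:
            break
        results.append(out)
    for worker in workers:
        worker.join()
    return results

# Hypothetical stages: de-duplicate edges, count them, compare with a threshold of 1.
stream = [[("A", "B"), ("B", "C")], [("A", "B")]]
alarms = run_pipeline(stream, [set, len, lambda n: n > 1])
```

With one worker per stage the output order matches the input order; a distributed deployment would replace the in-process queues with networked ones.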
According to one embodiment, the step of generating a graphical representation of the extracted data may comprise the step of interfacing with a graphical visualization unit to generate the graphical representation of the data, for example a graphical visualization unit that is adapted to process a Gephi® application, as will be described later in the application. Gephi is an open-source network analysis and visualization software package written in Java. The goal of this tool is to help data analysts to make hypotheses, intuitively discover patterns, and isolate structure singularities or faults during data sourcing. It is a complementary tool to traditional statistics, as visual thinking with interactive interfaces is now recognized to facilitate reasoning. The main benefit of this fast graph visualization engine is to speed up understanding and pattern discovery in large graphs.
The method performed by the embodiment of Figure 1 comprises the step of determining an estimated value of at least one structural property of the graphical representation of the data, for example a graphical representation generated using Gephi. An explanation will now be provided of examples of the at least one structural property that can be used by embodiments of the invention. It is noted that other structural properties may also be used, without departing from the scope of the invention as defined in the appended claims.
A graphical representation of the data comprises a set of vertices (V) and a set of edges (E) between the set of vertices, wherein an edge ($E_{ij}$) connects a first vertex ($V_i$) with a second vertex ($V_j$). In a communication network the vertices may represent nodes of the communication network, or represent users in the communication network, with edges representing links between such nodes or users.
A first example of a structural property comprises an average path length value, $l_G$, relating to the average number of steps along the shortest paths for all possible pairs of first and second vertices.
The average path length value $l_G$ provides a measure of the efficiency of information or mass transport on a network. Consider a graphical representation of data G, having a set of vertices V. Let $d(v_1, v_2)$, where $v_1, v_2 \in V$, denote the shortest distance between vertices $v_1$ and $v_2$. Assume that $d(v_1, v_2) = 0$ if $v_1 = v_2$ or $v_2$ cannot be reached from $v_1$. Then, the average path length $l_G$ is determined as:
$$l_G = \frac{1}{n(n-1)} \sum_{i \neq j} d(v_i, v_j)$$
where n is the number of vertices in the graphical representation G of the data.
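The average path length defined above can be computed with a breadth-first search from every vertex. The sketch below assumes an unweighted graph stored as a dictionary of neighbour sets (a representation chosen for this example); unreachable pairs contribute zero, as stated in the definition.

```python
from collections import deque

def average_path_length(adj):
    """Sum of shortest-path lengths over all ordered vertex pairs, divided by n(n-1)."""
    n = len(adj)
    if n < 2:
        return 0.0
    total = 0
    for source in adj:
        dist = {source: 0}
        frontier = deque([source])
        while frontier:
            u = frontier.popleft()
            for w in adj[u]:
                if w not in dist:  # first visit gives the shortest distance
                    dist[w] = dist[u] + 1
                    frontier.append(w)
        total += sum(dist.values())  # unreachable vertices are absent, i.e. count as 0
    return total / (n * (n - 1))

# Path graph A - B - C: distances sum to 8 over 6 ordered pairs.
path_graph = {"A": {"B"}, "B": {"A", "C"}, "C": {"B"}}
l_g = average_path_length(path_graph)
```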
A second example of a structural property comprises a connected component count value, relating to a sub-graphical representation of the graphical representation of the data, in which any two vertices are connected to each other by paths, and which is connected to no additional vertices in the main graphical representation. In other words, in graph theory a connected
component of an undirected graph is a sub-graph in which any two vertices are connected to each other by paths, and which is connected to no additional vertices in the main graph. The number of connected components is an important topological invariant of a graph.
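A connected component count can be obtained by repeatedly starting a traversal from a vertex that has not yet been visited. The following sketch assumes an undirected graph stored as a dictionary of neighbour sets, which is a representation chosen for this example.

```python
def count_connected_components(adj):
    """Count the connected components of an undirected graph."""
    seen = set()
    count = 0
    for start in adj:
        if start in seen:
            continue
        count += 1  # start belongs to a component not visited so far
        seen.add(start)
        stack = [start]
        while stack:
            u = stack.pop()
            for w in adj[u]:
                if w not in seen:
                    seen.add(w)
                    stack.append(w)
    return count

# Two linked pairs and one isolated vertex give three components.
graph = {"A": {"B"}, "B": {"A"}, "C": {"D"}, "D": {"C"}, "E": set()}
components = count_connected_components(graph)
```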
A third example of a structural property comprises an average clustering coefficient value, each clustering coefficient value providing an indication regarding the degree to which vertices (or nodes) in a graph tend to cluster together.
Each clustering coefficient is therefore a measure of the degree to which nodes in a graph tend to cluster together. In most real-world networks, for example mobile phone networks and social networks, nodes tend to create tightly knit groups characterised by a relatively high density of ties; this likelihood tends to be greater than the average probability of a tie randomly established between two nodes. This property is calculated for the generated graphical representation of the data, for example using a Gephi toolkit. As mentioned above, a graphical representation of data, G = (V, E), formally consists of a set of vertices V and a set of edges E between them. An edge $e_{ij}$ connects vertex $v_i$ with vertex $v_j$. The neighbourhood $N_i$ for a vertex $v_i$ is defined as its immediately connected neighbours as follows:
$$N_i = \{ v_j : e_{ij} \in E \lor e_{ji} \in E \}$$
Where $k_i$ is defined as the number of vertices, $|N_i|$, in the neighbourhood, $N_i$, of a vertex, then the clustering coefficient is given as:
$$C_i = \frac{|\{ e_{jk} : v_j, v_k \in N_i, e_{jk} \in E \}|}{k_i (k_i - 1)}$$
The clustering coefficient for the whole network, AC, is given as the average of the clustering coefficients of all n vertices:
$$AC = \frac{1}{n} \sum_{i=1}^{n} C_i$$
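A minimal sketch of the clustering coefficient calculation follows. It assumes an undirected graph stored as a dictionary of neighbour sets, and counts each undirected link between two neighbours once in each direction, which matches the denominator $k_i(k_i - 1)$. Assigning a coefficient of zero to vertices with fewer than two neighbours is an assumption of this example.

```python
def clustering_coefficient(adj, v):
    """Fraction of ordered neighbour pairs of v that are themselves linked."""
    nbrs = adj[v]
    k = len(nbrs)
    if k < 2:
        return 0.0  # assumption: undefined case treated as zero
    links = sum(1 for j in nbrs for w in nbrs if j != w and w in adj[j])
    return links / (k * (k - 1))

def average_clustering(adj):
    """Average of the clustering coefficients of all vertices."""
    return sum(clustering_coefficient(adj, v) for v in adj) / len(adj)

# Triangle A-B-C with a pendant vertex D attached to C.
graph = {"A": {"B", "C"}, "B": {"A", "C"}, "C": {"A", "B", "D"}, "D": {"C"}}
ac = average_clustering(graph)
```

Here A and B have coefficient 1, C has 1/3 and D has 0, so the average is 7/12.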
A fourth example of a structural property comprises an average degree value, relating to the number of edges in a set of edges E in comparison to the number of vertices in the set of vertices V.
The average degree of a graph G = (V, E) is therefore a measure of how many edges are in set E compared to the number of vertices in set V. Because each edge is incident to two vertices and counts in the degree of both vertices, the average degree of an undirected graph is 2*|E|/|V|.
A fifth example of a structural property comprises a graph density value, relating to a measure of how many edges are in a set of edges E compared to a maximum possible number of edges between vertices in the set of vertices V. Thus, the density of a graph G = (V, E) measures how many edges are in set E compared to the maximum possible number of edges between vertices in set V. A directed graph can have at most |V| * (|V|-1) edges, so the density of a directed graph is: |E| / (|V| * (|V|-1)).
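The two formulas above, for average degree and for the density of a directed graph, can be written directly as functions of the vertex and edge counts. Returning zero for degenerate graphs (no vertices, or fewer than two vertices) is an assumption of this sketch.

```python
def average_degree(num_vertices, num_edges):
    """Average degree of an undirected graph: 2|E| / |V|."""
    if num_vertices == 0:
        return 0.0  # assumption for the empty graph
    return 2 * num_edges / num_vertices

def directed_density(num_vertices, num_edges):
    """Density of a directed graph: |E| / (|V| (|V| - 1))."""
    if num_vertices < 2:
        return 0.0  # assumption: no edges are possible
    return num_edges / (num_vertices * (num_vertices - 1))

avg_deg = average_degree(num_vertices=4, num_edges=4)    # 8 / 4
density = directed_density(num_vertices=4, num_edges=6)  # 6 / 12
```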
A sixth example of a structural property is a modularity value, relating to a measure of the strength of division of a graph into modules. Modularity is one measure of the structure of graphs. It is used to measure the strength of division of a graph into modules (also called groups, clusters or communities). Graphs with high modularity have dense connections between the nodes within modules but sparse connections between nodes in different modules. Modularity is often used in optimization methods for detecting community structure in graphs. A formulation of the modularity is as follows. Let $A_{ij}$ denote the elements of the adjacency matrix of the graph, $k_i$ the degree of vertex i, $m$ the total number of edges, and $c_i$ the group to which vertex i belongs. Define $S_{ir}$ to be 1 if vertex i belongs to group r and zero otherwise. Then
$$\delta(c_i, c_j) = \sum_r S_{ir} S_{jr}$$
and hence
$$Q = \frac{1}{2m} \sum_{ij} \sum_r \left[ A_{ij} - \frac{k_i k_j}{2m} \right] S_{ir} S_{jr} = \frac{1}{2m} \mathrm{Tr}(S^T B S)$$
where S is the (non-square) matrix having elements $S_{ir}$ and B is the so-called modularity matrix, which has elements
$$B_{ij} = A_{ij} - \frac{k_i k_j}{2m}$$
All rows and columns of the modularity matrix sum to zero, which means that the modularity of an undivided network is also always zero.
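The modularity of a given division can be evaluated directly from the standard definition, $Q = \frac{1}{2m} \sum_{ij} [A_{ij} - k_i k_j / (2m)] \, \delta(c_i, c_j)$. The sketch below assumes an unweighted, undirected graph stored as a dictionary of neighbour sets and a hypothetical `community` mapping from vertex to group label; it evaluates a division rather than searching for the best one.

```python
def modularity(adj, community):
    """Modularity Q of the division given by community[vertex] -> group label."""
    two_m = sum(len(nbrs) for nbrs in adj.values())  # each edge is counted twice
    if two_m == 0:
        return 0.0
    q = 0.0
    for i in adj:
        for j in adj:
            if community[i] != community[j]:
                continue  # delta(c_i, c_j) is zero
            a_ij = 1.0 if j in adj[i] else 0.0
            q += a_ij - len(adj[i]) * len(adj[j]) / two_m
    return q / two_m

# Two triangles joined by the single edge 3-4.
graph = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3, 5, 6}, 5: {4, 6}, 6: {4, 5}}
split = {1: "a", 2: "a", 3: "a", 4: "b", 5: "b", 6: "b"}
undivided = {v: "all" for v in graph}
q_split = modularity(graph, split)
q_undivided = modularity(graph, undivided)
```

For this graph the two-triangle division gives Q = 5/14, while the undivided network gives zero, in line with the remark above.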
A seventh example of a structural property is an average weighted degree value, relating to an average of the sum of weights of the edges of the nodes. This structural property possesses a higher estimated value obtained through regression analysis, and hence serves as a discriminant to detect the possible occurrence of disorientations.
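As a sketch of the average weighted degree, the following function sums the weights of the edges incident to each vertex and averages the result over the vertices. It assumes undirected edges given as a dictionary from vertex pairs to weights, and only vertices that appear in at least one edge are counted; both are assumptions of this example.

```python
def average_weighted_degree(edge_weights):
    """Average, over vertices, of the sum of weights of their incident edges."""
    strength = {}
    for (u, v), w in edge_weights.items():
        strength[u] = strength.get(u, 0) + w
        strength[v] = strength.get(v, 0) + w
    if not strength:
        return 0.0
    return sum(strength.values()) / len(strength)

# Hypothetical weights, e.g. number of calls between two users.
edges = {("A", "B"): 3, ("B", "C"): 1}
awd = average_weighted_degree(edges)  # strengths are A=3, B=4, C=1
```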
It is noted that an estimated value of a structural property of the graphical representation of the data, according to embodiments of the invention, may comprise any one or more of the examples described above.
For example, according to one embodiment, estimated values relating to two or more structural properties are combined to provide a single aggregated estimated value, the single aggregated estimated value being compared with the threshold value.
By having the values of different structural properties aggregated into one estimated value, this allows a simple comparison to be made with just one threshold value, and can therefore have the advantage of reducing the complexity of data processing. Thus, in an application using this format, the system can be configured to select which combination of structural properties would be best suited to monitor a particular aspect in the communications network, with the estimated values for the selected structural properties then being aggregated or combined into one single estimated value. The single estimated value (representing a plurality of separate estimated values for the various structural properties) is then compared during use with a single threshold value, in order to detect a change condition or alarm condition relating to the communications network being monitored. The threshold value may itself have been formed using historical data (either from an initialization phase or on-the-fly during use), and whereby estimated values for similar or the same structural properties are aggregated to form the threshold value. As such, during use a single estimated value representing at least one structural property is compared with a threshold value, which itself represents a corresponding at least one structural property.
According to an alternative embodiment, an estimated value of a particular structural property may be compared with a respective threshold value for that respective structural property. In such an embodiment each estimated value relating to a structural property is compared with its own threshold value, rather than aggregating them first as described in the section above. Thus, in such an embodiment an estimated value of a single structural property alone can be sufficient to indicate a disorientation for generating a change condition (or alarm).
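The two comparison schemes described above can be sketched as follows. The aggregation is assumed here to be a weighted sum of the selected structural properties with one coefficient per property, for example as might be produced by a regression analysis; the property names, coefficients and thresholds are hypothetical values chosen for illustration.

```python
def aggregate_estimate(properties, coefficients, intercept=0.0):
    """Combine several structural property values into a single estimated value."""
    return intercept + sum(coefficients[name] * value for name, value in properties.items())

def change_detected(estimate, threshold):
    """Aggregated scheme: one estimate compared with one threshold."""
    return estimate > threshold

def properties_over_threshold(properties, thresholds):
    """Alternative scheme: each property compared with its own threshold."""
    return [name for name, value in properties.items() if value > thresholds[name]]

# Hypothetical values for one time window.
properties = {"average_degree": 4.0, "graph_density": 0.5, "average_weighted_degree": 10.0}
coefficients = {"average_degree": 0.5, "graph_density": 2.0, "average_weighted_degree": 0.25}
estimate = aggregate_estimate(properties, coefficients)  # 2.0 + 1.0 + 2.5
alarm = change_detected(estimate, threshold=5.0)
exceeded = properties_over_threshold(
    properties,
    {"average_degree": 5.0, "graph_density": 0.4, "average_weighted_degree": 12.0},
)
```

In the alternative scheme a single property exceeding its own threshold, here the graph density, would be enough to raise a change condition.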
The threshold value may be determined during an initialization phase of operation. The initialization phase of operation may comprise the steps of:
retrieving a data stream relating to historical data; generating a graphical
representation of the historical data; determining an estimated value for one or more structural properties of the graphical representation of the historical data; and analysing the estimated value of the one or more structural properties to generate the threshold value.
According to one embodiment the method further comprises the step of updating the threshold value during use, by periodically performing the steps outlined above using more recent historical data, thereby adjusting the threshold value dynamically during use. This has the advantage of making the system as dynamic as possible to accommodate the evolving nature of the data stream being processed.
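The text does not specify how the historical estimated values are analysed to produce the threshold. Purely as an illustration, the sketch below assumes the threshold is the mean of recent estimated values plus a multiple of their standard deviation, and keeps a bounded history so that the threshold adapts as new windows arrive. The class name, the factor `k` and the history length are assumptions of this example.

```python
from collections import deque
from statistics import mean, pstdev

class DynamicThreshold:
    """Threshold derived from a rolling history of estimated values (illustrative rule)."""

    def __init__(self, history, k=2.0, max_history=24):
        self.history = deque(history, maxlen=max_history)  # oldest values drop out
        self.k = k

    @property
    def value(self):
        # Assumed rule: mean + k * population standard deviation of the history.
        return mean(self.history) + self.k * pstdev(self.history)

    def update(self, estimate):
        """Add the latest estimated value so the threshold follows the data stream."""
        self.history.append(estimate)

# Initialization phase with hypothetical historical estimates (mean 5, std 2).
threshold = DynamicThreshold([2, 4, 4, 4, 5, 5, 7, 9])
initial_value = threshold.value
threshold.update(6)  # periodic update during use
```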
Further details will now be provided about the different phases of operation of the embodiments of the invention, and in particular an initialization phase (or offline phase) and an online phase.
Referring to Figure 3, a method according to an embodiment of the invention describes the steps performed during an offline or initialization stage, whereby historical data or training data is used to determine or fix an initial threshold value (or values).
The method comprises the step of fixing the threshold value, step 301. This may comprise, during a first phase of study, processing for fixed time slots location specific data that has been stored offline in a database. Thus, in step 303 data is retrieved through a sliding time window. The data stored offline may comprise test data or historical data from the past. The time slots used during this offline mode may correspond to similar time slots as those used during online operation, although it is noted that data from subsequent time slots may also be used for the calculations performed during the offline or initialization phase.
The training data set or historical data may contain, for example, details of the earlier movements of customers and the related transactions (for example call detail records such as SMS, voice calls, data usage, etc.). This data is processed in step 305, for example converted into a format suitable for graphical representation, and represented in the form of a graph, step 307. One or more structural properties are chosen to detect changes, and the chosen structural properties are obtained in step 309. Therefore, during the set-up phase, the system is effectively being configured to determine which one or more structural properties are going to be used to provide the comparisons which will later be made during the online or real time analysis, for example based on which one or more structural properties have previously led to a disorientation at a specific location. It is noted that which structural properties to select will depend upon a particular application, based on which structural properties are more useful than others once the graphical representation is analysed.
The obtained structural property (or properties) are compared with the threshold value (or values) in order to detect possible occurrences of changes in the system, and any changes reported in step 31 1 .
Thus, one purpose of this offline implementation shown in the embodiment of Figure 3 is to understand the nature of the data so as to determine the threshold for the structural properties to efficiently detect changes in an online real-time implementation.
According to another embodiment shown in Figure 4, a threshold value (or threshold values) may be determined on-the-fly during online or real-time implementation. In such an embodiment the threshold values may be continuously updated based on the results of offline processing, thereby providing updated knowledge, as will be explained in further detail below.
In step 401 data is retrieved or extracted from a data stream, for example using a sliding time window such that data from the stream is passed in real-time through a sliding time window mechanism. The retrieved data is then processed in step 403, for example converted into a different format, and a location-specific graph is generated in step 405. Then, one or more structural properties of the graph are obtained and calculated, step 407. The one or more structural properties are compared with the threshold value obtained by performing an analysis on past values, step 409, such that the threshold value can be fixed on-the-fly to make the system as dynamic as possible, to accommodate the evolving nature of the data stream. This process therefore checks, for example, the QoS of valued customers present in the specific location. Changes are detected and reported in step 411, and may be recorded whenever the current value exceeds the threshold value.
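The sliding time window mechanism of step 401 can be illustrated with a minimal Python sketch. The class name `SlidingTimeWindow`, the window width in seconds, and the assumption that records arrive in timestamp order are illustrative choices and are not taken from the application.

```python
from collections import deque


class SlidingTimeWindow:
    """Keep the records whose timestamp lies in (latest - width, latest]."""

    def __init__(self, width):
        self.width = width      # window width, in the same unit as timestamps
        self.records = deque()  # (timestamp, record) pairs, oldest first

    def push(self, timestamp, record):
        # Assumption: records arrive in non-decreasing timestamp order.
        self.records.append((timestamp, record))
        cutoff = timestamp - self.width
        # Drop everything that has slid out of the window.
        while self.records and self.records[0][0] <= cutoff:
            self.records.popleft()

    def snapshot(self):
        """Return the records currently inside the window."""
        return [record for _, record in self.records]


# Hypothetical stream: one record every 30 minutes, one-hour window.
window = SlidingTimeWindow(width=3600)
for ts, rec in [(0, "a"), (1800, "b"), (3600, "c"), (5400, "d")]:
    window.push(ts, rec)
current = window.snapshot()
```

Each snapshot would then be handed to the conversion and graph generation steps (403, 405) for that time window.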
It is noted that the determination of a threshold value (or values) during an initialization stage (as shown in Figure 3) and the determination of a threshold value (or values) on-the-fly (as shown in Figure 4) may be associated or used in the same system.
Figure 5 shows a distributed processing architecture 500 according to another embodiment of the present invention. A first processing unit 203 retrieves or extracts data from a data stream 201 using a sliding time window, which is passed to a second processing unit 205 of the distributed framework. The second processing unit 205 is adapted to convert the extracted data, such that it is transformed into a suitable file format that is best suited for use by a subsequent processing node, i.e. processing unit 207 in this example. Thus, the second processing unit 205 is configured to receive the extracted data from the first processing unit 203, and convert or transform the extracted data into a format suitable for graphical representation.
The third processing unit 207 is adapted to generate a graphical representation of the converted extracted data. The third processing unit 207 is configured in this embodiment to interface with a graphical visualization unit 513 when generating the graphical representation of the extracted data. The second processing unit 205 will have converted the data into a format which is best suited for importation into the graphical visualization unit 513 for visualization purposes. For example, the extracted streaming data may be converted into a suitable format for generating a graphical representation of the data, or for making a readable file for a specific visualization software, for example Gephi software for generating a graph, and can involve, for example, converting the data into ".net" type data files.
A fourth processing node 209 is adapted to determine at least one structural property for the graphical representation that has been generated by the third processing unit 207 (the third processing unit possibly having the assistance of the visualization unit 513), and further adapted to aggregate the at least one structural property into a single estimated value based on an analysis (for example regression or prediction analysis), in order to be compared with a dynamic threshold value obtained through an analysis on the past data (which is stored in an updated knowledge base 515) to detect the possible presence of changes 211 in a system. It is noted that the steps performed by a particular processing node, for example the steps performed by node 209, may be split for processing by separate nodes, if desired. The at least one structural property may comprise one or more of the structural properties described earlier in the application, or another structural property of the graphical representation.
From the above it can be seen that each data stream generated from one single location is processed on a sliding time window and then sent to a graphical visualization unit 513 that is interfaced with the distributed framework to visualize the data in the form of a graph. The same process can be applied for multiple locations to track and provide intended QoS of other users (such as other valued customers).
The subsequent processing units obtain the structural properties of the generated graph to arrive at a single aggregated value based on a prediction or regression analysis. This may be performed by analyzing test data that assigns a specific estimate for each of the types of structural property, so as to be compared with a threshold value to detect the possible occurrence of changes in the system and activate alarms. To make the threshold value more dynamic, the apparatus can be adapted to periodically perform an analysis on the past data, for example more historic data, to fix the threshold values on-the-fly.
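A dynamic threshold of this kind can be sketched as follows. The application does not state how the threshold is derived from past data, so the rule used here (mean plus `k` population standard deviations over a bounded history) is an assumption, as are the names `DynamicThreshold`, `history_size` and `k`. The `initial` values stand in for the results of an offline or initialization stage.

```python
from collections import deque
from statistics import mean, pstdev


class DynamicThreshold:
    """Threshold recomputed from a bounded history of past estimated values."""

    def __init__(self, history_size=24, k=2.0, initial=None):
        # Bounded history standing in for the "updated knowledge base".
        self.past = deque(initial or [], maxlen=history_size)
        self.k = k

    def value(self):
        # Assumed rule: mean + k * population standard deviation.
        return mean(self.past) + self.k * pstdev(self.past)

    def check_and_update(self, estimated_value):
        """Return True if a change condition should be reported."""
        changed = bool(self.past) and estimated_value > self.value()
        self.past.append(estimated_value)  # the history evolves with the stream
        return changed


# Hypothetical values: a flat history, then a spike, then a moderate value.
threshold = DynamicThreshold(history_size=4, k=1.0, initial=[1.0, 1.0, 1.0, 1.0])
flags = [threshold.check_and_update(v) for v in (1.0, 5.0, 2.0)]
```

In this sketch every new value, including one that raised a change condition, is added to the history. Excluding flagged values from the history would be an equally reasonable design choice.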
It can be seen from the embodiments above that the system takes an input data stream, for example from charging/billing nodes, and feeds the data to a distributed processing architecture or framework. The stream data is read (or extracted or retrieved) through a first processing node and passed on to subsequent processing nodes in the framework. A subsequent processing node may be a node which processes the obtained data to transform the data into a suitable file format that can be imported into a graphical visualization unit 513 (for example GEPHI®) for visualization purposes. One of the subsequent processing nodes, the third processing node 207 of the examples above, interfaces the graphical visualization unit 513 with the distributed framework to generate the corresponding graph of the processed data obtained from the first and second processing nodes. The structural properties obtained for the generated graph by a fourth processing unit 209 are aggregated into a single estimated value based on predictive analysis in order to be compared with a threshold value (for example a dynamic threshold value) obtained through an analysis on past or historical data (which is stored in an updated knowledge base 515) to detect the possible occurrence of changes in the system. The occurrence of changes in the system is reported as a change condition or alarm condition 211 to a stakeholder. For example, an alarm may be reported through an alarm agent used in the OSS/BSS system. This alarm indicates the changes for which action needs to be taken immediately. This is illustrated by the example shown in Figure 6.
Figure 6 shows an example of how an embodiment of the invention may be used to generate a change condition or an alarm condition. The section labeled 600 relates to elements of a conventional telecommunications system, while the section labeled 200/500 relates to a distributed processing architecture according to embodiments of the invention. Charging or billing nodes 601 generate a data stream, which is retrieved or extracted by a node 603 of the distributed processing architecture 200/500. The module 605 is adapted to process the extracted data stream, for example using the techniques described in the embodiments above, and report changes that are detected via this monitoring, in order to generate an alarm signal 607 and its corresponding details. An alarm signal can be used, for example, to prompt end users, and to indicate a disorientation at a particular location, such that further action can be taken. For example, a network operator usually knows about the occurrence of huge events at a specific location at a particular time of day. Network operators need to react if their loyal customers are present at that location at that time (for example to provide better quality of service to their most loyal or influential customers). The embodiments of the invention therefore enable a change condition or alarm signal to be generated, to indicate the presence of loyal customers at a specific location to the network operators, such that the network operators can take immediate action to provide a better quality of service to their customers.
The alarm signal may be used by an operational support system and/or business support system (OSS/BSS) 609 to generate a report of an alarm 611, which can aid faster decision making (or consequential action).
As mentioned above the embodiments of the invention may be used in a variety of applications. A first example will now be described in the context of call data, and in particular the steps involved in analysing call detail records (CDR) of mobile phone customers and tower data streams. From the CDR and cell tower data streams, the input consists of the corresponding call detail records for a time window of a sliding time window, for example the current hour, with the call detail records comprising fields such as: locality, sub-locality, timestamp, originating antenna id, terminating antenna id, total number of calls shared between the antennae during the current hour, and valued customer call and mobility details.
The data stream, for example the current hour's CDR data stream, is processed to generate a corresponding ".net" file that labels all the nodes and marks edges with the labelled nodes, and the weight of the edge being equated to the total number of calls shared between the antennae. Next, the generated ".net" file is imported into a graphical visualization unit so that a graphical representation of the data for the current hour can be generated.
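The conversion into a ".net" file can be sketched as below, assuming the Pajek network format (a `*Vertices` section with numbered, labelled nodes followed by an `*Edges` section with weighted undirected edges). The function name `cdr_to_pajek` and the reduction of each call detail record to an (originating antenna, terminating antenna, calls) tuple are illustrative simplifications.

```python
def cdr_to_pajek(records):
    """Build Pajek ".net" text from (origin, terminating, calls) tuples."""
    ids = {}      # antenna label -> 1-based vertex number
    weights = {}  # (low vertex, high vertex) -> total calls on that edge
    for origin, dest, calls in records:
        for label in (origin, dest):
            ids.setdefault(label, len(ids) + 1)
        # Treat the graph as undirected: A->B and B->A share one edge.
        key = tuple(sorted((ids[origin], ids[dest])))
        weights[key] = weights.get(key, 0) + calls
    lines = [f"*Vertices {len(ids)}"]
    lines += [f'{num} "{label}"' for label, num in ids.items()]
    lines.append("*Edges")
    lines += [f"{a} {b} {w}" for (a, b), w in sorted(weights.items())]
    return "\n".join(lines) + "\n"


# Hypothetical records for one hour.
net_text = cdr_to_pajek([("A1", "A2", 5), ("A2", "A1", 3), ("A2", "A3", 7)])
```

The edge weight is the total number of calls shared between the two antennae in the window, matching the description above. Writing `net_text` to a file with a ".net" extension would give a file that a graph tool supporting the Pajek format can import.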
The generated graphical representation is then analysed to obtain its structural properties (for example one or more structural properties) and hence form a single aggregated value by combining the values of the different one or more structural properties, with prediction analysis being used to check whether the single aggregated value is within a threshold value. If the current value exceeds the threshold, then the changes are reported. The changes may be reported along with the unique identifier (ID) of a sub-set of customers, for example the loyal customers (most influential users in the network), who are affected by the QoS available at that time.
This report can be used by the service providers to perform efficient load balancing in a specific location in order to deal with the changes, and thus provide better services to the valued customers. These steps are executed in a real-time distributed framework that is integrated with a visualization unit for visualizing the data streams.
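A few of the structural properties can be computed directly from an edge list, as in the following sketch. The formulas assume a simple undirected graph with no duplicate edges or self-loops (average degree taken as 2|E|/|V|, density as 2|E|/(|V|(|V|-1))). The function name, the 0-based vertex numbering and the dictionary keys are illustrative choices.

```python
def structural_properties(num_vertices, weighted_edges):
    """Compute some structural properties of a simple undirected graph.

    weighted_edges: iterable of (u, v, weight), vertices numbered 0..n-1.
    """
    if num_vertices < 1:
        raise ValueError("graph needs at least one vertex")
    parent = list(range(num_vertices))  # union-find forest for components

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    edge_count = 0
    weight_sum = 0.0
    for u, v, w in weighted_edges:
        edge_count += 1
        weight_sum += w
        parent[find(u)] = find(v)  # merge the two components

    n = num_vertices
    return {
        "AverageDegree": 2 * edge_count / n,
        "AverageWeightedDegree": 2 * weight_sum / n,
        "GraphDensity": 2 * edge_count / (n * (n - 1)) if n > 1 else 0.0,
        "ConnectedComponentsCount": len({find(x) for x in range(n)}),
    }


# Hypothetical graph: four antennae, two weighted links, one isolated vertex.
props = structural_properties(4, [(0, 1, 8), (1, 2, 7)])
```

Properties such as modularity, average path length and average clustering coefficient need more involved algorithms and would normally be taken from a graph library or from the visualization unit itself.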
It is noted that in another example application, the embodiments of the invention may be used to find a disorientation in social network data, for example in a Twitter social network. As noted above, a disorientation can comprise, for example, any form of abnormal condition or situation, or the presence of loyal customers near to a highly transacted cell tower. Disorientation in a social network environment can comprise, for example, an abnormal condition such as the spread of some unwanted news very quickly, which could affect the integrity or security of a country or society, or some individual or company's reputation. It is noted that other forms of disorientation are intended to be embraced by the invention, as defined by the appended claims.
According to one example the embodiments of the present invention may be used in an application that provides sentiment analysis for social network data, for example when analyzing data streams from a social network such as Twitter. Sentiment analysis or opinion mining refers to the application of natural language processing, computational linguistics and text analytics to identify and extract subjective information in source materials. In one embodiment, a model such as an AFINN model (AFINN being an affective lexicon by Finn Årup Nielsen) can be used that lists various words and phrases rated for valence with an integer between -5 and +5, where -5 denotes the most negative sentiment and +5 the most positive sentiment. Thus, this process can be used by embodiments of the invention in the analysis of the sentiment of a tweet in a Twitter data stream.
The steps involved in a Twitter data stream analysis are explained further below. The location specific tweets are retrieved through the Twitter Search API and passed on to the underlying component in the distributed processing topology. The retrieved tweets may be processed to remove certain words, for example stop words such as "a, an, the, those" etc., and then categorized based on the keywords present. The categories may comprise, for example, categories relating to a number of topics such as "Music, Sports, Politics and Others". Then, further analysis is performed on the sentiment of the keywords by grouping them as "happy or sad" by the use of a sentiment analysis model. The categorized location specific tweets can be transformed into a ".net" file to import into the graphical visualization unit for graph generation purposes. In the user graph generated for each category, the nodes represent the users and the edges represent the total number of tweets shared between the users. The generated user graph can then be analyzed to obtain its structural properties, and a regression analysis performed to check for the possible presence of a disorientation. Thus, this process is run in a parallelized real-time distributed framework interfaced with a visualizer for visualizing the Twitter data stream.
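The stop-word removal and lexicon-based sentiment grouping described above can be sketched in a few lines. The lexicon below is a hypothetical miniature in the style of AFINN (integer valence between -5 and +5). Its entries are illustrative and are not guaranteed to match the real AFINN list. The "neutral" label for a zero score is an added assumption, since the text only names "happy" and "sad".

```python
STOP_WORDS = {"a", "an", "the", "those"}

# Hypothetical AFINN-style entries: word -> integer valence in [-5, +5].
LEXICON = {"great": 3, "love": 3, "bad": -3, "terrible": -4}


def score_tweet(text):
    """Return (keywords, valence score, sentiment label) for one tweet."""
    # Lower-case each token and trim surrounding punctuation.
    words = [w.strip(".,!?\"'").lower() for w in text.split()]
    keywords = [w for w in words if w and w not in STOP_WORDS]
    # Words missing from the lexicon contribute a valence of zero.
    score = sum(LEXICON.get(w, 0) for w in keywords)
    label = "happy" if score > 0 else "sad" if score < 0 else "neutral"
    return keywords, score, label


# Hypothetical tweets.
positive = score_tweet("I love the great concert!")
negative = score_tweet("Those terrible delays")
```

The surviving keywords could also drive the topic categorization (for example "Music" or "Sports") before the per-category user graph is built.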
Next, in relation to Figures 7a to 7j there will be described the results of an example of an application relating to the use of mobile phone customers' data stream using the distributed Storm framework and Gephi toolkit. In the example the regression analysis on the training data set resulted in the following equation:
Estimated Value = 0.775*AverageWeightedDegree + 1.166*AveragePathLength - 1.127*Modularity + 0.658*ConnectedComponentsCount - 0.380*AverageClusteringCoefficient + 0.343*AverageDegree
In the above the estimated value represents an example of a disorientation that has happened in one situation, where coefficients denote the constants calculated and the estimated value refers to the abnormal value that is compared to the regular threshold value.
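As an illustration, the regression equation can be evaluated for a given set of structural property values as follows. The coefficients are those of the equation above. The property values and the threshold used in the example are hypothetical.

```python
# Coefficients taken from the regression equation in the text.
COEFFICIENTS = {
    "AverageWeightedDegree": 0.775,
    "AveragePathLength": 1.166,
    "Modularity": -1.127,
    "ConnectedComponentsCount": 0.658,
    "AverageClusteringCoefficient": -0.380,
    "AverageDegree": 0.343,
}


def estimated_value(properties):
    """Aggregate the structural properties into a single estimated value."""
    return sum(coef * properties[name] for name, coef in COEFFICIENTS.items())


# Hypothetical property values for one time window.
window_properties = {
    "AverageWeightedDegree": 1.0,
    "AveragePathLength": 1.0,
    "Modularity": 1.0,
    "ConnectedComponentsCount": 1.0,
    "AverageClusteringCoefficient": 1.0,
    "AverageDegree": 1.0,
}
value = estimated_value(window_properties)

# Hypothetical threshold; a change condition is raised when it is exceeded.
change_detected = value > 1.0
```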
From the visualized graphs shown in Figures 7a to 7i there can be seen the evolving nature of the data streams in real time. The threshold value was fixed dynamically on-the-fly. Thus, the evolution of the CDR data stream is traced by visualizing the same for each hour by means of a graph. Figures 7a to 7i show the graphs at time windows corresponding to 12am, 1am, 4am, 6am, 10am, 12pm, 1pm, 2pm and 3pm, respectively. The sizes of the various nodes are used to rank "centrality" (for example in Figure 7a the node 701 is more central than node 703), while different shading is used to show "modularity". The thickness of an edge indicates how strong the link between the nodes is (for example, in Figure 7a the link 705 between nodes 707 and 709 is stronger than the link 711 between nodes 707 and 701). In the given data set, disorientations were found at 1pm (Figure 7g), 2pm (Figure 7h) and 3pm (Figure 7i) for the given CDR data stream, and these disorientations would have been reported as change conditions or alarms, such that necessary actions can be taken, for example load balancing. The disorientations may be detected at these times by comparing the estimated value for the structural properties of these graphs with the threshold value (i.e. an aggregated estimated value based on the structural properties noted in the equation above), with the aggregated estimated value exceeding the threshold value in the graphs for 1pm, 2pm and 3pm. A disorientation is therefore detected as a change in the estimated value, when the estimated value goes above the threshold value. Calculation of the estimated value is based on the changes in each graphical (structural) property.
After finding the occurrence of a disorientation, the overloaded antennae involved can be identified and a list of users (for example loyal customers) who are using those particular antennae can also be generated, as illustrated in Figure 7j, to aid a service provider in taking the necessary actions. As such, according to such an embodiment the data stream is location specific, and the step of reporting a change condition can further comprise the steps of determining one or more overloaded antennae associated with the specific location involved, identifying a sub-set of users that are operationally coupled to the one or more overloaded antennae, and changing a service parameter of at least one of the identified sub-set of users. The step of changing a service parameter may comprise the step of performing a location-based quality of service change.
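The reporting steps above (finding overloaded antennae and the valued users coupled to them) can be sketched as follows. The function name, the use of a single `capacity` limit to decide that an antenna is overloaded, and the data layout are assumptions made for illustration.

```python
def users_on_overloaded_antennae(antenna_load, antenna_users, capacity, valued_users):
    """Map each overloaded antenna to the valued users coupled to it.

    antenna_load:  antenna id -> load in the current time window
    antenna_users: antenna id -> iterable of user ids using that antenna
    capacity:      assumed load limit above which an antenna is overloaded
    valued_users:  set of user ids regarded as valued or loyal customers
    """
    overloaded = sorted(a for a, load in antenna_load.items() if load > capacity)
    return {
        a: sorted(u for u in antenna_users.get(a, ()) if u in valued_users)
        for a in overloaded
    }


# Hypothetical loads and user assignments for one location.
affected = users_on_overloaded_antennae(
    antenna_load={"A1": 120, "A2": 40, "A3": 95},
    antenna_users={"A1": ["u1", "u2", "u3"], "A2": ["u4"], "A3": ["u2", "u5"]},
    capacity=90,
    valued_users={"u2", "u3", "u4"},
)
```

The resulting mapping corresponds to the kind of list shown in Figure 7j, and could be used to select the users whose service parameter is to be changed.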
It is noted that the application relating to the second-use case, i.e. the social network example described above, can be implemented in a similar way to report on disorientations.
Thus, from Figures 7a to 7i it can be seen that graphical representations with no changes are sparser in nature, whereas graphs with changes are denser. Figure 7j shows a list of the most influential users connected to each antenna. This list can be used by a service provider to take the necessary actions.
The embodiments of the invention provide an approach that integrates the distributed framework and a visualizer to visualize large streams of data by constructing graphs. The embodiments have been implemented with the capability to visualize the location specific CDR data stream in the form of graphs in a real-time distributed environment, for example to automatically or dynamically detect changes without affecting the loyal customers through load balancing.
Since a visual approach has been introduced through the generation of graphs, the spread of topics in a given location and the severity can be easily identified and used to avoid disorientation, and this can be used for other social networking research purposes.
It can be seen that each data stream generated from one single location is processed on a sliding time window and then sent to a visualizer that is interfaced with the distributed framework to visualize the data in the form of a graph. The same process can be applied for multiple locations to track and provide intended QoS of the other valued customers. The subsequent components (processing nodes) obtain the structural properties of the generated graph to arrive at a single aggregated value based on the prediction method. This may be performed by analyzing the test data that assigns a specific coefficient for each structural property, so as to be compared with a threshold value to detect the possible occurrence of changes in the system and activate alarms. To make the threshold value more dynamic, the system may be configured to periodically perform an analysis on the past data to fix the threshold values on-the-fly (which may involve some offline processing).
The embodiments of the invention described above provide a distributed framework having an integrated visualization unit, for visualizing large streams of data by constructing social graphs on different time windows.
The embodiments of the invention provide the capability to visualize the location specific data stream in the form of graphs in a real-time distributed environment.
In embodiments relating to social network data streams, changes in trends can be reported by means of graphical representation and by analysing the various structural properties of social graphs to track the changes. Structural properties of social graphs are analysed to determine change in trends, which can be tracked over time, i.e. over subsequent sliding windows.
It should be noted that the above-mentioned embodiments illustrate rather than limit the invention, and that those skilled in the art will be able to design many alternative embodiments without departing from the scope of the appended claims. The word "comprising" does not exclude the presence of elements or steps other than those listed in a claim, "a" or "an" does not exclude a plurality, and a single processor or other unit may fulfil the functions of several units recited in the claims. Any reference signs in the claims shall not be construed so as to limit their scope.
Claims
1. A method of processing a data stream of a communication network in a distributed processing architecture comprising a plurality of processing units, the method comprising the steps of:
extracting data from the data stream, wherein the data is extracted for a particular time window of a sliding time window;
converting the extracted data into a format suitable for graphical representation;
generating a graphical representation of the converted extracted data;
determining an estimated value of at least one structural property of the graphical representation of the data;
comparing the estimated value of the at least one structural property with a threshold value; and
reporting a change condition based on the outcome of the comparison step.
2. A method as claimed in claim 1, wherein the threshold value is determined during an initialization phase of operation, wherein the initialization phase of operation comprises the steps of:
retrieving a data stream relating to historical data;
generating a graphical representation of the historical data;
determining an estimated value for one or more structural properties of the graphical representation of the historical data; and
analysing the estimated value of the one or more structural properties to generate the threshold value.
3. A method as claimed in claim 1, wherein the method further comprises the step of updating the threshold value during use, by periodically performing the steps of claim 2 for more recent historical data, thereby adjusting the threshold value dynamically during use.
4. A method as claimed in any one of the preceding claims, wherein the graphical representation of the data comprises a set of vertices (V) and a set of edges (E) between the set of vertices, wherein an edge (E) connects a first vertex (Vi) with a second vertex (Vj), and wherein an estimated value of a structural property of the graphical representation of the data comprises one or more of:
an average path length value, relating to the average number of steps along the shortest paths for all possible pairs of first and second vertices;
a connected component count value, relating to a sub-graphical representation of the graphical representation of the data, in which any two vertices are connected to each other by paths, and which is connected to no additional vertices in the main graphical representation;
an average clustering coefficient value, each clustering coefficient value providing an indication regarding the degree to which vertices in a graph tend to cluster together;
an average degree value, relating to the number of edges in a set of edges E in comparison to the number of vertices in the set of vertices V;
a graph density value, relating to a measure of how many edges are in a set of edges E compared to a maximum possible number of edges between vertices in the set of vertices V;
a modularity value, relating to a measure of the strength of division of a graph into modules; or
an average weighted degree value, relating to an average of the sum of weights of the edges of the nodes.
5. A method as claimed in any one of the preceding claims, wherein estimated values relating to two or more structural properties are combined to provide a single aggregated estimated value, the single aggregated estimated value being compared with the threshold value.
6. A method as claimed in any one of claims 1 to 4, wherein an estimated value of a particular structural property is compared with a respective threshold value for that respective structural property.
7. A method as claimed in any one of the preceding claims, wherein the step of generating a graphical representation comprises the step of interfacing with a graphical visualization unit to generate the graphical representation of the data.
8. A method as claimed in any one of the preceding claims, wherein the data stream is received from one or more locations of the communications network.
9. A method as claimed in any one of the preceding claims, wherein the data stream is location specific, and wherein the step of reporting a change condition further comprises the steps of:
determining one or more overloaded antennae associated with the specific location involved;
identifying a sub-set of users that are operationally coupled to the one or more overloaded antennae; and
changing a service parameter of at least one of the identified sub-set of users.
10. A method as claimed in claim 9, wherein the step of changing a service parameter comprises the step of performing a location-based quality of service change.
11. A method as claimed in any one of the preceding claims, wherein the data stream comprises:
data relating to call detail records of a telecommunications operator;
data relating to cell tower data of a telecommunications network;
data relating to user data of a telecommunications or communications network;
data relating to a social network operating in a communications or a telecommunications network.
12. A method as claimed in any one of the preceding claims, wherein the method steps are performed in a sequential manner at different processing units of the distributed processing architecture, such that one processing unit is configured to process data relating to one time window, while a preceding processing unit is configured to process data associated with a subsequent time window.
13. A distributed processing architecture for processing a data stream of a communications network, the distributed processing architecture comprising:
a first processing unit adapted to extract data from the data stream, wherein the data is extracted for a particular time window of a sliding time window;
a second processing unit adapted to receive the extracted data from the first processing unit, and convert the extracted data into a format suitable for graphical representation;
a third processing unit adapted to generate a graphical representation of the converted extracted data; and
a fourth processing unit adapted to determine an estimated value of at least one structural property of the graphical representation of the data, and further adapted to compare the estimated value of the at least one structural property with a threshold value, and report a change condition based on the outcome of the comparison.
14. A distributed processing architecture as claimed in claim 13, wherein the third processing unit is configured to interface with a graphical visualization unit when generating the graphical representation of the converted extracted data.
15. A distributed processing architecture as claimed in claim 13 or 14, wherein the distributed processing architecture is further adapted to perform the method steps as defined in any one of claims 2 to 6, or 8 to 12.
Priority Applications (3)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US14/915,088 US10225165B2 (en) | 2013-08-26 | 2013-08-26 | Apparatus and method for processing data streams in a communication network |
PCT/SE2013/050994 WO2015030637A1 (en) | 2013-08-26 | 2013-08-26 | Apparatus and method for processing data streams in a communication network |
EP13765816.7A EP3039821B1 (en) | 2013-08-26 | 2013-08-26 | Apparatus and method for processing data streams in a communication network |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/SE2013/050994 WO2015030637A1 (en) | 2013-08-26 | 2013-08-26 | Apparatus and method for processing data streams in a communication network |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2015030637A1 true WO2015030637A1 (en) | 2015-03-05 |
Family
ID=49226488
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/SE2013/050994 WO2015030637A1 (en) | 2013-08-26 | 2013-08-26 | Apparatus and method for processing data streams in a communication network |
Country Status (3)
Country | Link |
---|---|
US (1) | US10225165B2 (en) |
EP (1) | EP3039821B1 (en) |
WO (1) | WO2015030637A1 (en) |
2013
- 2013-08-26 WO PCT/SE2013/050994 (WO2015030637A1): active, Application Filing
- 2013-08-26 EP EP13765816.7A (EP3039821B1): active, Active
- 2013-08-26 US US14/915,088 (US10225165B2): active, Active
Non-Patent Citations (1)
Title |
---|
SEDLAR U ET AL: "Contextualized monitoring and root cause discovery in IPTV systems using data visualization", IEEE NETWORK, IEEE SERVICE CENTER, NEW YORK, NY, US, vol. 26, no. 6, 1 November 2012 (2012-11-01), pages 40 - 46, XP011487753, ISSN: 0890-8044, DOI: 10.1109/MNET.2012.6375892 * |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US10565311B2 (en) | 2017-02-15 | 2020-02-18 | International Business Machines Corporation | Method for updating a knowledge base of a sentiment analysis system |
US11755841B2 (en) | 2017-02-15 | 2023-09-12 | International Business Machines Corporation | Method for updating a knowledge base of a sentiment analysis system |
CN113993001A (en) * | 2021-09-08 | 2022-01-28 | 四创电子股份有限公司 | Real-time streaming analysis alarm method based on sliding data window |
CN113993001B (en) * | 2021-09-08 | 2024-04-12 | 四创电子股份有限公司 | Real-time stream analysis alarm method based on sliding data window |
Also Published As
Publication number | Publication date |
---|---|
US10225165B2 (en) | 2019-03-05 |
EP3039821B1 (en) | 2016-12-21 |
US20160212023A1 (en) | 2016-07-21 |
EP3039821A1 (en) | 2016-07-06 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
EP3039821B1 (en) | | Apparatus and method for processing data streams in a communication network |
US8861691B1 (en) | | Methods for managing telecommunication service and devices thereof |
US9832280B2 (en) | | User profile configuring method and device |
JP2019521427A (en) | | Network Advisor Based on Artificial Intelligence |
US20150207696A1 (en) | | Predictive Anomaly Detection of Service Level Agreement in Multi-Subscriber IT Infrastructure |
US11381471B2 (en) | | System and method for predicting and handling short-term overflow |
CN110460608B (en) | | Situation awareness method and system including correlation analysis |
CN109120463A (en) | | Method for predicting and device |
CN116166505B (en) | | Monitoring platform, method, storage medium and equipment for dual-state IT architecture in financial industry |
Solmaz et al. | | ALACA: A platform for dynamic alarm collection and alert notification in network management systems |
CN115022153A (en) | | Fault root cause analysis method, device, equipment and storage medium |
CN109995558A (en) | | Failure information processing method, device, equipment and storage medium |
CN114969366A (en) | | Network fault analysis method, device and equipment |
TWI662809B (en) | | Obstacle location system and maintenance method for image streaming service |
EP4149075B1 (en) | | Automatic suppression of non-actionable alarms with machine learning |
Fahd et al. | | A framework for real-time sentiment analysis of big data generated by social media platforms |
Rueda et al. | | Big data streaming analytics for qoe monitoring in mobile networks: A practical approach |
Bingöl et al. | | Topic-based influence computation in social networks under resource constraints |
Yayah et al. | | Adopting big data analytics strategy in telecommunication industry |
CN116155541A (en) | | Automatic machine learning platform and method for network security application |
Lashari et al. | | Monitoring public opinion by measuring the sentiment of retweets on Twitter |
CN115514618A (en) | | Alarm event processing method and device, electronic equipment and medium |
Giakatos et al. | | Benchmarking graph neural networks for internet routing data |
Streiffer et al. | | Learning to simplify distributed systems management |
Yousef et al. | | On the use of predictive analytics techniques for network elements failure prediction in telecom operators |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| | 121 | EP: the EPO has been informed by WIPO that EP was designated in this application | Ref document number: 13765816; Country of ref document: EP; Kind code of ref document: A1 |
| | NENP | Non-entry into the national phase | Ref country code: DE |
| | WWE | WIPO information: entry into national phase | Ref document number: 14915088; Country of ref document: US |
| | REEP | Request for entry into the European phase | Ref document number: 2013765816; Country of ref document: EP |
| | WWE | WIPO information: entry into national phase | Ref document number: 2013765816; Country of ref document: EP |