EP2686782A1

EP2686782A1 - Method and device for recognizing and tagging of peaks, increases, or abnormal or exceptional variations in the throughput of a stream of digital documents

Info

Publication number: EP2686782A1
Application number: EP12710075.8A
Authority: EP
Inventors: Jean-Charles Campagne; Paul Guyot; David JULIEN
Original assignee: Semiocast
Current assignee: Semiocast
Priority date: 2011-03-18
Filing date: 2012-03-16
Publication date: 2014-01-22
Also published as: US20150205862A1; FR2972822A1; WO2012146440A1

Abstract

The invention relates to a method and a device which make it possible to produce an explanatory view of the changes in the throughput of a stream of documents, or to alert an operator by indicating the main subjects of the abnormal variations in said throughput. The device implements a process (1) of recognizing the periods during which the throughput of a stream of documents varies abnormally, a process (2) of morphologically analyzing text, a process (3) of determining, for a given period, the character strings of which the frequencies are the highest for the documents of the period, and a process (4) of building a tag from the strings identified by said process (3). The device can be coupled to an alarm (23) or display (22) system. The method and the device according to the invention are particularly intended for monitoring over social networks.

Description

METHOD AND APPARATUS FOR IDENTIFYING AND LABELING PEAKS, INCREASES OR ABNORMAL OR EXCEPTIONAL VARIATIONS IN THE THROUGHPUT OF A STREAM OF DIGITAL DOCUMENTS

The field of the invention is telecommunications and, in particular, the analysis of digital document flows. The invention also applies to the analysis of large masses of digital documents. These digital documents may be e-mails, short messages (SMS), messages, articles or comments posted on Internet sites, blogs, forums or social networks, instant messages and any other type of message or digital document posted or published in the form of text, or which contains text, or which can be analyzed by a device generating text, such as a voice recognition device. These digital documents can be addressed specifically or implicitly to recipients or made public for a community or for everyone. These digital documents are associated with one or more dates of publication, sending or modification.

The present invention relates to a method and a device for identifying and labeling abnormal or exceptional peaks, increases or variations in the throughput of a digital document stream from one or more social networks or a collection of blogs or websites, to alert an operator or produce a synthetic and explanatory view of the evolution of the flow.

The general problem is to produce a synthetic and explanatory view of the evolution of the flow rate of a flow of digital documents, or to alert an operator by indicating the main subject or subjects of the peaks or of the abnormal or exceptional variations in that flow rate.

There are a number of devices that produce graphs with time on the abscissa and the flow rate (number of documents per time unit) on the ordinate. These devices make it possible to explore the documents that have been published at a given moment or during a given period. They sometimes make it possible to highlight peaks, increases or abnormal or exceptional variations of the flow rate, or to alert an operator when such variations occur. Such devices implement different methods for dating and measuring abnormal or exceptional variations in the flow rate of a document flow. One of these methods is to compare the flow rate at a given moment with the average flow rate over a longer period. More advanced methods rely on transforms, for example on discrete wavelet transforms, as described by A. Haar in the article "Zur Theorie der orthogonalen Funktionensysteme" published in Mathematische Annalen 69 in 1910, no. 3, pages 331-371; there is abundant literature on the detection of peaks from such transforms, such as international patent application WO 2010/007486, or of anomalies, such as the paper by C. T. Huang et al. entitled "Wavelet-based Real Time Detection of Network Traffic Anomalies", in Securecomm and Workshops, published in 2006 by IEEE.
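The simplest of these prior-art methods, comparing the flow rate at a given moment with the average over a longer period, can be sketched as follows. This is only an illustration and is not taken from any of the cited devices: the function name `flag_above_average`, the window length and the factor `ratio` are assumptions chosen for the example.

```python
from statistics import mean

def flag_above_average(counts, window=24, ratio=2.0):
    """Return the indices of the time units whose document count exceeds
    `ratio` times the mean of the `window` preceding time units.
    Both parameters are hypothetical tuning values."""
    flagged = []
    for i in range(window, len(counts)):
        baseline = mean(counts[i - window:i])
        if baseline > 0 and counts[i] > ratio * baseline:
            flagged.append(i)
    return flagged

# 24 quiet hours followed by a short burst in hour 25.
hourly = [10] * 24 + [12, 45, 11]
print(flag_above_average(hourly))
```

Such a test only dates the variation; it gives no indication of what the documents in the flagged period are about.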

These traditional methods of detecting abnormal or exceptional variations in the flow rate of a document flow do not make it possible to obtain qualitative information to explain these variations. To qualify these variations, and in particular to associate them with an external event such as a communication campaign or a crisis, the operator must traditionally explore the documents that make up the peaks observed. This task can be particularly tedious. In particular, when the usual flow rate is high, for example several thousand documents per hour, a significant proportion of the documents do not relate to the subject of the peak and cannot explain the variations. The operator can easily be overwhelmed by the mass of documents.

There are also devices for determining the topics that are most present in a document flow, or the topics whose presence increases significantly. For example, sites like SEARCH.TWITTER.COM, TWIRUS.COM or BING.COM display lists of "trending topics" on social networks. These lists are reportedly constructed from the derivative of the frequency of morphemes or groups of morphemes in the analyzed documents, as described in the blog post by C. Penner entitled "To Trend or Not to Trend ..." and published in 2010 on BLOG.TWITTER.COM. Other techniques for building such lists rely on a measure of entropy or on the product called "TF-IDF" and are described in the report by J. Benhardus, "Streaming Trend Detection in Twitter", published at the 2010 UCCS REU for Artificial Intelligence, Natural Language Processing and Information Retrieval. The "TF-IDF" product and term weighting methods are described in particular in "Term-weighting approaches in automatic text retrieval" by Salton, G. et al., published in 1988 in Information Processing and Management, Vol. 24, No. 5, pages 513 to 523. However, while these techniques make it possible to highlight words or groups of words in a stream as wide as the flow of public Twitter messages at a given instant, they do not make it possible to produce a synthetic and explanatory view of the peaks or significant variations in the flow rate of the flow of documents relating to a given subject.

The solutions known in the prior art for enabling an analyst, for example a sociologist, a researcher or a commentator, to grasp very large volumes of information in real time and without presuppositions about explanatory models are technically limited by the volume of data taken into account.

In known solutions, it is necessary to have technical equipment to access a digital flow of messages and documents: these are tools installed on servers which, on the one hand, are permanently connected to a continuous stream of digital information circulating on social networks, via a programming interface (API), and, on the other hand, continually query the servers hosting blogs or forum-type websites in order to download the selected digital documents locally. All of these collected documents constitute a flow of digital documents.

This technical equipment generates huge amounts of digital data, ranging from tens of thousands to millions of messages or articles per day.

A possible solution for understanding these volumes of information would be to read and classify each of the digital documents in order to deduce an interpretation from a human analysis. This solution is not reasonable when the flow rate is very high (several thousand to several million documents per day) or when the explanation must be provided within a short time, of the order of a minute.

The known means make it possible to construct an evolution curve of the information rate over time for the whole of a selected stream or for a selection corresponding to predetermined specifications (for example the flow of documents which contain such a word or combination of terms). Thus, powerful computing resources make it possible to verify hypotheses previously fixed by the analyst, assumptions which must be constructed in the form of predefined queries and which determine a subset of the flow.

In other words, these powerful computing resources do not offer the analyst explanatory models. They only make it possible to test hypotheses. The person skilled in the art, confronted with this technical problem, would multiply the hypotheses indefinitely, test them one after the other, and retain the one which, applied to the available data volume, generates a curve consistent with an explanatory model.

It is obvious that the processing time increases exponentially as a function of:

- the number of digital documents to be processed;

- the number of hypotheses submitted by the analyst.

As a result, the analyst would not be able to draw relevant conclusions in a timely manner.

The technical problem that arises is that of the real-time processing capacity of large volumes of data to carry out analyzes that make it possible to explain changes in volume or flow rate indicative of external events.

The method according to the invention overcomes the disadvantages of traditional methods. For this, the invention proposes a technical method, executed by a computer, comprising a succession of processing steps:

- a first step of processing the data flow to characterize quantitative peaks (this step identifies the changes of regime in the flow). This step involves determining one or more time intervals and ordering the recording of the digital documents corresponding to these intervals for subsequent processing;

- a second step of technical processing consisting in extracting sequences of characters from the documents thus isolated, by cutting the texts into strings of characters and recording the strings thus identified in another memory area;

- a third step of creating an index of the strings extracted in the second step, associating with each string the relevant documents and a quantitative indicator that measures the importance of the string in these documents relative to the flow, and then determining the most important strings according to this quantitative indicator;

- a last step of providing a label constructed from the documents associated with the strings identified in the third step.

These are not purely intellectual methods, because a human operator would be in no way able to carry out all these different steps and processing operations. In addition, all these steps involve digital data which have no direct cognitive reality.

The invention comprises, according to a first characteristic:

a first method for identifying the periods when the flow rate of the digital document stream varies abnormally or exceptionally, or peaks or increases significantly;

a second method of morphological analysis making it possible to extract strings of characters from a digital document and to distinguish, among these strings of characters, those which correspond to the morphemes or groups of morphemes from those which correspond to the separators between the morphemes or groups of morphemes;

a third method for determining, for each of these periods, the strings of characters extracted by the above method whose frequencies are highest for the digital documents of each period distinguished by the first method compared to the digital documents outside these periods;

a fourth method making it possible to construct, for all or for a subset of the periods distinguished by the first method, a label from all or a sample of the digital documents of the given period and from a subset or all of the strings of characters distinguished by the preceding method.

According to particular embodiments:

the first method operates as a high-pass filter based on wavelets. The documents are counted per time unit (hour, day), and the sequence thus obtained forms a signal which is filtered by eliminating the coefficients of the discrete wavelet decomposition that are below a certain threshold in absolute value. The distinguished periods are defined as the periods during which the signal recomposed after filtering has a strictly positive value. Compared with the naive approach, obvious to the person skilled in the art, of comparing the number of documents per time unit with the average, this approach has the double advantage of identifying peaks or exceptional increases even when the average flow rate is high but the recent flow rate is lower than average, and of delimiting peak periods more precisely than the mere exceeding of the average;

the first method works by comparing the signal with a periodic or quasi-periodic model. Such a model is established a priori, for example as a linear combination of several periodic functions with a period of 24 hours or 7 days. The model coefficients are obtained by the least squares method from the historical data. The distinguished periods are defined as the periods during which the difference between the signal and the model is greater than a certain threshold. This approach has the same advantages as the previous approach compared to the naive approach. It also makes it possible to detect smaller peaks more precisely, especially when the signal is highly periodic, as can be seen on social networks where the activity is highly dependent on the daily and weekly rhythm. On the other hand, compared to the previous approach, this approach has the disadvantage of being computationally heavier and of requiring the development of a model for the analyzed flow. This approach also does not detect peaks that are recurrent and periodic in the historical data;

the second method is a cutting of the digital documents according to the spaces and the punctuation. This approach has the advantage of being very simple and easy to implement. The cutting thus produced does not correspond to a very precise morphological analysis but is sufficient, in the context of the invention, to obtain labels for each of the peaks, increases or abnormal or exceptional variations in the flow rate;

the second method is a cutting of digital documents according to a segmentation model based on statistical data, grammatical rules, a dictionary or hidden Markov chains. Such a method could for example be that described in patent JP2897942. This approach has the advantage of being able to extract strings from digital documents written in languages where the words are generally not separated by spaces or punctuation, such as Japanese, Chinese or Thai;

the second method consists of a first step of identifying the language of the digital document, followed by a set of word-separation methods, each specialized for one of the languages processed. This approach advantageously makes it possible to process a stream of digital documents written in different languages;

the third method works by eliminating, from the strings of characters determined by the second method, those which appear in a list of stop words (empty words or tool words). This approach has the advantage of avoiding the construction of labels from stop words;

the third method works by calculating the product "TF-IDF" for the occurrences of the character strings extracted by the second method, then selecting the string or strings for which this product is the highest;

the fourth method works by searching for the character string, composed of a set of morphemes distinguished by the second method and present in the digital documents, which maximizes a function defined as the sum of the frequencies of the set of character substrings of this string in all digital documents;

the process as a whole is implemented in a device which presents the operator with a graph of the flow rate and highlights the main peaks, increases or abnormal or exceptional variations in the flow rate and displays, statically or interactively, labels associated with these abnormal or exceptional peaks, increases or variations;

the method as a whole is implemented in a device coupled with a configurable filtering system which presents to the operator a graph of the flow rate of a subset of the analyzed flow, highlights the main peaks, increases or abnormal or exceptional variations in the flow rate, and associates them with labels. This device advantageously allows the operator to adjust the filtering to analyze more particularly the flow rate with respect to these peaks, to obtain more information on these peaks or the rest of the curve, and possibly to reveal other peaks;

the process as a whole is implemented in a device coupled to an alert or notification system.
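The wavelet-based embodiment of the first method listed above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the invention: it uses the Haar wavelet, assumes a signal whose length is a power of two, and obtains the high-pass behaviour by discarding the coarsest approximation coefficient before recomposing the signal. The function names and the threshold value are hypothetical.

```python
import math

def haar_decompose(signal):
    """Full orthonormal Haar decomposition.
    Returns (approximation, details ordered from coarsest to finest)."""
    n = len(signal)
    if n == 0 or n & (n - 1):
        raise ValueError("signal length must be a power of two")
    approx, details = list(signal), []
    while len(approx) > 1:
        pairs = list(zip(approx[0::2], approx[1::2]))
        details.insert(0, [(a - b) / math.sqrt(2) for a, b in pairs])
        approx = [(a + b) / math.sqrt(2) for a, b in pairs]
    return approx[0], details

def haar_reconstruct(approx, details):
    """Inverse of haar_decompose."""
    current = [approx]
    for level in details:
        nxt = []
        for a, d in zip(current, level):
            nxt.append((a + d) / math.sqrt(2))
            nxt.append((a - d) / math.sqrt(2))
        current = nxt
    return current

def peak_periods(counts, threshold):
    """Indices of the time units where the filtered signal is strictly positive."""
    _, details = haar_decompose(counts)
    # Eliminate the coefficients below the threshold in absolute value.
    kept = [[d if abs(d) >= threshold else 0.0 for d in level] for level in details]
    # Assumption: the approximation coefficient is dropped (set to 0) so that
    # the recomposed signal carries no constant component.
    filtered = haar_reconstruct(0.0, kept)
    # The small epsilon guards against rounding noise around zero.
    return [i for i, value in enumerate(filtered) if value > 1e-9]

print(peak_periods([10, 10, 10, 10, 10, 50, 10, 10], threshold=10.0))
```

With these choices, a flat signal yields no distinguished period, and an isolated burst is delimited to the time unit in which it occurs.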

Other advantages and features of the invention will become apparent from the following description of a preferred embodiment, given with reference to the accompanying drawings, in which:

- Figure 1 shows a device implementing the various methods;

- Figure 2 shows a device which presents to the operator a graph of the flow rate, highlighting the main peaks, increases or abnormal or exceptional variations in the flow rate, and which is coupled to a notification system;

- Figure 3 shows a graph as generated by said device.

FIG. 1 represents the composition of the various processes and the flow (11) of digital documents through a device according to the invention.

The digital documents are initially stored in alphanumeric form in a table of a relational database (10). Each digital document is stored in a row comprising a column with the text of the document, and a column with the date of publication of the document if it exists, or otherwise the date on which the document was retrieved. For reasons of speed, the relational database is configured to index the date column with an ordered index, for example in the form of a B-tree.
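A storage layout of this kind can be sketched with SQLite, whose indexes are B-trees. The sketch below is an assumption-laden illustration: the table name `documents`, the column names and the helper functions are hypothetical, and dates are stored as ISO 8601 text so that they sort chronologically.

```python
import sqlite3
from datetime import datetime

def open_store():
    """Create an in-memory database with one row per document."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE documents ("
        " id INTEGER PRIMARY KEY,"
        " body TEXT NOT NULL,"
        " doc_date TEXT NOT NULL)"
    )
    # Ordered index on the date column, used by every range query below.
    conn.execute("CREATE INDEX idx_documents_date ON documents (doc_date)")
    return conn

def store_document(conn, body, published, retrieved):
    """Store the publication date if it exists, otherwise the retrieval date."""
    when = published if published is not None else retrieved
    conn.execute(
        "INSERT INTO documents (body, doc_date) VALUES (?, ?)",
        (body, when.isoformat(timespec="seconds")),
    )

def documents_between(conn, start, end):
    """Return the texts of the documents dated in [start, end), oldest first."""
    rows = conn.execute(
        "SELECT body FROM documents WHERE doc_date >= ? AND doc_date < ?"
        " ORDER BY doc_date",
        (start.isoformat(timespec="seconds"), end.isoformat(timespec="seconds")),
    )
    return [row[0] for row in rows]
```

Storing a single date per row, already resolved to the publication or retrieval date, keeps the range queries of the later steps down to one indexed column.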

When the operator (27) queries the monitoring device (26), the device (12) first queries the relational database using the aforementioned index, to count, for each period of time (hour or day), the number of documents stored in the database, over a window chosen by the operator. This information makes it possible to draw the flow rate curve of the documents on the terminal (22), an example of which is represented in FIG. 3. This curve can summarize a very large mass of documents. This curve can be refreshed in real time when new documents are stored in the relational database (10).
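The counting step that produces the flow rate curve can be illustrated as follows. This sketch works on a list of dates already read from the database, whereas the same grouping could equally be done in the query itself; the function name `hourly_counts` is hypothetical and the window is assumed to start on the hour.

```python
from collections import Counter
from datetime import datetime, timedelta

def hourly_counts(dates, start, end):
    """Count the documents per hour over the window [start, end).
    Hours without any document appear with a count of 0."""
    buckets = Counter(
        d.replace(minute=0, second=0, microsecond=0)
        for d in dates
        if start <= d < end
    )
    series = []
    hour = start
    while hour < end:
        series.append((hour, buckets.get(hour, 0)))
        hour += timedelta(hours=1)
    return series

dates = [
    datetime(2012, 3, 16, 10, 5),
    datetime(2012, 3, 16, 10, 50),
    datetime(2012, 3, 16, 12, 1),
]
window = hourly_counts(dates, datetime(2012, 3, 16, 10), datetime(2012, 3, 16, 13))
print([count for _, count in window])
```

Filling the empty hours with zeros matters for the detection step, which expects a regularly sampled signal.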

The device (12) implements the method (1) to identify periods of peaks, increases or abnormal or exceptional variations. These peak periods can be highlighted by a marker (31) at the local maximum on the interface of the terminal (22). At the same time, the device (13) queries the relational database by using the aforementioned index to implement the method (2) in order to associate, with each document, a sequence of character strings representing a morpheme or a group of morphemes. The documents, associated with these string sequences, and the identified periods are then used by a device (14) implementing the method (3) to determine the most frequent character strings in each period identified with respect to the set of documents. This method (3) works by first eliminating words that are part of stop word lists, then for each of the character strings, the device calculates the product called "TF-IDF" and retains the n strings for which this product is the highest, n being a parameter of the process whose value can for example be 5.
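The chain formed by methods (2) and (3) can be sketched as follows, using the simplest embodiment of method (2), a cutting on spaces and punctuation. The text does not give the exact TF-IDF formula, so the one used here is an assumption: the term frequency is counted in the documents of the period, and the inverse document frequency is the logarithm of the total number of documents divided by the number of documents containing the string. The stop-word list and the function names are illustrative only.

```python
import math
import re
from collections import Counter

STOP_WORDS = {"the", "a", "of", "in", "is", "to", "and"}  # illustrative list only

def tokenize(text):
    """Method (2), simple embodiment: cut on spaces and punctuation."""
    return [t for t in re.split(r"[\W_]+", text.lower()) if t]

def top_strings(period_docs, all_docs, n=5):
    """Method (3): drop stop words, score each string by TF x IDF, keep the n best.
    `all_docs` is assumed to include the documents of the period."""
    tf = Counter(
        t for doc in period_docs for t in tokenize(doc) if t not in STOP_WORDS
    )
    df = Counter(t for doc in all_docs for t in set(tokenize(doc)))
    total = len(all_docs)
    scores = {
        t: count * math.log(total / max(df[t], 1)) for t, count in tf.items()
    }
    # Ties are broken alphabetically so that the result is deterministic.
    return sorted(scores, key=lambda t: (-scores[t], t))[:n]

all_docs = [
    "the match is on",
    "goal in the match",
    "goal goal what a goal",
    "weather is fine",
    "traffic in the city",
    "lunch time",
]
print(top_strings(all_docs[1:3], all_docs, n=2))
```

The value n = 5 given in the text is kept as the default; the example prints only the two best strings to stay readable.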

Finally, the documents, associated with string sequences representing a morpheme or group of morphemes, as well as the n most frequent character strings for each identified period, are used by a device (15) implementing the method (4) constructing, for each period, an associated tag (30). This tag (30) is constructed by looking for the character string that includes one or more of the n strings retained by the device (14), which is included in the documents of the period, which is composed of a set of morphemes distinguished by the device (13), and which maximizes the function defined as the sum of the frequencies of all the substrings of characters in all the documents processed by the device (12).
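The construction of the tag (30) can be illustrated by the following sketch. It rests on several assumptions that the text does not settle: a "substring" is taken to be a contiguous sequence of morphemes, a frequency is a number of occurrences in all documents, and the candidates are limited to `max_len` morphemes to keep the search tractable. The function names are hypothetical.

```python
from collections import Counter

def ngrams(tokens, max_len):
    """Yield every contiguous sequence of 1 to max_len tokens."""
    for size in range(1, max_len + 1):
        for i in range(len(tokens) - size + 1):
            yield tuple(tokens[i:i + size])

def build_label(period_docs, all_docs, top_strings, max_len=3):
    """period_docs and all_docs are lists of token lists produced by method (2).
    Returns the candidate that contains at least one retained string and
    maximizes the sum of the frequencies of its distinct sub-sequences."""
    freq = Counter(g for doc in all_docs for g in ngrams(doc, max_len))
    best, best_score = None, -1
    for doc in period_docs:
        for cand in ngrams(doc, max_len):
            if not any(s in cand for s in top_strings):
                continue
            score = sum(freq[sub] for sub in set(ngrams(cand, len(cand))))
            if score > best_score:  # the first candidate wins in case of a tie
                best, best_score = cand, score
    return " ".join(best) if best else None

docs = [
    ["world", "cup", "final", "tonight"],
    ["watching", "the", "world", "cup", "final"],
    ["world", "cup", "final", "goal"],
    ["nice", "weather", "today"],
]
print(build_label(docs[:3], docs, ["final"]))
```

Because the score is a sum of non-negative frequencies, longer candidates tend to score higher, which is why some bound on the candidate length is needed in a sketch of this kind.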

FIG. 2 shows the integration of the various methods of the invention into a wider monitoring device (26). A number of streams are published on the Internet (25) and are captured and stored in a relational database (10). These streams are filtered by a device (21) that selects the messages on a given subject. The documents are then processed by a device (20) implementing a method according to the invention. This device presents to the operator (27) a graph like that shown in FIG. 3 on the terminal (22). This graph shows a number of labels (30) allowing the operator (27) to interpret the peaks and the abnormal or exceptional variations of the flow rate. This operator (27) can then modify the parameters of the filtering device (21) via a feedback loop (24). The device (20) then produces a new curve (34) representing the flow rate of the stream defined by the filter parameters. This new curve has new peaks, increases or abnormal or exceptional variations that the device (20) identifies and for which it produces new labels (30).

The device (20) is also coupled to a notification system that allows the operator (28) to receive an alert on the terminal (23) when the flow rate of the flow has a peak, an increase or an abnormal variation. This alert is associated with a tag (30) that allows the operator (28) to determine the cause of the peak and to decide whether it is necessary to analyze this variation via the terminal (22) or by searching in the digital documents which constitute the stream and which are stored in the database (10).
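An alert of this kind could, for instance, be driven by the periodic-model embodiment of method (1), in which the signal is compared with a model fitted by least squares. The sketch below is only an illustration under assumptions: the model is a constant plus one cosine and one sine term per period (24 hours by default), the normal equations are solved by a small Gaussian elimination, and the function names and threshold are hypothetical.

```python
import math

def design_row(t, periods=(24,)):
    """Model terms at hour t: a constant, then a cosine and a sine per period."""
    row = [1.0]
    for p in periods:
        row += [math.cos(2 * math.pi * t / p), math.sin(2 * math.pi * t / p)]
    return row

def solve(a, b):
    """Gaussian elimination with partial pivoting for a small square system."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x

def fit_periodic(history, periods=(24,)):
    """Least-squares coefficients obtained from the normal equations."""
    rows = [design_row(t, periods) for t in range(len(history))]
    k = len(rows[0])
    ata = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    aty = [sum(r[i] * y for r, y in zip(rows, history)) for i in range(k)]
    return solve(ata, aty)

def abnormal_hours(signal, coeffs, start_t, threshold, periods=(24,)):
    """Offsets at which the signal exceeds the model by more than `threshold`."""
    flagged = []
    for offset, y in enumerate(signal):
        row = design_row(start_t + offset, periods)
        model = sum(c * v for c, v in zip(coeffs, row))
        if y - model > threshold:
            flagged.append(offset)
    return flagged

history = [100 + 50 * math.sin(2 * math.pi * t / 24) for t in range(72)]
coeffs = fit_periodic(history)
today = [100 + 50 * math.sin(2 * math.pi * t / 24) for t in range(72, 96)]
today[10] += 80  # simulated burst
print(abnormal_hours(today, coeffs, start_t=72, threshold=30.0))
```

Each flagged offset would then be labeled as described above before the alert is sent to the terminal (23).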

FIG. 3 represents a graph as generated by a device according to the invention. The signal is represented in the form of a graph with abscissa (32), time, and ordinate, the flow per unit time (33). This signal forms a curve (34) with peaks identified by the method (1) and highlighted by a marker at the local maximum (31). These markers are associated with the labels (30).

In another example of implementation of the invention, the morphemes or groups of morphemes are first extracted from the digital documents, which are stored in a relational database with the associated list of morphemes, before the process (1) identifies the peaks, increases or abnormal or exceptional variations.

In another exemplary implementation of the invention, when the volume of documents is too large to obtain a response within a reasonable time for the operator, the device (13) queries the relational database (10) to retrieve only a uniform pseudo-random sample of the digital documents. In another example, this random sample is biased to favor the periods of peaks and troughs revealed by the device (12). It was found that sampling is justified when the number of digital documents recorded in the relational database (10) and corresponding to the selection of the operator exceeds 10,000. In this case, the sample size is 10,000 documents, independently of the actual volume of documents stored in the database.
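The uniform sampling rule can be sketched as follows. The 10,000 limit comes from the text; the function name `sample_documents` and the optional `seed` parameter are assumptions added for the example, and the biased variant is not covered.

```python
import random

SAMPLE_LIMIT = 10_000  # threshold and sample size reported in the text

def sample_documents(doc_ids, limit=SAMPLE_LIMIT, seed=None):
    """Return every identifier when there are at most `limit` of them,
    otherwise a uniform pseudo-random sample of exactly `limit` identifiers."""
    doc_ids = list(doc_ids)
    if len(doc_ids) <= limit:
        return doc_ids
    return random.Random(seed).sample(doc_ids, limit)

print(len(sample_documents(range(25_000), seed=1)))
print(sample_documents([1, 2, 3]))
```

Capping the sample at a fixed size keeps the cost of methods (2) to (4) bounded however large the selection of the operator becomes.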

In another exemplary implementation of the invention, the relational database (10) is replaced by a buffer memory which may contain a certain number of digital documents and covering a sufficient period with respect to the interrogations of the operator.

In another exemplary embodiment of the invention, the digital documents are multimedia documents, and the method (2) of morphological analysis is composed of a text extraction method by speech recognition or by optical recognition.

In another exemplary embodiment of the invention, the morphological analysis method (2) is coupled to an automatic translation method.

The method and the device according to the invention are particularly intended for community monitoring on social networks.

Claims

1. Method for identifying and labeling the main peaks, increases or abnormal or exceptional variations in the flow rate of a flow of digital documents, stored initially in a database (10), characterized in that it is composed of:

a method (1) making it possible to identify periods where the flow rate of this flow of digital documents varies in an abnormal or exceptional manner, or forms a peak or increases significantly;

a method (2) of morphological analysis making it possible to extract character strings from a digital document and to distinguish, among these character strings, those which correspond to morphemes or groups of morphemes and those which correspond to separators between morphemes or groups of morphemes;

a method (3) making it possible to determine, for each of the periods identified by the method (1), among the character strings extracted by the method (2) from the digital documents of the period, those whose frequencies are the highest for digital documents from the period compared to digital documents outside the period;

a method (4) making it possible to construct, for each period identified by the method (1), a label from all or a sample of the digital documents of the period, cut according to the method (2), and a subset or all of the character strings determined by the method (3).

2. Method according to claim 1 characterized in that the method making it possible to distinguish periods where the flow varies abnormally or exceptionally operates according to a high pass filter based on discrete wavelets.

3. Method according to claim 1 characterized in that the method making it possible to distinguish periods where the flow varies abnormally or exceptionally operates by the calculation of the residual with a periodic or quasi-periodic flow model whose parameters are calculated by the least squares method.

4. Method according to any one of the preceding claims, characterized in that the morphological analysis method consists of a first method of identifying the language of the digital document then a set of methods of separating specialized words for each of the languages processed.

5. Method according to any one of the preceding claims characterized in that the method making it possible to determine the character strings whose frequencies are the highest for digital documents during each period identified by the method (1) operates by first eliminating, from the character strings, those which are present in a list, called a list of stop words or tool words.

6. Method according to any one of the preceding claims characterized in that the method making it possible to determine the character strings extracted by the method (2) whose frequencies are the highest for digital documents during each period identified by the method (1) works by calculating the product "TF-IDF" from the occurrences of the character strings extracted by the method (2) for digital documents in the period versus digital documents outside that period, then selecting the string or strings for which this product is highest.

7. Method according to any one of the preceding claims, characterized in that the method making it possible to construct a label from digital documents and a subset of the character strings resulting from these documents operates by searching for the character string, present in the digital documents and composed of a set of morphemes distinguished by the method (2), which maximizes the function defined as the sum of the frequencies of all character substrings in all digital documents.

8. Device implementing a method according to any one of the preceding claims, characterized in that it presents to the operator a graph of the flow rate and highlights the main peaks, increases or abnormal or exceptional variations in the flow rate and displays, statically or interactively, labels associated with these abnormal or exceptional peaks, increases or variations.

9. Device implementing the method according to any one of claims 1 to 7 characterized in that it is coupled to a configurable filtering system which presents to the operator a graph of the flow rate of the subset of the flow resulting from the filtering, highlighting the main peaks, increases or abnormal or exceptional variations in flow and associating them with labels, and allowing the operator to adjust the filtering to more particularly analyze the flow in relation to these peaks, in order to get more information about these peaks or the rest of the curve, and possibly reveal other peaks.

10. Device implementing a method according to any one of claims 1 to 7 characterized in that it is coupled to an alert or notification system.