EP2686782A1 - Method and device for identifying and labeling peaks, increases, or abnormal or exceptional variations in the throughput of a digital document stream - Google Patents
- Publication number
- EP2686782A1 (application EP12710075.8A)
- Authority
- EP
- European Patent Office
- Prior art keywords
- documents
- flow
- digital documents
- peaks
- digital
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Ceased
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/335—Filtering based on additional data, e.g. user or group profiles
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/34—Browsing; Visualisation therefor
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/93—Document management systems
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/268—Morphological analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F2218/00—Aspects of pattern recognition specially adapted for signal processing
- G06F2218/12—Classification; Matching
- G06F2218/14—Classification; Matching by matching peak patterns
Definitions
- the field of the invention is telecommunications and, in particular, the analysis of digital document flows.
- the invention also applies to the analysis of large masses of digital documents.
- These digital documents may be e-mails, short GSM messages, messages, articles or comments posted on Internet sites, blogs, forums or social networks, instant messages, or any other type of message or digital document posted or published in the form of text, containing text, or analyzable by a device that generates text, such as a voice recognition device.
- These digital documents can be addressed specifically or implicitly to recipients or made public for a community or for everyone.
- These digital documents are associated with one or more dates of publication, sending or modification.
- the present invention relates to a method and a device for identifying and labeling abnormal or exceptional peaks, increases or variations in the throughput of a digital document stream from one or more social networks or a collection of blogs or websites, to alert an operator or produce a synthetic and explanatory view of the evolution of the flow.
- the general problem is to produce a synthetic and explanatory view of the evolution of the flow rate of a stream of digital documents, or to alert an operator by indicating the main subject or subjects of the anomalies or abnormal or exceptional variations in this flow rate.
- Simple methods compare the flow at a given moment with the average flow over a longer period.
- More advanced methods rely on transformations, for example discrete wavelet transforms, as described by A. Haar in the article "Zur Theorie der orthogonalen Funktionensysteme", published in Mathematische Annalen 69, no. 3, pages 331-371, in 1910. There is abundant literature on the detection of peaks from such transformations, such as international patent application WO 2010/007486, or of anomalies, such as the communication by C. T. Huang et al. entitled "Wavelet-based Real Time Detection of Network Traffic Anomalies" in Securecomm and Workshops, published in 2006 by IEEE.
- This technical equipment generates huge amounts of digital data, ranging from tens of thousands to millions of messages or articles per day.
- the known means make it possible to construct a curve of the information rate over time, either for the whole of a selected stream or for a selection corresponding to predetermined specifications (for example, the flow of documents that contain a given word or combination of terms).
- the technical problem that arises is that of processing large volumes of data in real time in order to carry out analyses that explain changes in volume or flow rate indicative of external events.
- the method according to the invention overcomes the disadvantages of traditional methods.
- the invention proposes a technical method, executed by a computer, comprising a succession of processing steps:
- the first step identifies the changes in regime in the flow. It involves determining one or more time intervals and ordering the recording of the digital documents corresponding to these intervals for subsequent processing;
- the second step concerns a technical processing consisting in extracting sequences of characters from the documents thus isolated, by cutting the texts into strings of characters and recording in another memory zone the strings of characters thus identified;
- the third step is to create an index of strings extracted from the second step, to associate the relevant documents and a quantitative indicator that measures the importance of a string of characters in these documents relative to the flow and then to determine the most important strings in relation to this quantitative indicator;
- the last step is to provide a label constructed from the documents associated with the strings identified in the third step.
- the invention comprises, according to a first characteristic:
- the first method operates as a high-pass filter based on wavelets.
- the documents are counted per time unit (hour, day), and the sequence thus determined forms a signal that is filtered by eliminating the coefficients of the discrete wavelet decomposition whose absolute value is below a certain threshold.
- the distinguished periods are defined as periods during which the signal recomposed after filtering has a strictly positive value. Compared with the naive approach, obvious to the skilled person, of comparing the number of documents per unit time with the average, this approach has the double advantage of identifying peaks or exceptional increases even when the average flow is high but the recent flow is lower than average, and of delimiting peak periods more precisely than a simple test of exceeding the average;
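The wavelet filtering described above can be sketched as follows. This is only an illustration under stated assumptions, not the patented implementation: it uses the Haar wavelet, requires a signal whose length is a power of two, and drops the coarsest approximation coefficient so that the recomposed signal is centred on zero (one plausible reading of "high-pass"). The function names and the threshold value are hypothetical.

```python
import math

def haar_forward(signal):
    """Full orthonormal Haar decomposition; len(signal) must be a power of two."""
    approx = list(signal)
    details = []  # finest level first
    while len(approx) > 1:
        pairs = list(zip(approx[0::2], approx[1::2]))
        details.append([(a - b) / math.sqrt(2) for a, b in pairs])
        approx = [(a + b) / math.sqrt(2) for a, b in pairs]
    return approx[0], details

def haar_inverse(approx0, details):
    """Rebuild a signal from the coarsest approximation and the detail levels."""
    approx = [approx0]
    for level in reversed(details):  # coarsest level first
        rebuilt = []
        for a, d in zip(approx, level):
            rebuilt.append((a + d) / math.sqrt(2))
            rebuilt.append((a - d) / math.sqrt(2))
        approx = rebuilt
    return approx

def wavelet_peak_periods(counts, threshold):
    """Indices of time units where the filtered signal is strictly positive."""
    _, details = haar_forward(counts)
    # Eliminate detail coefficients that are small in absolute value.
    kept = [[d if abs(d) >= threshold else 0.0 for d in level] for level in details]
    # Assumption: the approximation (mean) term is dropped, hence 0.0 here.
    filtered = haar_inverse(0.0, kept)
    return [i for i, value in enumerate(filtered) if value > 1e-9]

# Hourly document counts with one burst in the sixth hour.
hourly_counts = [10, 12, 11, 9, 10, 60, 11, 10]
print(wavelet_peak_periods(hourly_counts, threshold=10))
```

In practice the threshold would have to be tuned to the stream, and a signal of arbitrary length would need padding or a different wavelet library.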
- the first method works by comparing the signal with a periodic or quasi-periodic model.
- a periodic or quasi-periodic model is established a priori, for example as a linear combination of several periodic functions with periods of 24 hours or 7 days.
- the model coefficients are obtained by the least squares method from the historical data.
- Distinguished periods are defined as the periods during which the difference between the signal and the model is greater than a certain threshold.
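A minimal sketch of the periodic-model variant, assuming hourly counts and a model made of a constant plus one sine and one cosine term for each of the 24-hour and 7-day periods. The choice of basis functions, the normal-equation solver and all names below are illustrative assumptions; the text only specifies a linear combination of periodic functions fitted by least squares.

```python
import math

PERIODS_HOURS = (24.0, 168.0)  # daily and weekly cycles

def features(t):
    """Model basis at hour t: constant, then cos/sin for each period."""
    row = [1.0]
    for period in PERIODS_HOURS:
        angle = 2 * math.pi * t / period
        row += [math.cos(angle), math.sin(angle)]
    return row

def solve(a, b):
    """Gauss-Jordan elimination with partial pivoting for a small dense system."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def fit(history):
    """Least-squares coefficients from historical hourly counts (normal equations)."""
    rows = [features(t) for t in range(len(history))]
    k = len(rows[0])
    ata = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    aty = [sum(r[i] * y for r, y in zip(rows, history)) for i in range(k)]
    return solve(ata, aty)

def predict(coeffs, t):
    return sum(c * f for c, f in zip(coeffs, features(t)))

def model_peak_periods(signal, coeffs, threshold, start=0):
    """Hours where the signal exceeds the model by more than the threshold."""
    return [start + i for i, y in enumerate(signal)
            if y - predict(coeffs, start + i) > threshold]
```

The history should span several full weeks so that the daily and weekly terms can be separated; with much less data the normal equations become ill-conditioned.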
- the second method splits the digital documents at spaces and punctuation.
- This approach has the advantage of being very simple and easy to implement.
- the splitting thus produced does not correspond to a very precise morphological analysis but is sufficient, in the context of the invention, to obtain labels for each of the peaks, increases or abnormal or exceptional variations in the flow rate;
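Splitting at spaces and punctuation can be illustrated in a few lines. The regular expression and the lower-casing step are assumptions made for this sketch; the text does not say how case, digits or apostrophes are handled.

```python
import re

# Runs of word characters; anything else (spaces, punctuation) is a separator.
TOKEN_PATTERN = re.compile(r"\w+")

def split_on_spaces_and_punctuation(text):
    """Return the character strings of a document, lower-cased (an assumption)."""
    return [token.lower() for token in TOKEN_PATTERN.findall(text)]

print(split_on_spaces_and_punctuation("Hello, world! It's 5pm."))
# → ['hello', 'world', 'it', 's', '5pm']
```

As the output shows, contractions are cut at the apostrophe, which is the kind of imprecision the text accepts for the purpose of labeling.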
- the second method splits the digital documents according to a segmentation model based on statistical data, grammatical rules, a dictionary or hidden Markov models.
- This approach has the advantage of being able to extract strings from digital documents written in languages where words are generally not separated by spaces or punctuation, such as Japanese, Chinese or Thai;
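Of the segmentation models listed above, the dictionary-based one is the simplest to illustrate. The sketch below uses greedy forward maximum matching, which is one common dictionary technique and is swapped in here as an assumption; the text does not name a specific algorithm. The tiny dictionary is hypothetical.

```python
def segment_max_match(text, dictionary, max_len=None):
    """Greedy longest-match segmentation; unknown characters become single tokens."""
    if max_len is None:
        max_len = max(map(len, dictionary), default=1)
    tokens, i = [], 0
    while i < len(text):
        # Try the longest candidate first, falling back to one character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in dictionary:
                tokens.append(piece)
                i += size
                break
    return tokens

# Hypothetical toy dictionary for a Japanese sentence.
toy_dictionary = {"東京", "東京都", "に", "住む"}
print(segment_max_match("東京都に住む", toy_dictionary))
# → ['東京都', 'に', '住む']
```

Greedy matching can pick the wrong boundary when a long dictionary entry overlaps the next word, which is one reason statistical or hidden Markov models are often preferred.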
- the second method consists of a first step of identifying the language of the digital document, followed by a word-separation method specialized for each of the languages processed.
- the third method works by eliminating, from the character strings determined by the second method, those that appear in a list of stop words (empty words or function words). This approach has the advantage of avoiding constructing labels from stop words;
- the third method works by calculating the "TF-IDF" product for the occurrences of the character strings extracted by the second method, then selecting the string or strings for which this product is the highest;
- the fourth method works by searching for the character string, composed of a set of morphemes distinguished by the second method and present in the digital documents, that maximizes a function defined as the sum of the frequencies of all the character substrings of this string in all the digital documents;
- the process as a whole is implemented in a device which presents the operator with a graph of the flow rate and highlights the main peaks, increases or abnormal or exceptional variations in the flow rate and displays, statically or interactively, labels associated with these abnormal or exceptional peaks, increases or variations;
- the method as a whole is implemented in a device coupled with a parameterizable filtering system which presents to the operator a graph of the flow rate of a subset of the analyzed flow, highlights the main peaks, increases or abnormal or exceptional variations in the flow rate, and associates them with labels.
- This device advantageously allows the operator to adjust the filtering to analyze more particularly the flow rate with respect to these peaks, to obtain more information on these peaks or the rest of the curve, and possibly reveal other peaks;
- FIG. 2 represents a device which presents to the operator a graph of the flow rate by highlighting the main peaks, increases or abnormal or exceptional variations in the flow rate and which is coupled to a notification system;
- FIG. 1 represents the composition of the various processes and the flow (11) of digital documents through a device according to the invention.
- the digital documents are initially stored in alphanumeric form in a table of a relational database (10).
- Each digital document is stored on a line comprising a column with the text of the document, and a column with the date of publication of the document if it exists, or the date on which the document was retrieved, otherwise.
- the relational database is configured to index the column of the date with an ordered index, for example in the form of a tree of type B-Tree.
- When the operator (27) queries the monitoring device (26), the device (12) first queries the relational database, using the aforementioned index, to count, for each period of time (hour or day), the number of documents stored in the database over a window chosen by the operator.
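The storage layout and the counting query can be sketched with SQLite, whose indexes are B-trees. The table name, column names and sample rows are assumptions made for illustration; the text only requires a text column, a date column and an ordered index on the date.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# One row per document: its text and its publication (or retrieval) date.
conn.execute("CREATE TABLE documents (body TEXT NOT NULL, published_at TEXT NOT NULL)")
# Ordered index on the date column, used by the range filter below.
conn.execute("CREATE INDEX idx_documents_date ON documents (published_at)")
conn.executemany(
    "INSERT INTO documents VALUES (?, ?)",
    [
        ("first message", "2011-03-18 09:05:00"),
        ("second message", "2011-03-18 09:40:00"),
        ("third message", "2011-03-18 10:15:00"),
        ("outside the window", "2011-03-19 08:00:00"),
    ],
)

def flow_per_hour(connection, start, end):
    """Count documents per hour over the window [start, end)."""
    query = """
        SELECT strftime('%Y-%m-%d %H:00', published_at) AS hour, COUNT(*)
        FROM documents
        WHERE published_at >= ? AND published_at < ?
        GROUP BY hour
        ORDER BY hour
    """
    return connection.execute(query, (start, end)).fetchall()

print(flow_per_hour(conn, "2011-03-18 00:00:00", "2011-03-19 00:00:00"))
# → [('2011-03-18 09:00', 2), ('2011-03-18 10:00', 1)]
```

Hours without any document do not appear in the result, so a caller drawing the curve would need to fill them with zeros.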
- This information makes it possible to draw the flow rate curve of the documents on the terminal (22), an example of which is shown in FIG. 3.
- This curve can synthesize a very large mass of documents. This curve can be refreshed in real time when new documents are stored in the relational database (10).
- the device (12) implements the method (1) to identify periods of peaks, increases or abnormal or exceptional variations. These peak periods can be highlighted by a marker (31) at the local maximum on the interface of the terminal (22).
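Placing a marker at the local maximum of each identified period can be sketched as below. The sketch assumes that the method (1) returns the indices of the distinguished time units and that consecutive indices belong to the same period; the function name is hypothetical.

```python
def peak_markers(counts, distinguished):
    """Index of the highest count within each run of consecutive distinguished indices."""
    markers, run = [], []
    for i in sorted(distinguished):
        if run and i != run[-1] + 1:
            # The run has ended: keep the position of its maximum.
            markers.append(max(run, key=counts.__getitem__))
            run = []
        run.append(i)
    if run:
        markers.append(max(run, key=counts.__getitem__))
    return markers

counts = [3, 4, 20, 35, 18, 5, 4, 40, 6]
print(peak_markers(counts, [2, 3, 4, 7]))
# → [3, 7]
```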
- the device (13) queries the relational database by using the aforementioned index to implement the method (2) in order to associate, with each document, a sequence of character strings representing a morpheme or a group of morphemes.
- the documents, associated with these string sequences, and the identified periods are then used by a device (14) implementing the method (3) to determine the most frequent character strings in each period identified with respect to the set of documents.
- This method (3) works by first eliminating words that are part of stop word lists, then for each of the character strings, the device calculates the product called "TF-IDF" and retains the n strings for which this product is the highest, n being a parameter of the process whose value can for example be 5.
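The stop-word filtering and TF-IDF selection can be sketched as follows. The exact weighting is an assumption: term frequency is computed over the documents of the period, and inverse document frequency as the logarithm of the total number of documents divided by the number of documents containing the string. The function name and the sample data are hypothetical.

```python
import math
from collections import Counter

def top_strings(period_docs, all_docs, stop_words, n=5):
    """Return the n (string, score) pairs with the highest TF-IDF product.

    period_docs and all_docs are lists of token lists; the period documents
    are assumed to be part of all_docs, so every string has a document count.
    """
    doc_freq = Counter()
    for doc in all_docs:
        doc_freq.update(set(doc))  # count each string once per document
    term_freq = Counter(token for doc in period_docs for token in doc
                        if token not in stop_words)
    total = sum(term_freq.values())
    scores = {token: (count / total) * math.log(len(all_docs) / doc_freq[token])
              for token, count in term_freq.items()}
    # Highest score first; ties are broken alphabetically for determinism.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))[:n]

all_docs = [
    ["the", "match", "starts"],
    ["the", "weather", "is", "nice"],
    ["goal", "for", "paris"],
    ["paris", "scores", "a", "goal"],
    ["the", "goal", "of", "paris"],
    ["the", "cat"],
]
period_docs = all_docs[2:5]
stop_words = {"the", "for", "a", "of", "is"}
print([token for token, _ in top_strings(period_docs, all_docs, stop_words, n=2)])
# → ['goal', 'paris']
```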
- the documents, associated with string sequences representing a morpheme or group of morphemes, as well as the n most frequent character strings for each identified period, are used by a device (15) implementing the method (4) constructing, for each period, an associated tag (30).
- This tag (30) is constructed by looking for the character string that includes one or more of the n strings retained by the device (14), which is included in the documents of the period, which is composed of a set of morphemes distinguished by the device (13), and which maximizes the function defined as the sum of the frequencies of all the substrings of characters in all the documents processed by the device (12).
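One possible reading of the tag construction is sketched below. It treats a candidate as a contiguous sequence of morphemes found in a document of the period, requires it to contain at least one retained string, and scores it by summing the raw occurrence counts of all its contiguous sub-sequences over all documents. The length bound `max_len` is an added assumption: because the sum can only grow as a candidate gets longer, an unbounded search would favor whole documents. Joining with spaces is also an assumption that only suits space-separated languages.

```python
from collections import Counter

def ngrams(tokens, max_len):
    """All contiguous token sequences of length 1..max_len, as tuples."""
    for size in range(1, max_len + 1):
        for start in range(len(tokens) - size + 1):
            yield tuple(tokens[start:start + size])

def build_label(period_docs, all_docs, retained, max_len=3):
    """Pick the candidate sequence that maximizes the sum of sub-sequence counts."""
    freq = Counter(g for doc in all_docs for g in ngrams(doc, max_len))

    def score(candidate):
        return sum(freq[g] for g in ngrams(candidate, len(candidate)))

    candidates = {g for doc in period_docs for g in ngrams(doc, max_len)
                  if any(term in g for term in retained)}
    if not candidates:
        return None
    # The candidate itself is a secondary key so that ties resolve deterministically.
    best = max(candidates, key=lambda c: (score(c), c))
    return " ".join(best)

all_docs = [
    ["goal", "for", "paris"],
    ["paris", "scores", "a", "goal"],
    ["late", "goal", "for", "paris"],
    ["the", "weather", "is", "nice"],
]
print(build_label(all_docs[:3], all_docs, ["goal", "paris"]))
# → goal for paris
```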
- FIG. 2 shows the integration of the various methods of the invention into a wider monitoring device (26).
- a number of streams are published on the Internet (25) and are captured and stored in a relational database (10). These streams are filtered by a device (21) that determines the messages on a given subject.
- the documents are then processed by a device (20) implementing a method according to the invention.
- This device presents to the operator (27) a graph like that shown in FIG. 3 on the terminal (22). This graph shows a number of labels (30) allowing the operator (27) to interpret the abnormal or exceptional peaks and variations of the flow rate.
- This operator (27) can then modify the parameters of the filtering device (21) via a feedback loop (24).
- the device (20) then produces a new curve (34) representing the flow rate of the stream defined by the filter parameters. This new curve has new peaks, increases or abnormal or exceptional variations that the device (20) identifies and for which it produces new labels (30).
- the device (20) is also coupled to a notification system that allows the operator (28) to receive an alert on the terminal (23) when the flow rate of the stream exhibits a peak, an increase or an abnormal variation.
- This alert is associated with a tag (30) that allows the operator (28) to determine the cause of the peak and to decide whether it is necessary to analyze this variation via the terminal (22) or by searching in the digital documents which constitute the stream and which are stored in the database (10).
- FIG. 3 represents a graph as generated by a device according to the invention.
- the signal is represented in the form of a graph with time on the abscissa (32) and the flow per unit time on the ordinate (33).
- This signal forms a curve (34) with peaks identified by the method (1) and highlighted by a marker at the local maximum (31). These markers are associated with the labels (30).
- the morphemes or groups of morphemes are first extracted from the digital documents, which are stored in a relational database with the associated list of morphemes, before the process (1) identifies abnormal or exceptional peaks, increases or variations.
- when the volume of documents is too large to obtain a response within a reasonable time for the operator, the device (13) queries the relational database (10) to retrieve only a uniform pseudo-random sample of digital documents.
- this random sample is biased to favor the peak and trough periods revealed by the device (12). It was found that sampling is justified when the number of digital documents recorded in the relational database (10) and corresponding to the selection of the operator exceeds 10,000. In this case, the sample size is 10,000, regardless of the actual volume of documents saved in the database.
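The two sampling variants can be sketched together. The uniform case uses the standard library directly; for the biased case, the sketch swaps in weighted sampling without replacement based on random keys (the Efraimidis-Spirakis scheme), which is an assumption, as are the `boost` factor and the predicate that tells whether a document falls in a notable period.

```python
import heapq
import random

SAMPLE_SIZE = 10_000  # size given in the text

def sample_documents(doc_ids, in_notable_period=None, boost=3.0, rng=random):
    """Return at most SAMPLE_SIZE identifiers from the list doc_ids.

    Without a predicate the sample is uniform; with one, documents in peak or
    trough periods are `boost` times more likely to be kept (hypothetical bias).
    """
    if len(doc_ids) <= SAMPLE_SIZE:
        return list(doc_ids)  # small selections are processed in full
    if in_notable_period is None:
        return rng.sample(doc_ids, SAMPLE_SIZE)

    def key(doc_id):
        weight = boost if in_notable_period(doc_id) else 1.0
        # Random key u ** (1 / w): a larger weight pushes the key toward 1.
        return rng.random() ** (1.0 / weight)

    return heapq.nlargest(SAMPLE_SIZE, doc_ids, key=key)
```

A usage example: `sample_documents(ids, in_notable_period=lambda i: i in peak_ids)` would return 10,000 identifiers in which documents from `peak_ids` (a hypothetical set) are over-represented.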
- the relational database (10) is replaced by a buffer memory that can hold a certain number of digital documents covering a sufficient period with respect to the operator's queries.
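A buffer memory of this kind can be sketched with a bounded double-ended queue. The class name, the eviction policy (capacity plus a retention period) and the assumption that documents arrive in chronological order are all illustrative choices, not details given in the text.

```python
from collections import deque
from datetime import datetime, timedelta

class DocumentBuffer:
    """In-memory store bounded both in document count and in age."""

    def __init__(self, max_documents, retention):
        # deque(maxlen=...) silently drops the oldest entry when full.
        self._docs = deque(maxlen=max_documents)
        self._retention = retention

    def add(self, published_at, text):
        """Append a document; documents are assumed to arrive in date order."""
        self._docs.append((published_at, text))
        cutoff = published_at - self._retention
        while self._docs and self._docs[0][0] < cutoff:
            self._docs.popleft()

    def window(self, start, end):
        """Documents whose date falls in [start, end)."""
        return [(date, text) for date, text in self._docs if start <= date < end]

buffer = DocumentBuffer(max_documents=3, retention=timedelta(hours=2))
for hour, minute, text in [(9, 0, "a"), (9, 30, "b"), (10, 0, "c"), (10, 30, "d")]:
    buffer.add(datetime(2011, 3, 18, hour, minute), text)
print([text for _, text in buffer.window(datetime(2011, 3, 18, 9), datetime(2011, 3, 18, 11))])
# → ['b', 'c', 'd']
```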
- the digital documents are multimedia documents
- the method (2) of morphological analysis is composed of a text extraction method by speech recognition or by optical recognition.
- the morphological analysis method (2) is coupled to an automatic translation method.
- the method and the device according to the invention are particularly intended for community monitoring on social networks.
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Databases & Information Systems (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- Computational Linguistics (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Audiology, Speech & Language Pathology (AREA)
- General Health & Medical Sciences (AREA)
- Business, Economics & Management (AREA)
- General Business, Economics & Management (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
FR1100817A FR2972822A1 (fr) | 2011-03-18 | 2011-03-18 | Method and device for identifying and labeling peaks, increases, or abnormal or exceptional variations in the throughput of a stream of digital documents |
PCT/EP2012/054666 WO2012146440A1 (fr) | 2011-03-18 | 2012-03-16 | Method and device for identifying and labeling peaks, increases, or abnormal or exceptional variations in the throughput of a stream of digital documents |
Publications (1)
Publication Number | Publication Date |
---|---|
EP2686782A1 (de) | 2014-01-22 |
Family
ID=45875953
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
EP12710075.8A Ceased EP2686782A1 (de) | 2012-03-16 | 2012-03-16 | Method and device for identifying and labeling peaks, increases, or abnormal or exceptional variations in the throughput of a digital document stream |
Country Status (4)
Country | Link |
---|---|
US (1) | US20150205862A1 (de) |
EP (1) | EP2686782A1 (de) |
FR (1) | FR2972822A1 (de) |
WO (1) | WO2012146440A1 (de) |
Families Citing this family (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9703827B2 (en) * | 2014-07-17 | 2017-07-11 | Illumina Consulting Group, Inc. | Methods and apparatus for performing real-time analytics based on multiple types of streamed data |
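CN109840471B (zh) * | 2018-12-14 | 2023-04-14 | 天津大学 | Feasible road segmentation method based on an improved Unet network model |
CN110348471B (zh) * | 2019-05-23 | 2023-09-01 | 平安科技(深圳)有限公司 | Abnormal object identification method, apparatus, medium and electronic device |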
US11086948B2 (en) | 2019-08-22 | 2021-08-10 | Yandex Europe Ag | Method and system for determining abnormal crowd-sourced label |
US11710137B2 (en) | 2019-08-23 | 2023-07-25 | Yandex Europe Ag | Method and system for identifying electronic devices of genuine customers of organizations |
US11108802B2 (en) | 2019-09-05 | 2021-08-31 | Yandex Europe Ag | Method of and system for identifying abnormal site visits |
RU2757007C2 (ru) | 2019-09-05 | 2021-10-08 | Общество С Ограниченной Ответственностью «Яндекс» | Method and system for detecting malicious actions of a certain type |
US11128645B2 (en) | 2019-09-09 | 2021-09-21 | Yandex Europe Ag | Method and system for detecting fraudulent access to web resource |
US11334559B2 (en) | 2019-09-09 | 2022-05-17 | Yandex Europe Ag | Method of and system for identifying abnormal rating activity |
RU2752241C2 (ru) | 2019-12-25 | 2021-07-23 | Общество С Ограниченной Ответственностью «Яндекс» | Method and system for identifying malicious activity of a predetermined type in a local network |
Family Cites Families (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2897942B2 (ja) | 1992-07-20 | 1999-05-31 | 株式会社シーエスケイ | Japanese morphological analysis system and morphological analysis method |
US7245769B2 (en) * | 2002-02-12 | 2007-07-17 | Visioprime | Archival of transformed and compressed data |
JP4342575B2 (ja) * | 2007-06-25 | 2009-10-14 | 株式会社東芝 | Device, method, and program for keyword presentation |
JP5078674B2 (ja) * | 2008-02-29 | 2012-11-21 | インターナショナル・ビジネス・マシーンズ・コーポレーション | Analysis system, information processing device, activity analysis method, and program |
US8226568B2 (en) * | 2008-07-15 | 2012-07-24 | Nellcor Puritan Bennett Llc | Signal processing systems and methods using basis functions and wavelet transforms |
-
2011
- 2011-03-18 FR FR1100817A patent/FR2972822A1/fr not_active Withdrawn
-
2012
- 2012-03-16 WO PCT/EP2012/054666 patent/WO2012146440A1/fr active Application Filing
- 2012-03-16 US US14/005,803 patent/US20150205862A1/en not_active Abandoned
- 2012-03-16 EP EP12710075.8A patent/EP2686782A1/de not_active Ceased
Non-Patent Citations (2)
Title |
---|
None * |
See also references of WO2012146440A1 * |
Also Published As
Publication number | Publication date |
---|---|
FR2972822A1 (fr) | 2012-09-21 |
US20150205862A1 (en) | 2015-07-23 |
WO2012146440A1 (fr) | 2012-11-01 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
EP2686782A1 (de) | Method and device for identifying and labeling peaks, increases, or abnormal or exceptional variations in the throughput of a digital document stream | |
Nguyen et al. | Automatic image filtering on social networks using deep learning and perceptual hashing during crises | |
US7577963B2 (en) | Event data translation system | |
US20140337328A1 (en) | System and method for retrieving and presenting concept centric information in social media networks | |
Weiler et al. | Event identification and tracking in social media streaming data | |
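BE1025503B1 (fr) | Line segmentation method | |
CN111581956B (zh) | Sensitive information identification method and system based on a BERT model and k-nearest neighbors | |
CN106844638B (zh) | Information retrieval method, apparatus and electronic device | |
CN114915468B (zh) | Knowledge-graph-based intelligent analysis and detection method for cybercrime | |
CN116756688A (zh) | Public opinion risk discovery method based on a multimodal fusion algorithm | |
KR20130037975A (ko) | Method and apparatus for web trend analysis based on issue template extraction | |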
CA3182733A1 (en) | Vector space model for form data extraction | |
EP2013776A1 (de) | Method for fast de-duplication of a set of documents or a set of data contained in a file | |
CN117173608A (zh) | Video content review method and system | |
FR2929426A1 (fr) | Score assignment method and system | |
Hisham et al. | An innovative approach for fake news detection using machine learning | |
US20190370531A1 (en) | Data processing apparatus, data processing method, and non-transitory storage medium | |
Sumathi et al. | Fake review detection of e-commerce electronic products using machine learning techniques | |
Zendah et al. | Detecting Significant Events in Arabic Microblogs using Soft Frequent Pattern Mining. | |
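CN116723005A (zh) | Method and system for tracking implicit intelligence of malicious code under polymorphic hiding | |
CN116401434A (zh) | Intelligent extraction system for network data information | |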
Khan et al. | Object analysis in image mining | |
CN108052503B (zh) | 一种置信度的计算方法及装置 | |
Prasad et al. | Face-Based Alumni Tracking on Social Media Using Deep Learning | |
Hurst | Temporal Text Mining. |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PUAI | Public reference made under article 153(3) epc to a published international application that has entered the european phase |
Free format text: ORIGINAL CODE: 0009012 |
|
17P | Request for examination filed |
Effective date: 20131017 |
|
AK | Designated contracting states |
Kind code of ref document: A1 Designated state(s): AL AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HR HU IE IS IT LI LT LU LV MC MK MT NL NO PL PT RO RS SE SI SK SM TR |
|
DAX | Request for extension of the european patent (deleted) | ||
17Q | First examination report despatched |
Effective date: 20170608 |
|
RAP1 | Party data changed (applicant data changed or rights of an application transferred) |
Owner name: SEMIOCAST |
|
RIN1 | Information on inventor provided before grant (corrected) |
Inventor name: GUYOT, PAUL Inventor name: JULIEN, DAVID Inventor name: CAMPAGNE, JEAN-CHARLES |
|
REG | Reference to a national code |
Ref country code: DE Ref legal event code: R003 |
|
STAA | Information on the status of an ep patent application or granted ep patent |
Free format text: STATUS: THE APPLICATION HAS BEEN REFUSED |
|
18R | Application refused |
Effective date: 20190530 |