EP2686782A1 - Method and device for identifying and labeling peaks, increases or abnormal or exceptional variations in the throughput of a digital document stream - Google Patents


Info

Publication number
EP2686782A1
Authority
EP
European Patent Office
Prior art keywords
documents
flow
digital documents
peaks
digital
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Ceased
Application number
EP12710075.8A
Other languages
English (en)
French (fr)
Inventor
Jean-Charles Campagne
Paul Guyot
David JULIEN
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Semiocast
Original Assignee
Semiocast
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Semiocast

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/335 Filtering based on additional data, e.g. user or group profiles
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/34 Browsing; Visualisation therefor
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/93 Document management systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/268 Morphological analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F2218/00 Aspects of pattern recognition specially adapted for signal processing
    • G06F2218/12 Classification; Matching
    • G06F2218/14 Classification; Matching by matching peak patterns

Definitions

  • The field of the invention is telecommunications and, in particular, the analysis of digital document flows.
  • The invention also applies to the analysis of large masses of digital documents.
  • These digital documents may be e-mails, short GSM messages, messages, articles or comments posted on Internet sites, blogs, forums or social networks, instant messages, or any other type of message or digital document posted or published in the form of text, containing text, or analyzable by a device that generates text, such as a voice recognition device.
  • These digital documents can be addressed specifically or implicitly to recipients, or made public for a community or for everyone.
  • These digital documents are associated with one or more dates of publication, sending or modification.
  • The present invention relates to a method and a device for identifying and labeling peaks, increases or abnormal or exceptional variations in the throughput of a digital document stream from one or more social networks or a collection of blogs or websites, in order to alert an operator or to produce a synthetic and explanatory view of the evolution of the flow.
  • The general problem is to produce a synthetic and explanatory view of the evolution of the flow rate of a stream of digital documents, or to alert an operator by indicating the main subject or subjects of the peaks or of the abnormal or exceptional variations in this flow rate.
  • [...] the flow at a given moment with the average flow over a longer period.
  • More advanced methods rely on transforms, for example discrete wavelet transforms, as described by A. Haar in the article "Zur Theorie der orthogonalen Funktionensysteme", published in Mathematische Annalen 69 (1910), no. 3, pages 331-371. There is abundant literature on the detection of peaks from such transforms, such as international patent application WO 2010/007486, or of anomalies, such as the paper by C.-T. Huang et al. entitled "Wavelet-based Real Time Detection of Network Traffic Anomalies", in Securecomm and Workshops, published in 2006 by IEEE.
  • This technical equipment generates huge amounts of digital data, ranging from tens of thousands to millions of messages or articles per day.
  • The known means make it possible to construct a curve of the evolution of the information rate over time, either for the whole of a selected stream or for a selection corresponding to predetermined specifications (for example, the flow of documents that contain a given word or combination of terms).
  • The technical problem that arises is that of the capacity to process large volumes of data in real time, in order to carry out analyses that explain changes in volume or flow rate indicative of external events.
  • The method according to the invention overcomes the disadvantages of traditional methods.
  • The invention proposes a technical method, executed by a computer, comprising a succession of processing steps:
  • The first step identifies the changes of regime in the flow. This step involves determining one or more time intervals and recording the digital documents corresponding to these intervals for subsequent processing;
  • The second step is a technical processing that extracts sequences of characters from the documents thus isolated, by cutting the texts into character strings and recording the strings thus identified in another memory zone;
  • The third step creates an index of the strings extracted in the second step, associates with each string the relevant documents and a quantitative indicator that measures the importance of the string in these documents relative to the flow, and then determines the most important strings according to this indicator;
  • The last step provides a label constructed from the documents associated with the strings identified in the third step.
  • The invention comprises, according to a first characteristic:
  • The first method operates as a high-pass filter based on wavelets.
  • The documents are counted per unit of time (hour, day), and the sequence thus obtained forms a signal that is filtered by eliminating the coefficients of the discrete wavelet decomposition whose absolute value is below a certain threshold.
  • The distinguished periods are defined as the periods during which the signal recomposed after filtering has a strictly positive value. Compared with the naive approach, obvious to the skilled person, of comparing the number of documents per unit of time with the average, this approach has the double advantage of identifying peaks or exceptional increases even when the average flow is high but the recent flow is lower than average, and of delimiting peak periods more precisely than a simple test of exceeding the average;
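The wavelet-based filtering described above can be illustrated with a minimal sketch. It assumes a Haar wavelet, a signal whose length is a power of two, and the removal of the final approximation coefficient so that the filter behaves as a high-pass filter; none of these choices is specified in the text, and the names `haar_dwt`, `haar_idwt` and `peak_mask` are hypothetical.

```python
import math

SQRT2 = math.sqrt(2.0)

def haar_dwt(signal):
    """Full Haar decomposition; returns (approximation, details), coarsest level first."""
    approx = [float(x) for x in signal]
    details = []
    while len(approx) > 1:
        pairs = list(zip(approx[0::2], approx[1::2]))
        details.append([(a - b) / SQRT2 for a, b in pairs])
        approx = [(a + b) / SQRT2 for a, b in pairs]
    return approx[0], details[::-1]

def haar_idwt(approx, details):
    """Inverse of haar_dwt."""
    current = [approx]
    for level in details:  # coarsest level first
        rebuilt = []
        for a, d in zip(current, level):
            rebuilt.append((a + d) / SQRT2)
            rebuilt.append((a - d) / SQRT2)
        current = rebuilt
    return current

def peak_mask(counts, threshold, eps=1e-9):
    """Flag the time units where the filtered, recomposed signal is strictly positive."""
    n = len(counts)
    if n == 0 or n & (n - 1):
        raise ValueError("this sketch expects a length that is a power of two")
    _, details = haar_dwt(counts)
    # Eliminate the detail coefficients whose absolute value is below the threshold.
    kept = [[d if abs(d) >= threshold else 0.0 for d in level] for level in details]
    # Assumption: the approximation coefficient (overall level) is dropped (high-pass).
    filtered = haar_idwt(0.0, kept)
    # eps absorbs floating-point rounding around zero.
    return [value > eps for value in filtered]
```

For hourly counts such as `[10, 10, 10, 10, 10, 50, 10, 10]` and a threshold of 15, only the sixth hour is flagged. A production system would probably pad the signal and tune the threshold on historical data.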
  • The first method works by comparing the signal with a periodic or quasi-periodic model.
  • A periodic or quasi-periodic model is established a priori, for example as a linear combination of several periodic functions with periods of 24 hours or 7 days.
  • The model coefficients are obtained by the least squares method from the historical data.
  • Distinguished periods are defined as the periods during which the difference between the signal and the model is greater than a certain threshold.
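As an illustration of the periodic-model variant, the sketch below fits a constant plus one sine and cosine pair for each of the 24-hour and 168-hour periods, using the normal equations of the least squares problem. The choice of basis functions, the hourly time unit and all function names are assumptions made for the example; the text only requires a linear combination of periodic functions fitted by least squares.

```python
import math

def design_row(t_hours, periods=(24.0, 168.0)):
    """Basis functions evaluated at one hour: constant, then cos/sin per period."""
    row = [1.0]
    for period in periods:
        angle = 2.0 * math.pi * t_hours / period
        row += [math.cos(angle), math.sin(angle)]
    return row

def solve(matrix, rhs):
    """Gaussian elimination with partial pivoting for a small, non-singular system."""
    n = len(rhs)
    aug = [row[:] + [b] for row, b in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x

def fit_periodic_model(history):
    """Least squares coefficients from hourly historical counts (hour 0, 1, 2, ...)."""
    rows = [design_row(t) for t in range(len(history))]
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, history)) for i in range(k)]
    return solve(xtx, xty)

def distinguished_hours(signal, coeffs, threshold, start=0):
    """Hours where the signal exceeds the model by more than the threshold."""
    flagged = []
    for offset, observed in enumerate(signal):
        predicted = sum(c * x for c, x in zip(coeffs, design_row(start + offset)))
        if observed - predicted > threshold:
            flagged.append(start + offset)
    return flagged
```

The history should cover at least one full cycle of the longest period, otherwise the normal equations become ill-conditioned.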
  • The second method cuts the digital documents at spaces and punctuation.
  • This approach has the advantage of being very simple and easy to implement.
  • The cutting thus produced does not correspond to a very precise morphological analysis but is sufficient, in the context of the invention, to obtain labels for each of the peaks, increases or abnormal or exceptional variations in the flow rate;
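A minimal sketch of this cutting, assuming that a regular expression over word characters is an acceptable approximation of "spaces and punctuation" and that lowercasing is wanted (the text does not mention case folding); the function name `cut` is hypothetical:

```python
import re

# Runs of word characters; whitespace and punctuation act as separators.
TOKEN = re.compile(r"\w+")

def cut(text):
    """Split a document into lowercase character strings at spaces and punctuation."""
    return [token.lower() for token in TOKEN.findall(text)]
```

For example, `cut("Goal! What a match, unbelievable...")` yields `["goal", "what", "a", "match", "unbelievable"]`. Apostrophes and hyphens also act as separators here, which is consistent with the stated lack of precise morphological analysis.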
  • The second method cuts the digital documents according to a segmentation model based on statistical data, grammatical rules, a dictionary or hidden Markov chains.
  • This approach has the advantage of being able to extract character strings from digital documents written in languages where words are generally not separated by spaces or punctuation, such as Japanese, Chinese or Thai;
  • The second method consists of a first step of identifying the language of the digital document, followed by a set of word-separation methods specialized for each of the languages processed.
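The two-stage variant (language identification, then a language-specific separator) could look like the sketch below. The language detector is a deliberately crude stand-in based on Unicode character names, and the dictionary-based separator uses forward maximum matching, which is only one of the segmentation models the text allows. The dictionaries, function names and the `max_len` parameter are hypothetical, and Thai is left out because it would need additional handling of combining marks.

```python
import re
import unicodedata

WORD_RUN = re.compile(r"\w+")

def detect_language(text):
    """Crude stand-in for language identification, based on the scripts present."""
    names = [unicodedata.name(ch, "") for ch in text if ch.isalpha()]
    if any(name.startswith(("HIRAGANA", "KATAKANA")) for name in names):
        return "japanese"
    if any(name.startswith("CJK") for name in names):
        return "chinese"
    return "spaced"

def max_match(run, dictionary, max_len=4):
    """Forward maximum matching: take the longest dictionary entry at each position."""
    tokens, i = [], 0
    while i < len(run):
        for size in range(min(max_len, len(run) - i), 0, -1):
            piece = run[i:i + size]
            if size == 1 or piece in dictionary:
                tokens.append(piece)
                i += size
                break
    return tokens

def segment(text, dictionaries):
    """Identify the language, then apply the separator chosen for that language."""
    language = detect_language(text)
    runs = WORD_RUN.findall(text)
    if language == "spaced":
        return runs
    dictionary = dictionaries.get(language, set())
    return [token for run in runs for token in max_match(run, dictionary)]
```

With a toy dictionary `{"chinese": {"北京", "天安门"}}`, the call `segment("我爱北京天安门", ...)` returns `["我", "爱", "北京", "天安门"]`.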
  • The third method works by eliminating, from the character strings determined by the second method, those that appear in a list of stop words (empty words or function words). This approach has the advantage of avoiding labels constructed from stop words;
  • The third method works by calculating the "TF-IDF" product for the occurrences of the character strings extracted by the second method, then selecting the string or strings for which this product is the highest;
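The stop-word elimination and TF-IDF selection can be sketched as follows. The text does not fix the exact TF-IDF formula, so this example assumes that the term frequency is counted over the documents of the peak period, that the document frequency is counted over the whole stream, and that a smoothed logarithm is used; the stop-word list and the function name `top_strings` are hypothetical.

```python
import math
from collections import Counter

STOP_WORDS = {"the", "a", "of", "and", "to", "in", "is", "for"}  # hypothetical list

def top_strings(period_docs, all_docs, n=5, stop_words=STOP_WORDS):
    """Return the n strings with the highest TF-IDF product for one peak period.

    period_docs and all_docs are lists of token lists (output of the second method).
    """
    # TF: occurrences in the period, stop words removed.
    tf = Counter(t for doc in period_docs for t in doc if t not in stop_words)
    # DF: number of documents of the whole stream containing the string.
    df = Counter(t for doc in all_docs for t in set(doc))
    total = len(all_docs)
    # Assumption: smoothed IDF, which also avoids a division by zero.
    scores = {t: count * math.log((1 + total) / (1 + df[t])) for t, count in tf.items()}
    # Ties are broken alphabetically to keep the result deterministic.
    return sorted(scores, key=lambda t: (-scores[t], t))[:n]
```

The default `n=5` follows the example value given later in the description.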
  • The fourth method works by searching for the character string, composed of a set of morphemes distinguished by the second method and present in the digital documents, that maximizes a function defined as the sum of the frequencies of all the character substrings of this string in all the digital documents;
  • The process as a whole is implemented in a device that presents the operator with a graph of the flow rate, highlights the main peaks, increases or abnormal or exceptional variations in the flow rate, and displays, statically or interactively, labels associated with these peaks, increases or variations;
  • The method as a whole is implemented in a device coupled with a parameterizable filtering system, which presents to the operator a graph of the flow rate of a subset of the analyzed flow, highlights the main peaks, increases or abnormal or exceptional variations in the flow rate, and associates them with labels.
  • This device advantageously allows the operator to adjust the filtering to analyze more particularly the flow rate with respect to these peaks, to obtain more information on these peaks or the rest of the curve, and possibly to reveal other peaks;
  • FIG. 2 represents a device which presents to the operator a graph of the flow rate by highlighting the main peaks, increases or abnormal or exceptional variations in the flow rate and which is coupled to a notification system;
  • FIG. 1 represents the composition of the various processes and the flow (11) of digital documents through a device according to the invention.
  • The digital documents are initially stored in alphanumeric form in a table of a relational database (10).
  • Each digital document is stored in a row comprising a column with the text of the document and a column with the date of publication of the document if it exists, or otherwise the date on which the document was retrieved.
  • The relational database is configured to index the date column with an ordered index, for example in the form of a B-tree.
  • When the operator (27) interrogates the monitoring device (26), the device (12) first queries the relational database, using the aforementioned index, to count, for each period of time (hour or day), the number of documents stored in the database over a window chosen by the operator.
  • This information makes it possible to draw, on the terminal (22), the flow rate curve of the documents, an example of which is shown in FIG. 3.
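The storage and counting scheme described above can be sketched with SQLite, whose indexes are B-trees. The table name, column names and sample rows are hypothetical; dates are assumed to be stored as ISO 8601 strings so that they sort chronologically.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# One row per document: its text and its publication (or retrieval) date.
conn.execute("CREATE TABLE documents (body TEXT NOT NULL, published_at TEXT NOT NULL)")
# Ordered index on the date column.
conn.execute("CREATE INDEX idx_documents_date ON documents (published_at)")
conn.executemany(
    "INSERT INTO documents (body, published_at) VALUES (?, ?)",
    [
        ("first message", "2012-03-16 10:05:00"),
        ("second message", "2012-03-16 10:40:00"),
        ("third message", "2012-03-16 11:15:00"),
        ("fourth message", "2012-03-17 09:00:00"),
    ],
)

def hourly_counts(connection, start, end):
    """Count documents per hour over the window [start, end) chosen by the operator."""
    cursor = connection.execute(
        """
        SELECT strftime('%Y-%m-%d %H:00', published_at) AS hour, COUNT(*)
        FROM documents
        WHERE published_at >= ? AND published_at < ?
        GROUP BY hour
        ORDER BY hour
        """,
        (start, end),
    )
    return cursor.fetchall()
```

Hours without any document do not appear in the result, so a plotting layer would need to fill them with zeros before drawing the curve.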
  • This curve can summarize a very large mass of documents. It can be refreshed in real time when new documents are stored in the relational database (10).
  • The device (12) implements the method (1) to identify periods of peaks, increases or abnormal or exceptional variations. These peak periods can be highlighted by a marker (31) at the local maximum on the interface of the terminal (22).
  • The device (13) queries the relational database, using the aforementioned index, to implement the method (2) in order to associate with each document a sequence of character strings, each representing a morpheme or a group of morphemes.
  • The documents, associated with these string sequences, and the identified periods are then used by a device (14) implementing the method (3) to determine the most frequent character strings in each identified period with respect to the set of documents.
  • This method (3) works by first eliminating the words that belong to stop word lists; then, for each of the character strings, the device calculates the product called "TF-IDF" and retains the n strings for which this product is the highest, n being a parameter of the process whose value can be, for example, 5.
  • The documents, associated with string sequences representing a morpheme or group of morphemes, as well as the n most frequent character strings for each identified period, are used by a device (15) implementing the method (4), which constructs, for each period, an associated label (30).
  • This label (30) is constructed by looking for the character string that includes one or more of the n strings retained by the device (14), that is included in the documents of the period, that is composed of a set of morphemes distinguished by the device (13), and that maximizes the function defined as the sum of the frequencies of all its character substrings in all the documents processed by the device (12).
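One possible reading of this label construction is sketched below. It assumes that candidate labels are contiguous sequences of morphemes taken from the documents of the period, that "substrings" are counted at the morpheme level rather than character by character, and that candidates are capped at `max_len` morphemes because the score otherwise grows with length. These choices and the function names are assumptions, not details given in the text.

```python
from collections import Counter

def ngrams(tokens, max_len):
    """All contiguous morpheme sequences of length 1..max_len."""
    for size in range(1, max_len + 1):
        for i in range(len(tokens) - size + 1):
            yield tuple(tokens[i:i + size])

def build_label(period_docs, all_docs, retained, max_len=3):
    """Pick the candidate whose sub-sequences are the most frequent in all documents."""
    freq = Counter(g for doc in all_docs for g in ngrams(doc, max_len))

    def score(candidate):
        # Sum of the frequencies of every contiguous sub-sequence of the candidate.
        return sum(freq[sub] for sub in ngrams(list(candidate), len(candidate)))

    # Candidates come from the period and must contain at least one retained string.
    candidates = {
        g
        for doc in period_docs
        for g in ngrams(doc, max_len)
        if any(t in g for t in retained)
    }
    if not candidates:
        return ""
    # Sorting first makes the choice deterministic when several scores are equal.
    return " ".join(max(sorted(candidates), key=score))
```

With the period documents `["goal", "for", "paris"]` and `["great", "goal", "for", "paris"]` and the retained strings `["goal", "paris"]`, the sketch returns the label `"goal for paris"`.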
  • FIG. 2 shows the integration of the various methods of the invention into a wider monitoring device (26).
  • A number of streams are published on the Internet (25) and are captured and stored in a relational database (10). These streams are filtered by a device (21) that selects the messages on a given subject.
  • The documents are then processed by a device (20) implementing a method according to the invention.
  • This device presents to the operator (27), on the terminal (22), a graph like that shown in FIG. 3. This graph shows a number of labels (30) allowing the operator (27) to interpret the peaks and the abnormal or exceptional variations of the flow rate.
  • This operator (27) can then modify the parameters of the filtering device (21) via a feedback loop (24).
  • The device (20) then produces a new curve (34) representing the flow rate of the stream defined by the filter parameters. This new curve has new peaks, increases or abnormal or exceptional variations that the device (20) identifies and for which it produces new labels (30).
  • The device (20) is also coupled to a notification system that allows the operator (28) to receive an alert on the terminal (23) when the flow rate of the stream shows a peak, an increase or an abnormal variation.
  • This alert is associated with a label (30) that allows the operator (28) to determine the cause of the peak and to decide whether it is necessary to analyze this variation via the terminal (22) or by searching in the digital documents which constitute the stream and which are stored in the database (10).
  • FIG. 3 represents a graph as generated by a device according to the invention.
  • The signal is represented in the form of a graph with time on the abscissa (32) and the flow per unit time on the ordinate (33).
  • This signal forms a curve (34) with peaks identified by the method (1) and highlighted by a marker at the local maximum (31). These markers are associated with the labels (30).
  • The morphemes or groups of morphemes are first extracted from the digital documents, which are stored in a relational database with the associated list of morphemes, before the method (1) identifies the peaks, increases or abnormal or exceptional variations.
  • When the volume of documents is too large to obtain a response within a reasonable time for the operator, the device (13) queries the relational database (10) to retrieve only a uniform pseudo-random sample of digital documents.
  • This random sample is biased to favor the periods of peaks and troughs revealed by the device (12). It was found that sampling is justified when the number of digital documents recorded in the relational database (10) and corresponding to the operator's selection exceeds 10,000. In this case, the sample size is 10,000, regardless of the actual volume of documents stored in the database.
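The two sampling variants can be sketched as follows. The cap of 10,000 comes from the text; the weighting scheme of the biased variant (weighted sampling without replacement using random keys raised to the power 1/weight), the `boost` factor and the function names are assumptions chosen for the example.

```python
import random

SAMPLE_CAP = 10_000  # sample size stated in the description

def uniform_sample(doc_ids, rng, cap=SAMPLE_CAP):
    """Uniform pseudo-random sample, used only when the selection exceeds the cap."""
    doc_ids = list(doc_ids)
    if len(doc_ids) <= cap:
        return doc_ids
    return rng.sample(doc_ids, cap)

def biased_sample(doc_ids, in_distinguished_period, rng, cap=SAMPLE_CAP, boost=3.0):
    """Sample that favors documents from periods flagged by the first method.

    in_distinguished_period is a hypothetical predicate: document id -> bool.
    """
    doc_ids = list(doc_ids)
    if len(doc_ids) <= cap:
        return doc_ids
    keyed = []
    for doc_id in doc_ids:
        weight = boost if in_distinguished_period(doc_id) else 1.0
        # A larger weight pushes the random key towards 1, so the item is kept more often.
        keyed.append((rng.random() ** (1.0 / weight), doc_id))
    keyed.sort(reverse=True)
    return [doc_id for _, doc_id in keyed[:cap]]
```

Passing an explicit `random.Random(seed)` instance keeps the sample reproducible, which helps when the operator refreshes the same view.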
  • The relational database (10) is replaced by a buffer memory that can contain a certain number of digital documents covering a sufficient period with respect to the operator's queries.
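A buffer of this kind could be sketched as a bounded queue that also discards documents older than a chosen horizon. The class name, the `horizon` and `max_documents` parameters and the assumption that documents arrive in chronological order are all hypothetical.

```python
from collections import deque
from datetime import datetime, timedelta

class DocumentBuffer:
    """In-memory replacement for the database: bounded in size and in age."""

    def __init__(self, horizon, max_documents):
        self.horizon = horizon
        # deque drops the oldest entry automatically once max_documents is reached.
        self.documents = deque(maxlen=max_documents)

    def add(self, timestamp, text):
        """Store a document and evict those older than the horizon."""
        self.documents.append((timestamp, text))
        cutoff = timestamp - self.horizon
        while self.documents and self.documents[0][0] < cutoff:
            self.documents.popleft()

    def window(self, start, end):
        """Documents whose date falls in [start, end)."""
        return [(ts, text) for ts, text in self.documents if start <= ts < end]
```

The horizon would be chosen to cover the longest window an operator is expected to query.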
  • The digital documents are multimedia documents.
  • The method (2) of morphological analysis comprises a text extraction method based on speech recognition or optical recognition.
  • The morphological analysis method (2) is coupled to an automatic translation method.
  • The method and the device according to the invention are particularly intended for community monitoring on social networks.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Business, Economics & Management (AREA)
  • General Business, Economics & Management (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
EP12710075.8A 2011-03-18 2012-03-16 Method and device for identifying and labeling peaks, increases or abnormal or exceptional variations in the throughput of a digital document stream Ceased EP2686782A1 (de)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
FR1100817A FR2972822A1 (fr) 2011-03-18 2011-03-18 Method and device for identifying and labeling peaks, increases or abnormal or exceptional variations in the throughput of a stream of digital documents
PCT/EP2012/054666 WO2012146440A1 (fr) 2011-03-18 2012-03-16 Method and device for identifying and labeling peaks, increases or abnormal or exceptional variations in the throughput of a stream of digital documents

Publications (1)

Publication Number Publication Date
EP2686782A1 (de) 2014-01-22

Family

ID=45875953

Family Applications (1)

Application Number Title Priority Date Filing Date
EP12710075.8A Ceased EP2686782A1 (de) Method and device for identifying and labeling peaks, increases or abnormal or exceptional variations in the throughput of a digital document stream 2011-03-18 2012-03-16

Country Status (4)

Country Link
US (1) US20150205862A1 (de)
EP (1) EP2686782A1 (de)
FR (1) FR2972822A1 (de)
WO (1) WO2012146440A1 (de)

Families Citing this family (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9703827B2 (en) * 2014-07-17 2017-07-11 Illumina Consulting Group, Inc. Methods and apparatus for performing real-time analytics based on multiple types of streamed data
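CN109840471B (zh) * 2018-12-14 2023-04-14 Tianjin University Feasible road segmentation method based on an improved Unet network model
CN110348471B (zh) * 2019-05-23 2023-09-01 Ping An Technology (Shenzhen) Co., Ltd. Abnormal object identification method, apparatus, medium and electronic device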
US11086948B2 (en) 2019-08-22 2021-08-10 Yandex Europe Ag Method and system for determining abnormal crowd-sourced label
US11710137B2 (en) 2019-08-23 2023-07-25 Yandex Europe Ag Method and system for identifying electronic devices of genuine customers of organizations
US11108802B2 (en) 2019-09-05 2021-08-31 Yandex Europe Ag Method of and system for identifying abnormal site visits
RU2757007C2 (ru) 2019-09-05 2021-10-08 Yandex LLC Method and system for determining malicious actions of a certain kind
US11128645B2 (en) 2019-09-09 2021-09-21 Yandex Europe Ag Method and system for detecting fraudulent access to web resource
US11334559B2 (en) 2019-09-09 2022-05-17 Yandex Europe Ag Method of and system for identifying abnormal rating activity
RU2752241C2 (ru) 2019-12-25 2021-07-23 Yandex LLC Method and system for detecting malicious activity of a predefined type in a local network

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2897942B2 (ja) 1992-07-20 1999-05-31 CSK Corporation Japanese morphological analysis system and morphological analysis method
US7245769B2 (en) * 2002-02-12 2007-07-17 Visioprime Archival of transformed and compressed data
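JP4342575B2 (ja) * 2007-06-25 2009-10-14 Toshiba Corporation Apparatus, method, and program for presenting keywords
JP5078674B2 (ja) * 2008-02-29 2012-11-21 International Business Machines Corporation Analysis system, information processing apparatus, activity analysis method, and program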
US8226568B2 (en) * 2008-07-15 2012-07-24 Nellcor Puritan Bennett Llc Signal processing systems and methods using basis functions and wavelet transforms

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
None *
See also references of WO2012146440A1 *

Also Published As

Publication number Publication date
FR2972822A1 (fr) 2012-09-21
US20150205862A1 (en) 2015-07-23
WO2012146440A1 (fr) 2012-11-01

Similar Documents

Publication Publication Date Title
EP2686782A1 (de) Method and device for identifying and labeling peaks, increases or abnormal or exceptional variations in the throughput of a digital document stream
Nguyen et al. Automatic image filtering on social networks using deep learning and perceptual hashing during crises
US7577963B2 (en) Event data translation system
US20140337328A1 (en) System and method for retrieving and presenting concept centric information in social media networks
Weiler et al. Event identification and tracking in social media streaming data
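BE1025503B1 (fr) Line segmentation method
CN111581956B (zh) Sensitive information identification method and system based on a BERT model and k-nearest neighbors
CN106844638B (zh) Information retrieval method, apparatus and electronic device
CN114915468B (zh) Knowledge-graph-based intelligent analysis and detection method for cybercrime
CN116756688A (zh) Public opinion risk discovery method based on a multimodal fusion algorithm
KR20130037975A (ko) Method and apparatus for web trend analysis based on issue template extraction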
CA3182733A1 (en) Vector space model for form data extraction
EP2013776A1 (de) Method for rapid deduplication of a set of documents or a set of data contained in a file
CN117173608A (zh) Video content review method and system
FR2929426A1 (fr) Method and system for score assignment
Hisham et al. An innovative approach for fake news detection using machine learning
US20190370531A1 (en) Data processing apparatus, data processing method, and non-transitory storage medium
Sumathi et al. Fake review detection of e-commerce electronic products using machine learning techniques
Zendah et al. Detecting Significant Events in Arabic Microblogs using Soft Frequent Pattern Mining.
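CN116723005A (zh) Method and system for tracking implicit intelligence of malicious code under polymorphic hiding
CN116401434A (zh) Intelligent network data information extraction system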
Khan et al. Object analysis in image mining
CN108052503B (zh) Confidence calculation method and device
Prasad et al. Face-Based Alumni Tracking on Social Media Using Deep Learning
Hurst Temporal Text Mining.

Legal Events

Date Code Title Description
PUAI Public reference made under article 153(3) epc to a published international application that has entered the european phase

Free format text: ORIGINAL CODE: 0009012

17P Request for examination filed

Effective date: 20131017

AK Designated contracting states

Kind code of ref document: A1

Designated state(s): AL AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HR HU IE IS IT LI LT LU LV MC MK MT NL NO PL PT RO RS SE SI SK SM TR

DAX Request for extension of the european patent (deleted)
17Q First examination report despatched

Effective date: 20170608

RAP1 Party data changed (applicant data changed or rights of an application transferred)

Owner name: SEMIOCAST

RIN1 Information on inventor provided before grant (corrected)

Inventor name: GUYOT, PAUL

Inventor name: JULIEN, DAVID

Inventor name: CAMPAGNE, JEAN-CHARLES

REG Reference to a national code

Ref country code: DE

Ref legal event code: R003

STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: THE APPLICATION HAS BEEN REFUSED

18R Application refused

Effective date: 20190530