EP3782054A1 - Method and device for verifying the author of a short message - Google Patents

Method and device for verifying the author of a short message

Info

Publication number
EP3782054A1
Authority
EP
European Patent Office
Prior art keywords
text
texts
questioned
author
windows
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
EP19724617.6A
Other languages
English (en)
French (fr)
Inventor
Guy GENILLOUD
Alexandre-Pierre COTTY
Augustin Camille KASSER
Antoine JOVER
Adrien DONNET-MONAY
Florent DEVILLARD
Constanze ANDEL RIMENSBERGER
Valentin ROTEN
Stefan CODRESCU
Alain Favre
Luc-Olivier POCHON
Lionel POUSAZ
Claire ROTEN
Stéphane RIAND
Serge NICOLLERAT
Myriam EUGSTER
Jean-Luc BUHLMANN
Léonard André Henri STUDER
Claude-Alain ROTEN
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Orphanalytics SA
Original Assignee
Orphanalytics SA
Application filed by Orphanalytics SA
Publication of EP3782054A1
Legal status: Withdrawn

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/216 Parsing using statistical methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G06F18/231 Hierarchical techniques, i.e. dividing or merging pattern sets so as to obtain a dendrogram
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/253 Grammatical analysis; Style critique

Definitions

  • the present invention relates to the problem of assigning an author to a text, in particular a short text, for example a text of less than 500 characters.
  • WO2008/036059 discloses an author identification method based on the linguistic analysis of units of the text.
  • the linguistic analysis is based, for example, on lexical analysis, including the frequency of occurrence of certain words or prepositions, as well as on stylometric analysis, including the punctuation, the average word length, the number of short words, or the average paragraph length.
  • JGAAP Java Graphical Authorship Attribution Program
  • these objects are achieved in particular by means of parameters characterizing the style of the document, or a window in the document.
  • the choice of these style parameters and/or their values can be determined automatically. They advantageously make it possible to characterize the style of a window automatically and objectively.
  • the invention also relates to a method for verifying whether a questioned text, complete or fragmented, of less than 500 characters was written by a given author, including the following steps:
  • This method can be performed by a computer or other digital processing system. It has the advantage of only having steps that can be effectively implemented by a digital processing system, but would be very difficult or virtually impossible to achieve without the assistance of such a system.
  • this method allows an efficient, high-performance computer implementation.
  • the questioned text can be a complete text, for example a message of less than 500 characters, or a fragment of less than 500 characters extracted from a complete text.
  • Clustering consists of a grouping of points.
  • This hierarchical clustering minimizes the distances in a dendrogram (we speak of cophenetic distances).
  • the method of the invention thus combines two statistical analysis tools that are normally used.
  • a multivariate statistical analysis (abbreviated ASM below), for example a PCA or a PCoA
  • This clustering can use methods such as UPGMA, Minimum Variance, WPGMA or NJ (neighbor joining), for example.
  • the result of the ASM is an N-dimensional coordinate matrix, which is then subjected to a hierarchical clustering based on the distances between points of the multidimensional space.
  • the result obtained can be represented by a dendrogram which, if it is robust, makes it possible to decide whether or not a text can be attributed to an author.
  • the method may comprise the establishment of a robustness measurement of the dendrogram using a cophenetic correlation coefficient.
  • This dendrogram evaluation technique makes it possible to use the results of the process more often even when the cophenetic correlation coefficient is medium or even low.
  • a visual confirmation of the robustness of a dendrogram can be obtained by comparing its structure with that of other dendrograms obtained by different clustering methods (UPGMA, Minimum Variance, WPGMA, NJ, etc.).
  • the robustness of a dendrogram is furthermore testable either by statistically analyzing the cophenetic distance measurements, or by comparing the close relations of the terminal buds ("leaf nodes") of the dendrogram.
  • Author attribution is done by confirming or rejecting the distribution of texts according to a starting hypothesis, HD1, according to which the questioned text is attributed to an author.
  • the questioned text or texts are compared in turn with texts of at least two reference authors (known authors who have certified that they produced their texts). These reference texts are similar in nature, number and size to the questioned texts.
  • Dendrograms that compare three authors are generated.
  • each author is tested by pair of authors 210 times.
  • a statistical count is established to determine the number of times that the assumption underlying each dendrogram is verified.
  • the frequency of results in favor of the hypothesis is established.
  • the 350 tests, which compare only reference authors, make it possible to establish the height of the signal necessary for the acceptance of the hypothesis of the author attribution of the texts questioned.
  • the robustness of the approach is tested by formulating a new hypothesis HD2, for example by adding to the texts questioned in HD1 one or more texts of the same author or another author. Several initial hypotheses concerning texts attributed to the author are thus testable in parallel.
  • the invention also starts from the observation that semantic patterns (for example the number of occurrences of words or lemmas) in a short text are not very useful for identifying an author, because this type of pattern is statistically too rare to provide a reliable indication of the author.
  • the method of the invention therefore proposes to use only relatively frequent patterns, for example letter patterns.
  • the method also proposes to standardize the text, replacing all capital letters with lower-case letters and all accented letters with the corresponding unaccented lower-case letter (e.g. the letter "é" is replaced by "e", "ç" by "c", etc.). Surprisingly, it was found that this
  • the problem of verifying the author of a short questioned text, for example a text of less than 500 words, is in particular solved by a method comprising the following steps:
  • predefined patterns comprising exclusively intra-word and/or inter-word patterns
  • standardization preferably converts the basic text into a text with only 27 characters (26 letters and the space symbol).
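  • The standardization step above can be sketched as follows. This is a minimal illustration, not the applicant's implementation: the function name `normalize` is hypothetical, and the handling of digits, punctuation and non-decomposable letters (mapped to a space, with runs of spaces collapsed) is an assumption, since the source only states the 27-character target alphabet.

```python
import unicodedata


def normalize(text: str) -> str:
    """Reduce a text to 27 symbols: the letters a-z and the space."""
    # Lower-case, then split accented letters into base letter + combining mark.
    decomposed = unicodedata.normalize("NFD", text.lower())
    kept = []
    for ch in decomposed:
        if unicodedata.combining(ch):
            continue  # drop the accent itself, keep the base letter
        # Assumption: anything that is not a-z becomes a space.
        kept.append(ch if "a" <= ch <= "z" else " ")
    # Assumption: runs of spaces are collapsed and the ends are trimmed.
    return " ".join("".join(kept).split())


print(normalize("Été, Ça va?"))
```

Letters without a canonical decomposition (for example "œ" or "ß") would be replaced by a space in this sketch; a real pipeline would probably map them explicitly.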
  • the cutting is advantageously independent of the contents; for example, it is advantageous to cut a text or another sequence of symbols into windows that all have the same length, or almost all (with the possible exception of the first or the last window).
  • This feature makes it possible to compare windows of optimal length, that is, neither too short, to avoid style measurements disturbed by rare events, nor too long, to allow plagiarism detection on short sequences.
  • the length of the windows is advantageously between 150 and 2000 characters. In this case, the questioned text is thus not cut out;
  • the windows are preferably offset from each other by characters, some windows comprising a portion of the end of the text and a portion of the beginning of the text. This cyclization makes it possible to stabilize the final stylometric signal.
  • the patterns preferably correspond to either:
  • the analysis may be a multivariate analysis (PCA or PCoA).
  • the method may comprise a step of clustering the results of the multivariate analysis (UPGMA, Minimum Variance, WPGMA, NJ, etc.).
  • the analysis may be based on a measurement of distance to the centers of gravity.
  • the method may include the establishment of a dendrogram to determine if two texts were produced by the same author.
  • the questioned text is attributed to an author by confirming or rejecting a distribution of the texts according to an attribution hypothesis.
  • two sub-clusters of questioned texts are created from the group of questioned texts, according to their distance from one of said groups of reference texts, and the difference between the mean cophenetic distances from the fragments of each sub-cluster to a group of reference texts is determined in order to decide whether or not the two sub-clusters come from the same author.
  • the type of distance used during the multivariate statistical analysis can be selected according to the analysis strategy. For example, we will preferably choose a Boolean distance for a short text, and another distance, for example a Euclidean distance, for a longer text. The type of distance used during construction of the dendrogram may be selected.
  • a first type of distance for a multivariate approach and a second type of distance for an approach based on a dendrogram, and a third type for an approach based on the distance to a centroid will be chosen.
  • the type of distance used for measuring the distances to the centers of gravity can be selected according to the analysis strategy.
  • Statistically weighted distances can be used, e.g. the standardized Euclidean distance, weighted according to the standard deviation.
  • the set of selectable distances includes at least two distances, for example two distances chosen from the following: chord distance, Euclidean, normalized Euclidean, Manhattan, Canberra, chi-square (χ²), generalized Jaccard distance.
  • the invention comes from the observation that these language bricks are highly personal and difficult to manipulate.
  • the style parameters of each portion of text thus constitute a biometric trace of the author's stylometric signature. It is observed that the style parameters associated with each author depend on his way of thinking, much like the phrasing played by a jazzman is highly personal.
  • the letter patterns in a text naturally depend on the type of text. In French, a medical text presents a high occurrence of the trigrams "ose" or "ite".
  • the method may include calculating a stylometric distance between the numbers of occurrences of patterns in a text to be verified and in a reference text: for example a chord distance, Euclidean, normalized Euclidean, Manhattan, Canberra, chi-square (χ²), etc. It can be measured between two windows, between a window and a group of windows, or between two groups of windows representing all or part of one or more sequences of letters.
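  • Several of the distances named above can be sketched on two vectors of pattern counts, as below. The function names are illustrative. The chi-square formula is one common symmetric variant and is an assumption, because the source does not give its definition; the chord distance is taken in its usual sense (Euclidean distance between vectors scaled to unit length).

```python
from math import sqrt


def euclidean(x, y):
    return sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def manhattan(x, y):
    return sum(abs(a - b) for a, b in zip(x, y))


def canberra(x, y):
    # Dimensions where both counts are zero are skipped (0/0 treated as 0).
    return sum(abs(a - b) / (abs(a) + abs(b)) for a, b in zip(x, y) if a or b)


def chord(x, y):
    # Euclidean distance after scaling both vectors to unit length.
    # Raises ZeroDivisionError if a vector contains only zeros.
    nx = sqrt(sum(a * a for a in x))
    ny = sqrt(sum(b * b for b in y))
    return euclidean([a / nx for a in x], [b / ny for b in y])


def chi_square(x, y):
    # Assumed variant: sum of (a - b)^2 / (a + b) over non-empty dimensions.
    return sum((a - b) ** 2 / (a + b) for a, b in zip(x, y) if a + b)


# Two toy count vectors over the same ordered list of patterns.
print(euclidean([0, 3], [4, 0]), manhattan([0, 3], [4, 0]))
```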
  • the analysis of the occurrences of predefined patterns may include groupings by different multivariate statistical treatments.
  • a principal component analysis (PCA) or a principal coordinate analysis (PCoA, also called multidimensional scaling, MDS), working on mathematical distances defined between observations of style parameters (e.g. bigrams), reduces the number of original dimensions (the number of types of bigrams).
  • Such groupings make it possible to detect the most characteristic style parameters of an author.
  • the Euclidean distance is computed directly, without multivariate statistical processing. This approach is more sensitive to noise, since the stylometric distance between two windows takes into account all style parameters, even the least individual ones. On the other hand, it avoids using the most characteristic style parameters with less
  • Figure 2 illustrates the memory of the device of Figure 1
  • Figure 6 shows the first two dimensions of an ASM on trigrams taken from text fragments obtained after cutting to about 500 characters.
  • Figure 7 is drawn from an ASM (like that of Figure 6) and illustrates the distance of each text fragment to the barycenters of three clusters.
  • Figure 8 illustrates an example of a dendrogram.
  • Figure 9 illustrates an example of a perfect dendrogram.
  • Figure 10 shows a first example of an almost perfect dendrogram.
  • Figure 11 shows a second example of an almost perfect dendrogram.
  • Figure 12 illustrates an example of a dendrogram with two entangled branches.
  • Figure 13 illustrates an example of an entangled three-branched dendrogram.
  • the method of detecting breaks in style described in this application has the particular advantage of being implemented by means of a computer device 1, for example a computer or a server such as the one illustrated schematically in Figure 1.
  • This device comprises in particular one or more processors 10, a random access memory 11, a read-only memory 12, a graphics card 13 for controlling a screen 17, an input-output port, for example a USB port 14, allowing the connection of external peripherals such as a scanner 18, a printer, etc., a network card 15 for connection to a network 19, for example an Ethernet network, and data input devices such as a keyboard, a mouse, a touch screen, etc.
  • the memory 11 comprises a portion 110 for the operating system, a portion 111 for the data and a portion 112 for the application programs.
  • This portion 112 includes in particular a windowing module 113, a stylistic parameter determination module 114, a stylistic distance calculation module 115, and a style break identification module 116.
  • the "modules" above advantageously consist of portions of computer code, for example programs, program extracts, routines, procedures, etc., arranged to be executed by the microprocessor 10 in order to make it carry out, respectively, the windowing, the determination of stylistic parameters, the stylistic distance calculation and the identification of breaks in style that will be described below as an example.
  • These modules can be stored on a computer medium, for example a cd-rom, a hard disk, a flash memory, etc., before being loaded into memory 11 as illustrated.
  • the method makes it possible to check the style of a document and to compare it with the style of a reference document in order to determine whether they were written by the same author.
  • by style, we mean the catalog of occurrences of predefined letter patterns.
  • the first step of the method is therefore to obtain in electronic copy at least one short text to test (questioned text) and at least one reference text of the author to check (reference text).
  • the reference text may be longer than the text questioned.
  • This sequence of symbols can be loaded for example from the Internet, via e-mail, from a removable data medium etc.
  • a windowing module 113 normalizes the questioned text, and at least one reference text, by removing the symbols of
  • the windowing module 113 divides at least one reference text, and possibly the questioned text, into a plurality of windows 20A, 20B, and so on.
  • Each window 20 is constituted by a sequence of L consecutive letters within the complete sequence.
  • Window cutting is preferably independent of the contents; it is not therefore a division into grammatical or syntactic elements, and is independent for example of the beginning or the end of sentences, paragraphs or pages. This allows analysis with window sizes independent of the author's style. This also allows an analysis of punctuation sequences by windows of fixed length.
  • the windows 20 overlap partially, in the sense that certain symbols, or even most of the symbols
  • the window 20A comprises the sequence of characters
  • Window 20B is obtained from the first window 20A and the symbol sequence 2 by an offset of K symbols, here 20.
  • K offset values different from 20 can also be used, provided that K is less than the length L of the windows.
  • the offset value can be a parameter chosen by the user during the execution of the program, depending on the type of documents, the computing power available, the required accuracy, etc.
  • the offset value can be derived from one or more other user-selected parameters. For example, the user chooses a degree of coverage C, indicating the number of windows to which each symbol must belong simultaneously, and the value of K is calculated accordingly.
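  • The windowing described above (windows of L symbols, offset K, with the last windows wrapping around to the start of the text) can be sketched as follows. The function names are hypothetical, and the rule K = L // C for deriving the offset from the degree of coverage is an assumption: the source only says that K is calculated from C.

```python
def cyclic_windows(text, length, offset):
    """Cut `text` into windows of `length` symbols, each shifted by `offset`.

    Windows that run past the end wrap around to the beginning (cyclization).
    A text no longer than one window is returned uncut.
    """
    if not 0 < offset < length:
        raise ValueError("offset K must satisfy 0 < K < L")
    n = len(text)
    if n <= length:
        return [text]
    doubled = text + text  # makes wrap-around slicing trivial
    return [doubled[start:start + length] for start in range(0, n, offset)]


def offset_from_coverage(length, coverage):
    """Assumed rule: each symbol falls in about `coverage` windows if K = L // C."""
    return max(1, length // coverage)


print(cyclic_windows("abcdefghij", 4, 2))
print(offset_from_coverage(500, 3))
```

In the toy call, every character belongs to exactly two windows; the coverage is only approximate when K does not divide both the window length and the text length.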
  • the module 114 determines the number of occurrences of predefined patterns in each window.
  • the number of patterns counted in each window can be significant; for example, in the case of trigrams over the 27-character alphabet, the number of possible trigrams is 27 × 27 × 27 = 19,683.
  • the counted patterns are exclusively patterns that can occur in statistically representative quantities in a short text.
  • the semantic patterns are preferably excluded, since the probability of finding the same word several times in a short text is low.
  • the following pattern occurrences can be counted:
  • each sequence may include one or more insert characters (<a*a>, <a*b>, etc.; <a**a>, <a**b>, etc., where the insert character * can be any character).
  • the patterns counted include an accumulation of bigram signals, trigram signals, etc., in order to perform a multivariate analysis on all of these dimensions.
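  • As an illustration of the pattern counts discussed above, the sketch below counts contiguous n-grams and gapped patterns of the <a*b> kind in one window, and accumulates them into a single style vector. The function names and the particular mix of patterns in `style_vector` (bigrams, trigrams, one-character gaps) are assumptions made for the example.

```python
from collections import Counter


def count_ngrams(window, n):
    """Occurrences of every run of n consecutive characters."""
    return Counter(window[i:i + n] for i in range(len(window) - n + 1))


def count_gapped(window, gap):
    """Occurrences of patterns such as 'a*b': two characters separated by
    `gap` arbitrary characters, each written as '*'."""
    span = gap + 2
    return Counter(
        window[i] + "*" * gap + window[i + span - 1]
        for i in range(len(window) - span + 1)
    )


def style_vector(window):
    """Accumulate several pattern families into one sparse count vector.
    The '*' marker cannot collide with n-grams of normalized text."""
    counts = Counter()
    counts.update(count_ngrams(window, 2))
    counts.update(count_ngrams(window, 3))
    counts.update(count_gapped(window, 1))
    return counts


print(style_vector("abab"))
```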
  • the analysis thus includes a principal component analysis (PCA) in order to group the counts of different patterns.
  • the analysis comprises a PCoA (Principal Coordinate Analysis).
  • Figure 4 illustrates the position in a three-dimensional space of 17 windows each represented by a symbol, resulting from a multivariate analysis.
  • Each axis can for example correspond to the frequency of a pattern; in a variant, each axis corresponds to a dimension obtained after a multivariate analysis, following the dimensionality reduction performed by a multivariate statistical processing to optimize the variance between windows carried by the style parameters.
  • the circles correspond to windows written by a first author, the two triangles to windows written by a second author; the stars correspond to the average points of the groups of windows corresponding to each of the two authors. Obviously, the number of dimensions can be much larger than three when more than three distinct patterns are extracted from each window 20 and these patterns are not grouped together.
  • Figure 5 maps the distance to the average point of each window (20A, 20B, ...., 20i) on a curve.
  • the large distance jump between the window 20A and the window 20B at the beginning of the sequence shows a break of style between these two windows and is an indication of a change of author.
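  • The curve of Figure 5 can be approximated with a short sketch: compute the mean point of all windows, the Euclidean distance of each window to it, and the position of the largest jump between consecutive windows. The function names and the toy coordinates are illustrative, and the use of the Euclidean distance and of a single global mean point are assumptions.

```python
from math import sqrt


def centroid(points):
    """Mean point of a list of equal-length coordinate vectors."""
    n = len(points)
    return [sum(p[k] for p in points) / n for k in range(len(points[0]))]


def distances_to_centroid(points):
    c = centroid(points)
    return [sqrt(sum((a - b) ** 2 for a, b in zip(p, c))) for p in points]


def largest_jump(distances):
    """Index i and size of the biggest change between window i and i + 1."""
    jumps = [abs(distances[i + 1] - distances[i]) for i in range(len(distances) - 1)]
    i = max(range(len(jumps)), key=jumps.__getitem__)
    return i, jumps[i]


# Toy data: the first window lies far from the four others.
windows = [[10, 0], [0, 0], [0, 1], [1, 0], [1, 1]]
index, size = largest_jump(distances_to_centroid(windows))
print(index, round(size, 2))
```

A jump located between window 0 and window 1 corresponds to the situation described for windows 20A and 20B.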
  • the mathematical stylometric distance between points may be a Euclidean distance, a Manhattan distance, or a cosine (cos θ) distance, for example.
  • the stylometric distance used is a Boolean distance, for example a distance between two binary vectors (called the binary distance), each component of the vector indicating the presence or absence of a stylometric pattern.
  • For example, the Jaccard, Rogers-Tanimoto, Simpson or Yule sigma distances can be used.
  • a description of this type of distance and its use in clustering is presented by Seung-Seok Choi et al. in "A Survey of Binary Similarity and Distance Measures", Systemics, Cybernetics and Informatics, Vol. 8, No. 1, 2010.
  • This type of distance makes it possible to work with a large number of dimensions and is therefore particularly suited to the cumulative approaches mentioned above, in which a large number of different patterns are counted. It therefore makes it possible to measure a distance over a large number of dimensions of a small object, for example a short text.
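  • Two of the binary distances named above can be sketched on presence/absence vectors as follows. The formulas are the standard Jaccard and Rogers-Tanimoto definitions expressed with the usual contingency counts a, b, c, d; the function names are illustrative and nothing here is taken from the applicant's code.

```python
def contingency(u, v):
    """Counts over paired binary components:
    a = present in both, b = only in u, c = only in v, d = absent from both."""
    a = sum(1 for x, y in zip(u, v) if x and y)
    b = sum(1 for x, y in zip(u, v) if x and not y)
    c = sum(1 for x, y in zip(u, v) if not x and y)
    d = sum(1 for x, y in zip(u, v) if not x and not y)
    return a, b, c, d


def jaccard_distance(u, v):
    a, b, c, _ = contingency(u, v)
    # Two vectors with no pattern present at all are treated as identical.
    return 1 - a / (a + b + c) if a + b + c else 0.0


def rogers_tanimoto_distance(u, v):
    a, b, c, d = contingency(u, v)
    return 1 - (a + d) / (a + d + 2 * (b + c))


# Each component says whether a given pattern occurs in the text.
u = [1, 1, 0, 0]
v = [1, 0, 1, 0]
print(jaccard_distance(u, v), rogers_tanimoto_distance(u, v))
```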
  • the stylometric distance calculation module 115 then groups the text extracts by calculating the stylometric distance between points of the multidimensional space represented by a dendrogram.
  • the different questioned and reference texts are grouped using a classification/clustering method (such as UPGMA, UPGMC, Minimum Variance, WPGMA, WPGMC, NJ, etc.).
  • the result of multivariate statistical analysis is thus used to construct a taxonomy.
  • the result of this grouping is a dendrogram, that is to say a diagram which represents affinities (similarities of style) between texts, which may be questioned texts or reference texts.
  • the grouping of these texts is based on the coordinate matrix, which indicates the (dis)similarities or distances between texts. Texts of very similar styles are carried together by a common branch of the
  • a robust dendrogram makes it possible to decide whether or not a questioned text can be attributed to one of the authors in the comparison. On the other hand, no reliable decision can be made if the dendrogram is not robust enough.
  • a standard measure of robustness of a dendrogram is the cophenetic correlation coefficient. It is based on the cophenetic distances between the fragments, measured on the dendrogram. These distances are different from the original distances between the same fragments but measured in the ASM.
  • the cophenetic correlation coefficient evaluates the relation between the cophenetic distances (from the dendrogram) and the "original" distances (between the fragments in the ASM).
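  • The cophenetic correlation coefficient can be illustrated with a small pure-Python sketch: a naive UPGMA clustering records, for every pair of fragments, the height at which they are first merged (their cophenetic distance), and a Pearson correlation then compares these heights with the original distances. The function names are hypothetical; using the merge distance itself as the height, and choosing UPGMA among the clustering methods listed in the source, are assumptions of the example.

```python
from itertools import combinations
from math import sqrt


def upgma_cophenetic(dist):
    """Naive UPGMA on a symmetric distance matrix (list of lists).
    Returns the matrix of cophenetic distances (merge heights)."""
    n = len(dist)
    clusters = {i: [i] for i in range(n)}  # cluster id -> leaf indices
    d = {(i, j): dist[i][j] for i, j in combinations(range(n), 2)}
    coph = [[0.0] * n for _ in range(n)]
    next_id = n
    while len(clusters) > 1:
        (a, b), height = min(d.items(), key=lambda kv: kv[1])
        for i in clusters[a]:
            for j in clusters[b]:
                coph[i][j] = coph[j][i] = height
        merged = clusters[a] + clusters[b]
        new_d = {}
        for c, members in clusters.items():
            if c in (a, b):
                continue
            # UPGMA: unweighted average over all leaf pairs.
            total = sum(dist[i][j] for i in merged for j in members)
            new_d[(c, next_id)] = total / (len(merged) * len(members))
        d = {k: v for k, v in d.items() if a not in k and b not in k}
        d.update(new_d)
        del clusters[a], clusters[b]
        clusters[next_id] = merged
        next_id += 1
    return coph


def cophenetic_correlation(dist):
    """Pearson correlation between original and cophenetic distances.
    Undefined (division by zero) if either set of distances is constant."""
    coph = upgma_cophenetic(dist)
    pairs = list(combinations(range(len(dist)), 2))
    xs = [dist[i][j] for i, j in pairs]
    ys = [coph[i][j] for i, j in pairs]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Two tight pairs far from each other: the dendrogram fits the distances exactly.
D = [[0, 1, 4, 4], [1, 0, 4, 4], [4, 4, 0, 1], [4, 4, 1, 0]]
print(cophenetic_correlation(D))
```

A coefficient close to 1 means the dendrogram preserves the original distances well; lower values indicate a less faithful tree.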
  • a confirmation of the robustness of a dendrogram can be obtained by comparing its structure with that of other dendrograms obtained by different clustering methods (UPGMA, Minimum Variance, WPGMA, NJ, etc.).
  • the robustness of a dendrogram is in addition testable either by statistically analyzing the cophenetic distance measurements, or by comparing the proximity relations of the terminal buds of the dendrogram.
  • in a first step, it is tested whether the group of questioned texts (Q) is significantly distant from the two groups of reference texts (A and B), by known authors, with which it is compared.
  • for each pair of groups (QQ, QA, QB, AA, AB and BB), we calculate the mean of the distances between the text fragments of the two groups of the pair, with their standard deviation and their number (i.e. the number of text fragments). Then, for each group, we calculate its confidence interval
  • if this difference is greater than the sum of the confidence intervals, the statistical hypothesis H0 is accepted: the clusters Q1 and Q2 are distinct; there are therefore four clusters in the dendrogram considered (Q1, Q2, A and B). The experiment therefore does not make it possible to establish that Q1 and Q2 are by the same writer. In the opposite case (if this difference is less than the sum of the confidence intervals), the statistical hypothesis H0 is rejected: we can then affirm that Q1 and Q2 are by the same writer, with a probability of being wrong equal to the probability threshold chosen to calculate the confidence interval.
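  • The comparison of the two sub-clusters can be sketched as below. The inputs are assumed to be the cophenetic distances from the fragments of Q1, respectively Q2, to one reference group. The half-width z·s/√n, with z = 1.96 for a 5% threshold, is a normal-approximation assumption; the source does not state how the confidence interval is computed, and the function names are illustrative.

```python
from math import sqrt
from statistics import mean, stdev


def ci_half_width(distances, z=1.96):
    """Half-width of the confidence interval of the mean (assumed formula)."""
    return z * stdev(distances) / sqrt(len(distances))


def same_writer(d_q1, d_q2, z=1.96):
    """True if the difference between the two mean distances is smaller than
    the sum of the two confidence-interval half-widths."""
    gap = abs(mean(d_q1) - mean(d_q2))
    return gap < ci_half_width(d_q1, z) + ci_half_width(d_q2, z)


# Toy distances from each sub-cluster to the same reference group.
q1 = [1.0, 1.1, 0.9, 1.0]
q2 = [1.05, 0.95, 1.0, 1.1]
q3 = [5.0, 5.1, 4.9, 5.0]
print(same_writer(q1, q2), same_writer(q1, q3))
```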
  • the clustering of the group of questioned texts therefore amounts to partitioning all the questioned texts into at least two groups such that the stylometric distance between the members of a group is reduced.
  • the ASM calculates the coordinates of text extracts on N dimensions, where N is the number of dimensions necessary to reach a cumulative percentage of variance (eg, 90%). In other words, all the coordinates are used with a coefficient 1 for the N principal dimensions, which carry the discriminant signal, and 0 for the other dimensions, whose signal is noisy. In another embodiment, weighting coefficients are implemented to give more weight to the first dimensions, depending on their importance.
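  • The choice of N can be illustrated as follows, assuming the variance carried by each dimension (for example the eigenvalues of a PCA) is already available. The function names and the toy variances are hypothetical; only the 90% threshold and the 1/0 coefficients come from the text.

```python
def dims_for_variance(variances, threshold=0.90):
    """Smallest N such that the N largest variances reach `threshold`
    of the total variance."""
    total = sum(variances)
    cumulative = 0.0
    for n, v in enumerate(sorted(variances, reverse=True), start=1):
        cumulative += v
        if cumulative / total >= threshold:
            return n
    return len(variances)


def binary_weights(variances, threshold=0.90):
    """Coefficient 1 for the N principal dimensions, 0 for the others.
    Assumes `variances` is already sorted in decreasing order."""
    n = dims_for_variance(variances, threshold)
    return [1] * n + [0] * (len(variances) - n)


variances = [50, 30, 15, 5]  # toy values, in percent of total variance
print(dims_for_variance(variances), binary_weights(variances))
```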
  • the module 116 determines on the basis of the dendrogram whether the questioned text comes from the same author as one of the reference texts, or sets of reference texts, A, B.
  • the cophenetic correlation coefficient can be calculated and displayed.
  • several author attribution tests with several types of complementary statistical validations are made, using texts of the same nature (for example two texts from a blog, two threatening messages, etc.). These texts of the same nature serve as reference texts, from at least three known authors, and are collected for this purpose. For example, performing 10 independent tests (with 10 different reference writers) can reduce the probability of error by a factor of 10. In our example, this probability would decrease from 5% to 0.5%.
  • a dendrogram will be called perfect if its distribution is perfect, that is to say if it groups the texts of the supposed styles/authors into as many main branches as there are styles/authors.
  • Figure 9 illustrates in this respect an example of a perfect dendrogram.
  • the three supposed authors or styles A, B and C cluster along the three main branches of the dendrogram.
  • the distance between A1 and B1 is equal to the distance between A1 and B2, and that between A2 and B1, etc.
  • the relationship between the texts of a pair of authors is considered perfect if the distances between terminal buds of an author with terminal buds of the other author are identical.
  • a dendrogram will be called almost perfect if a branch carrying one style is carried within another branch of a different style.
  • Figure 10 thus illustrates a first example of an almost perfect dendrogram.
  • the texts of the author A are carried by the branch which bears the author B.
  • the distances between the texts of B are greater than the distances between the texts of A.
  • Figure 11 illustrates another example of an almost perfect dendrogram.
  • the texts of the author B are carried by the branch which bears the author A.
  • the relation between the texts of a pair of authors is considered almost perfect if the maximum distance between terminal buds of one author's texts is smaller than the minimum distance between terminal buds of the other author's texts.
  • a dendrogram will be called entangled in all other cases.
  • FIG. 12 illustrates an example of a dendrogram with partial entanglement.
  • the texts of the authors B and C are entangled. Neither the texts of the author B nor those of the author C are found carried
  • FIG. 13 illustrates an example of a dendrogram with a generalized entanglement.
  • the texts of the three authors are entangled.
  • none of the authors A, B or C has its texts carried exclusively by a single branch.
  • the relationship between the texts of a pair of authors is considered entangled if the two conditions
  • an entangled dendrogram contains at least one pair of authors with an entangled relation
  • an almost-perfect dendrogram contains no pair with an entangled relation but at least one pair with an almost perfect relation
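  • The three categories can be checked automatically from a matrix of cophenetic distances, as in the sketch below. It is one possible reading of the criteria given in the text: a pair of authors is "perfect" when all distances between their terminal buds are identical, "almost perfect" when every distance within one author is smaller than every distance within the other, and "entangled" otherwise. The function names, the tolerance and the toy matrices are assumptions.

```python
from itertools import combinations, product


def pair_relation(coph, a, b, tol=1e-9):
    """Relation between two authors, given the leaf indices of their texts."""
    cross = [coph[i][j] for i, j in product(a, b)]
    if max(cross) - min(cross) <= tol:
        return "perfect"
    intra_a = [coph[i][j] for i, j in combinations(a, 2)]
    intra_b = [coph[i][j] for i, j in combinations(b, 2)]
    if intra_a and intra_b and (
        max(intra_a) < min(intra_b) or max(intra_b) < min(intra_a)
    ):
        return "almost perfect"
    return "entangled"


def dendrogram_class(coph, authors):
    """`authors` maps an author label to the leaf indices of that author."""
    relations = {
        pair_relation(coph, authors[x], authors[y])
        for x, y in combinations(sorted(authors), 2)
    }
    if "entangled" in relations:
        return "entangled"
    if "almost perfect" in relations:
        return "almost perfect"
    return "perfect"


authors = {"A": [0, 1], "B": [2, 3]}
perfect = [[0, 1, 4, 4], [1, 0, 4, 4], [4, 4, 0, 1], [4, 4, 1, 0]]
nested = [[0, 1, 2, 3], [1, 0, 2, 3], [2, 2, 0, 3], [3, 3, 3, 0]]
tangled = [[0, 3, 1, 3], [3, 0, 3, 1], [1, 3, 0, 3], [3, 1, 3, 0]]
for name, m in (("perfect", perfect), ("nested", nested), ("tangled", tangled)):
    print(name, "->", dendrogram_class(m, authors))
```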
  • the examination of the dendrograms can be done automatically for example by comparing the structures or the distances between the nodes or branches of the dendrograms.
  • a preliminary automation step consists in verifying the initial hypothesis, namely that a series of texts is attributed to each author. This hypothesis is validated if each main branch carries exclusively the texts of one author. An automated measurement of the distances between leaf nodes makes it possible to evaluate the relevance of the initial hypothesis: the terminal buds of a main branch will generally be separated by shorter distances than those between a terminal bud of one main branch and a terminal bud of another main branch. This distance-based validation of the distribution of an author's texts on a main branch holds for the majority of dendrograms. One type of dendrogram, the ultrametric dendrogram, allows a strict verification of this last proposition.
  • a UPGMA dendrogram is ultrametric because it is rooted and the distances between its root and its terminal buds are identical. This ultrametric property makes it possible to strictly automate the examination of UPGMA dendrograms, for example by comparing all the distances between terminal buds for each pair of authors.
  • a multiple comparison experiment can be made from measurements of the distances to the centroids defined for the sequences of each author. A score can be established.
  • the method can be used not only to authenticate the alleged author of a short text (i.e. to check whether he or she is the real author), but also to identify the author of a text that is anonymous or signed by another person. For this purpose, it is possible, starting from a few texts, to search a collection of texts for those that are closest to reference texts (e.g. texts of suspects previously identified in a forensic application).
  • the method of the invention makes it possible to determine whether a message (short text) can be attributed to a known author for whom at least one other text, short or long, is known. For example, it allows people who receive messages from a person (for example Twitter followers, subscribers of other social networks, or e-mail recipients) to make sure that the short messages they read come from the supposed author who signed the message, and not from a usurper.
  • This procedure can be repeated to compare a questioned message with some messages from a supposed usurper and with some messages from a reference author. If one of these three-way comparisons (unknown, usurper, reference) classifies the questioned message with those of the usurper, the message is attributed, with a certain probability, to this usurper.
  • the method may be used in anti-spam or anti-phishing software to determine, possibly together with other methods, the likelihood that the message comes from a usurper.
  • the usurper can be a spammer.
  • the compared messages can relate to very different subjects, the approach being independent of the specific vocabulary used.
  • the messages are preferably of the same nature - for example all e-mails, or smear messages.
  • Figure 6 is taken from an example with three authors of dummy letters, each having produced two letters of about 500 and 1750 characters.
  • the author questioned (bottom left group) in this test has also produced a document of a hundred characters only (squares at the bottom left of the figure). These texts were cut to a preferred size of about 500 characters, with a coverage of three.
  • FIG. 6 represents the first two dimensions of an ASM on trigrams taken from the text fragments obtained after cutting at about 500 characters with overlap (degree of coverage of 3).
  • the resulting coordinate matrix of this ASM is stored in a table.
  • FIG. 7 is established from an ASM and illustrates the distance of each text fragment to the barycenters of the three clusters visible on this ASM.
  • The figure shows on the X axis the number of the fragment and on the Y axis the distance from this fragment to the representative point. For example, the first 15 fragments are closest to the center of the lower-left cluster and are part of this cluster.
  • This diagram makes it possible to identify the misplaced points of a cluster because they are closer to the center of gravity of another cluster. It is therefore possible to calculate the proportion of misplaced points from the data in this graph and to determine the probability of
  • Figure 8 shows the dendrogram obtained from the matrix of coordinates from an ASM.
  • There are three main branches (clusters) containing the fragments of the texts, in the following order from top to bottom: 88 (cluster on the bottom left), 95 (cluster on the top left) and 90 (cluster on the right).
  • This non-hierarchical clustering dendrogram validates the existence and the clear separation of the three clusters, corresponding to three authors.
  • the dendrogram refinement technique measures the statistical robustness of the results of this dendrogram.
  • This non-hierarchical clustering dendrogram therefore groups the very short text of 130 characters (0088R2.txt1) with the other fragments from the two texts 0088L and 0088C, which together form the cluster at the bottom left.
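
The fragment-based analysis described for Figures 6 and 7 can be illustrated by the following minimal sketch. It is not the implementation of the invention: the function names (`cut_windows`, `trigram_profile`, `distance`, `barycenter`, `misplaced_proportion`) are hypothetical, the "degree of coverage of 3" is assumed to mean that consecutive windows start one third of a window apart, and distances are computed directly on relative trigram frequencies instead of on ASM coordinates, which is a simplification.

```python
from collections import Counter
from math import sqrt


def cut_windows(text, size=500, coverage=3):
    """Cut a text into windows of `size` characters.

    Assumption: a coverage of 3 means consecutive windows start
    size // 3 characters apart. Trailing characters that do not fill
    a complete window are dropped; a text no longer than `size` is
    returned as a single fragment.
    """
    if len(text) <= size:
        return [text]
    step = max(1, size // coverage)
    return [text[s:s + size] for s in range(0, len(text) - size + 1, step)]


def trigram_profile(fragment):
    """Relative frequency of each character trigram in a fragment."""
    counts = Counter(fragment[i:i + 3] for i in range(len(fragment) - 2))
    total = sum(counts.values())
    return {g: c / total for g, c in counts.items()} if total else {}


def distance(p, q):
    """Euclidean distance between two sparse trigram profiles."""
    keys = set(p) | set(q)
    return sqrt(sum((p.get(k, 0.0) - q.get(k, 0.0)) ** 2 for k in keys))


def barycenter(profiles):
    """Mean profile (center of gravity) of a group of fragments."""
    keys = set().union(*profiles)
    n = len(profiles)
    return {k: sum(p.get(k, 0.0) for p in profiles) / n for k in keys}


def misplaced_proportion(profiles, labels):
    """Share of fragments lying closer to another cluster's barycenter
    than to the barycenter of the cluster they are assigned to."""
    groups = {}
    for p, lab in zip(profiles, labels):
        groups.setdefault(lab, []).append(p)
    centers = {lab: barycenter(ps) for lab, ps in groups.items()}
    misplaced = 0
    for p, lab in zip(profiles, labels):
        nearest = min(centers, key=lambda c: distance(p, centers[c]))
        if nearest != lab:
            misplaced += 1
    return misplaced / len(profiles)
```

In such a sketch, each reference and questioned text would be passed through `cut_windows`, each fragment through `trigram_profile`, and `misplaced_proportion` would then be called with one author label per fragment. A low proportion suggests well separated clusters under the stated assumptions.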

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Probability & Statistics with Applications (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Machine Translation (AREA)
EP19724617.6A 2018-04-20 2019-04-12 Method and device for verifying the author of a short message Withdrawn EP3782054A1 (de)

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
CH5102018 2018-04-20
CH8352018 2018-07-04
PCT/IB2019/053037 WO2019202450A1 (fr) 2018-04-20 2019-04-12 Method and device for verifying the author of a short message

Publications (1)

Publication Number Publication Date
EP3782054A1 (de) 2021-02-24

Family

ID=66554450

Family Applications (1)

Application Number Title Priority Date Filing Date
EP19724617.6A Withdrawn EP3782054A1 (de) 2018-04-20 2019-04-12 Method and device for verifying the author of a short message

Country Status (3)

Country Link
US (1) US11640501B2 (de)
EP (1) EP3782054A1 (de)
WO (1) WO2019202450A1 (de)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20210174017A1 (en) * 2018-04-20 2021-06-10 Orphanalytics Sa Method and device for verifying the author of a short message

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR102665023B1 (ko) * 2021-12-20 2024-05-13 부산대학교 산학협력단 Method and apparatus for determining observation variables for a target system

Family Cites Families (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6460036B1 (en) * 1994-11-29 2002-10-01 Pinpoint Incorporated System and method for providing customized electronic newspapers and target advertisements
US5758257A (en) * 1994-11-29 1998-05-26 Herz; Frederick System and method for scheduling broadcast of and access to video programs and other data using customer profiles
US7421418B2 (en) * 2003-02-19 2008-09-02 Nahava Inc. Method and apparatus for fundamental operations on token sequences: computing similarity, extracting term values, and searching efficiently
US9880995B2 (en) 2006-04-06 2018-01-30 Carole E. Chaski Variables and method for authorship attribution
US9495358B2 (en) * 2006-10-10 2016-11-15 Abbyy Infopoisk Llc Cross-language text clustering
WO2011139687A1 (en) * 2010-04-26 2011-11-10 The Trustees Of The Stevens Institute Of Technology Systems and methods for automatically detecting deception in human communications expressed in digital form
US20120254333A1 (en) * 2010-01-07 2012-10-04 Rajarathnam Chandramouli Automated detection of deception in short and multilingual electronic messages
US20190050388A1 (en) * 2016-02-22 2019-02-14 Orphanalytics Sa Method and device for detecting style within one or more symbol sequences
US11093476B1 (en) * 2016-09-26 2021-08-17 Splunk Inc. HTTP events with custom fields
US11164239B2 (en) * 2018-03-12 2021-11-02 Ebay Inc. Method, system, and computer-readable storage medium for heterogeneous data stream processing for a smart cart
EP3782054A1 (de) * 2018-04-20 2021-02-24 Orphanalytics SA Method and device for verifying the author of a short message

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20210174017A1 (en) * 2018-04-20 2021-06-10 Orphanalytics Sa Method and device for verifying the author of a short message
US11640501B2 (en) * 2018-04-20 2023-05-02 Orphanalytics Sa Method and device for verifying the author of a short message

Also Published As

Publication number Publication date
US20210174017A1 (en) 2021-06-10
US11640501B2 (en) 2023-05-02
WO2019202450A1 (fr) 2019-10-24

Similar Documents

Publication Publication Date Title
Stamatatos et al. Clustering by authorship within and across documents
CN109635296B (zh) New word mining method, apparatus, computer device and storage medium
Stamatatos Authorship Verification: A Review of Recent Advances.
EP3420468A1 (de) Method and device for detecting style within one or more symbol sequences
Bouazizi et al. Sentiment analysis in twitter: From classification to quantification of sentiments within tweets
CN103336766A (zh) Short-text spam identification and modeling method and device
US11036818B2 (en) Method and system for detecting graph based event in social networks
CN110570199B (zh) User identity detection method and system based on user input behavior
Klyuev Fake news filtering: Semantic approaches
Wibowo et al. Comparison between fingerprint and winnowing algorithm to detect plagiarism fraud on Bahasa Indonesia documents
WO2019202450A1 (fr) Method and device for verifying the author of a short message
JP6237639B2 (ja) Information extraction system, information extraction method, and information extraction program
Chiruzzo et al. HAHA 2019 dataset: A corpus for humor analysis in Spanish
Hofmann et al. The reddit politosphere: a large-scale text and network resource of online political discourse
Ceballos Delgado et al. Deception detection using machine learning
Samory et al. Quotes reveal community structure and interaction dynamics
CN107229605A (zh) Text similarity calculation method and device
Ousirimaneechai et al. Extraction of trend keywords and stop words from thai facebook pages using character n-grams
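CN111062199B (zh) Method and device for identifying undesirable information
CN112016317A (zh) Artificial-intelligence-based sensitive word recognition method, device and computer equipment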
Orebaugh et al. Data mining instant messaging communications to perform author identification for cybercrime investigations
Tschuggnall et al. Reduce & attribute: Two-step authorship attribution for large-scale problems
Locker “Because the computer said so!”: Can computational authorship analysis be trusted?
CN103034657A (zh) Document summary generation method and device
Moreno-Sandoval et al. Celebrity profiling on twitter using sociolinguistic

Legal Events

Date Code Title Description
STAA Information on the status of an EP patent application or granted EP patent

Free format text: STATUS: THE INTERNATIONAL PUBLICATION HAS BEEN MADE

PUAI Public reference made under Article 153(3) EPC to a published international application that has entered the European phase

Free format text: ORIGINAL CODE: 0009012

STAA Information on the status of an EP patent application or granted EP patent

Free format text: STATUS: REQUEST FOR EXAMINATION WAS MADE

17P Request for examination filed

Effective date: 20201117

AK Designated contracting states

Kind code of ref document: A1

Designated state(s): AL AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HR HU IE IS IT LI LT LU LV MC MK MT NL NO PL PT RO RS SE SI SK SM TR

AX Request for extension of the European patent

Extension state: BA ME

DAV Request for validation of the European patent (deleted)
DAX Request for extension of the European patent (deleted)
STAA Information on the status of an EP patent application or granted EP patent

Free format text: STATUS: THE APPLICATION IS DEEMED TO BE WITHDRAWN

18D Application deemed to be withdrawn

Effective date: 20221101