FR3030812A1

FR3030812A1 - AUTOMATIC ANALYSIS OF THE LITERARY QUALITY OF A TEXT ACCORDING TO THE PROFILE OF THE READER

Info

Publication number: FR3030812A1
Application number: FR1554546A
Authority: FR
Inventors: Quentin Pleple
Original assignee: Short Edition
Current assignee: Short Edition
Priority date: 2014-12-22
Filing date: 2015-05-21
Publication date: 2016-06-24
Also published as: FR3030811A1; FR3030809A1; FR3030810A1

Abstract

The invention relates to an improvement to the method for analyzing the literary quality of a text, implemented by a computer program, according to patent application FR 14 63074. The improvement consists in taking the reader's profile into account.

Description

METHOD FOR AUTOMATICALLY ANALYZING THE LITERARY QUALITY OF A TEXT ACCORDING TO THE PROFILE OF THE READER

TECHNICAL FIELD

The present invention relates to a method for automatically analyzing the literary quality of a text according to the profile of the reader. In the context of the invention, the "quantified literary quality of a text" means the literary quality that is intrinsic to the text and that is expressed as a discrete mark from 1 to 10, as a continuous score in [0, 1], as a real-valued score in ]-∞, +∞[, or as labels such as "very good", "good", "average", etc.

In the context of the invention, the "known literary quality of a training text" means the mark given by a profile of expert readers to estimate the quality of that text.

STATE OF THE ART

In general, the aim of automatic text categorization is to teach a computer to place a text in the correct category on the basis of its content. Categorization algorithms can solve a variety of text categorization problems. As regards analyzing the quality of a literary or scientific text, various approaches have already been tried and various categorization algorithms implemented. Several works thus deal with the quality of a literary text, but most are not relevant because they define quality in a sense of their own, which is therefore not truly independent of the factors chosen.

One example is patent US7200606, in which quality is understood as relevance to a user query. Thus, one of the relevant approaches is the so-called intrinsic approach, which uses categorization algorithms to classify documents according to textual characteristics (indicators) that are intrinsic to the text: composition, elements of style, precision of the vocabulary with respect to a subject, construction of the reasoning, spelling, etc.

The sorting characteristics come from widely varying orthographic, lexical and stylistic approaches, including word length, regularity of the vocabulary, co-occurrence analysis, use of punctuation, detection of grammatical and spelling errors, ease of reading, lexical links with a theme or a genre, etc.

These text-related characteristics can usefully be supplemented by semantic methods built around the relationship between quality and compliance with spelling and typographic rules, grammar (quality measured on long n-grams), capitalization, text density (ratio of letters to spaces) or entropy (at word level, or even at character level).

Lexicometry, a method for the quantitative analysis of texts, can paradoxically prove a useful tool for measuring either quality or lack of quality. Whatever the methods and categorization algorithms adopted, the primary difficulty lies in choosing the indicators and the algorithm, and in combining them to evaluate the literary quality of a text.

Little of the literature addresses the literary quality of a text through an intrinsic approach. Publications [1] and [2] describe the extraction of intrinsic indicators from a raw literary text, followed by a regression or a classification to reach the target value to be determined. The choice of indicators remains rather basic, which prevents the quality analysis from being refined with high precision. Publication [3] discloses a prediction of quality from a small number of newspaper articles (from the "Wall Street Journal"). The analysis in that publication remains basic, since it only establishes a correlation between each indicator and a target value, computed on about thirty reference articles. On December 22, 2014, the applicant filed French patent application No. 14 63074, relating to a method for analyzing the literary quality of a text that provides better accuracy of the analysis.

There remains an unmet need to improve the analysis of literary quality according to the reader's profile, that is, according to the reader's tastes. The aim of the invention is to meet this need at least in part.

DISCLOSURE OF THE INVENTION

To this end, the subject of the invention is a method for analyzing the literary quality of a text according to the reader profile, implemented by a computer program, comprising the following steps:
a/ receiving a plurality of readers, called training readers;
a'/ receiving a plurality of texts, called training texts;
b/ extracting, for each training reader and for the training texts, the reader's own indicators, namely the number of texts read, the ratio between the reader's expressions of interest and the number of texts read, and the means of the marks given to the texts read;
c/ generating a vector representation of each training reader from that reader's own indicators;
d/ submitting the vector representations of the training readers to a data clustering classifier so as to obtain groups of training readers;
e/ generating a vector representation for each group of training readers by averaging the vector representations of the readers in the group;
b'/ extracting the numerical indicators of each training text;
c'/ generating a vector representation of each training text from its numerical indicators;
f/ for each group of readers, learning the relationship between, on the one hand, the components of the vector representation of the training texts and of the vector representations of the groups of readers and, on the other hand, the known literary quality of each training text, so as to obtain a predictive model of literary quality as a function of the groups of readers;
g/ receiving a new text to be analyzed;
h/ applying to the new text, for a given group of readers, the predictive model built in step d/, so as to obtain the literary quality of the new text according to the given group of readers.

According to the invention, data clustering yields groups of readers, each of which therefore defines a reader profile. The inventor started from the observation that the invention according to the aforementioned patent application FR 14 63074 does improve the accuracy of the literary analysis of a text, but is not fully satisfactory because it does not provide a text quality that depends on the reader's profile, that is, on the reader's tastes. The invention therefore essentially consists in taking the reader profile into account when predicting the quality of a literary text. The invention also relates to a computer program implementing the method described above.

DETAILED DESCRIPTION

Other advantages and features of the invention will become clearer on reading the detailed description of exemplary embodiments of the invention, given by way of non-limiting illustration with reference to the following figures:
- Figure 1: schematic view of the vector representation of a group of training readers according to their profile;
- Figure 2: flowchart of the learning steps of the method according to the invention, implemented by a computer program;
- Figure 3: flowchart of the prediction steps for the analysis of subjective literary quality in the method according to the invention, also implemented by a computer program.

In what follows, the terms "algorithm" and "computer program" are used interchangeably, the program being the computer-readable encoding of the algorithm. An algorithm is thus an execution plan for a computer. The computer takes input data, applies the processing described by the algorithm and returns a result to the user. In the context of the invention, the algorithm used for the predictive analysis is a machine learning algorithm. The decision rules of such an algorithm are not fixed at design time: it is designed so that it can modify its decision rules according to the data it sees.

The method according to the invention comprises three successive phases: a data clustering phase, then a learning phase, and finally a prediction phase.

The clustering phase is carried out first. It starts from all the metadata attached to each training text, for example:
- the literary genre: romance, drama, crime fiction, haiku, alexandrine...
- the age of the text's reader: 6, 8, 10... years;
- the emotions expressed, that is, those felt by the characters in the texts.
As regards the emotions considered, the following six are preferably chosen: happiness, affection, interest, sadness, melancholy, anger and fear.

For a given reader, measures of the reader's taste, that is, indicators specific to that reader, are then taken into account for each metadata item of each training text. These reader-specific indicators are:
- the number of such texts read;
- the ratio between the reader's expressions of interest ("likes") and the number of texts read;
- the mean of the reader's marks.
The algorithm then generates a vector representation of each training reader from these indicators. The training readers are then gathered into n groups using a standard data clustering technique, by proximity of their vector representations. By way of example, n is equal to 20. Proximity is to be understood in the mathematical sense: it is obtained by computing the cosine between the vector representations. For each group of readers, which thus determines a reader profile, a vector representation of the group is generated by averaging the vector representations of the readers in the group (step S1).
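The clustering phase described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the text only asks for "a standard data clustering technique" with cosine proximity, so the k-means-style loop, the naive seeding with the first readers, the function names and the toy reader data are all assumptions. The example uses 2 groups instead of the 20 mentioned in the text, and one set of three indicators per reader rather than one set per metadata item.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def mean_vector(vectors):
    """Component-wise mean: the group representation of step S1."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def cluster_readers(vectors, n_groups, n_iter=20):
    """Assign each reader to the centroid with the highest cosine (assumed scheme)."""
    centroids = [list(v) for v in vectors[:n_groups]]  # naive, deterministic seeding
    labels = [0] * len(vectors)
    for _ in range(n_iter):
        labels = [max(range(n_groups), key=lambda g: cosine(v, centroids[g]))
                  for v in vectors]
        for g in range(n_groups):
            members = [v for v, lab in zip(vectors, labels) if lab == g]
            if members:  # keep the previous centroid if a group becomes empty
                centroids[g] = mean_vector(members)
    return labels, centroids

# Hypothetical readers: [texts read, likes / texts read, mean mark]
readers = [
    [10.0, 0.9, 8.0],
    [1.0, 0.1, 9.0],
    [12.0, 0.8, 7.5],
    [1.0, 0.2, 8.5],
]
labels, centroids = cluster_readers(readers, n_groups=2)
```

Because cosine compares directions rather than magnitudes, indicators on very different scales (a count of texts next to a ratio in [0, 1]) would probably need rescaling in practice; the sketch leaves them raw for brevity.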

In parallel, the algorithm carries out the following steps on the training texts to be analyzed. The numerical indicators of the training texts are extracted (step S'0). A vector representation of each training text is then generated from its numerical indicators (step S'1). To build the vector representation from a raw text to be analyzed, the algorithm can advantageously proceed as follows. It generates several vector sub-representations of the received text in order to obtain low-level indicators.

The first sub-representation is a bag-of-words representation, in which the distribution of each word is analyzed, together with the distributions of certain unigrams, bigrams, 3-grams, 4-grams, 5-grams and 6-grams at word level and at character level. In this step, the text is thus transformed into a sequence of tokens using regular expressions for splitting. The bag-of-words representation does not take into account the formatting of the text, the order of the words, their meaning, or the relationships structured by linking words.

The second sub-representation captures the morphosyntactic structure: the parameters of the distributions of grammatical words in the text are computed, and the distribution of each syntactic function is analyzed over the text, the paragraphs, the sentences and the clauses. Grammatical words are articles, prepositions and non-qualifying adjectives. The parameters of the distribution of grammatical words are computed from criteria chosen among the mean, the variance, the standard deviation, the entropy, the distance between distributions, or a combination of these. A syntactic function is a verb, a noun, an adjective, an adverb, a determiner or a preposition. This step thus extracts structural elements of the text without going as far as the pragmatic level of overall comprehension of the text.

The third sub-representation captures writing errors: for each rule in each category of writing error, the number of times the rule is broken is counted. Writing errors are errors of spelling, grammar, conjugation, anglicism, syntax, expression and usage. This step thus automatically analyzes the different types of error appearing in the text.

The fourth sub-representation captures stylometry: the length of the text, the length of the paragraphs, the length of the sentences, the length of the clauses, the length of the words in characters, the count of each punctuation mark, and finally the parameters of the distribution of dialogue in the text are computed. The length of the text is computed from the number of paragraphs, sentences, clauses, words and characters. The length of a paragraph is computed from the number of sentences, clauses, words and characters. The length of a sentence is computed from the number of clauses, words and characters. The length of a clause is computed from the number of words and characters. The parameters of the distribution of dialogue in the text are computed from criteria chosen among the mean, the variance, the standard deviation, the entropy, the distance between distributions, or a combination of these. This step thus identifies the style of the text.
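To make the low-level indicators more concrete, here is a minimal sketch covering parts of the first and fourth sub-representations: regular-expression tokenization, word n-gram counts, distribution parameters (mean, variance, standard deviation, entropy) and a few stylometric counts. The regular expressions, function names and sample sentence are assumptions chosen for illustration; the source does not specify them. Morphosyntactic tagging and error detection are left out because they need external linguistic resources.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Split text into lowercase word tokens with a regular expression."""
    return re.findall(r"\w+", text.lower())

def ngram_counts(tokens, n):
    """Count word n-grams (runs of n consecutive tokens)."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def distribution_params(values):
    """Mean, variance, standard deviation and Shannon entropy (bits).

    Assumes a non-empty list of values.
    """
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    freq = Counter(values)
    entropy = -sum((c / n) * math.log2(c / n) for c in freq.values())
    return {"mean": mean, "variance": var, "std": math.sqrt(var), "entropy": entropy}

def stylometry(text):
    """A few length and punctuation indicators for one text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words_per_sentence = [len(tokenize(s)) for s in sentences]
    return {
        "n_sentences": len(sentences),
        "n_words": len(tokenize(text)),
        "n_commas": text.count(","),
        "sentence_length": distribution_params(words_per_sentence),
    }

sample = "The cat sat. The cat ran, then the cat slept!"
tokens = tokenize(sample)
bigrams = ngram_counts(tokens, 2)
style = stylometry(sample)
```

Character-level n-grams can be obtained with the same `ngram_counts` helper by passing `list(text)` instead of word tokens.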

From all the preceding sub-representations, the algorithm generates a fifth sub-representation, a meta-description, in which the vocabulary of the text is analyzed in terms of the different levels of word rarity, the lexical fields used and the words suited to young readers, and in which aggregations (sums) and ratios (divisions) of the low-level indicators obtained previously are computed.

An example of an aggregation is given below, computed from the following low-level indicators:
- NIN = number of verbs in the infinitive
- NPR = number of verbs in the present tense
- NFU = number of verbs in the future tense
- NPA = number of verbs in the past tense.
The aggregation gives an intermediate-level indicator NV, the total number of verbs: NV = NIN + NPR + NFU + NPA.

An example of a ratio is given below, computed from the following indicators:
- NP = number of sentences
- NV = number of verbs.

The ratio gives an intermediate-level indicator NM, the mean number of verbs per sentence: NM = NV / NP. This step thus yields meta-descriptions such as readability, breadth of vocabulary or lexical cohesion.
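The aggregation and the ratio above translate directly into code. In this short sketch the function name and the sample counts are hypothetical, and returning 0.0 for a text with no sentences is an assumption the source does not address.

```python
def verb_indicators(n_in, n_pr, n_fu, n_pa, n_p):
    """Return (NV, NM) from the low-level verb and sentence counts."""
    nv = n_in + n_pr + n_fu + n_pa    # aggregation: NV = NIN + NPR + NFU + NPA
    nm = nv / n_p if n_p else 0.0     # ratio: NM = NV / NP (0.0 if NP is 0, by assumption)
    return nv, nm

# Hypothetical counts for one text
nv, nm = verb_indicators(n_in=4, n_pr=10, n_fu=1, n_pa=5, n_p=8)
```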

From the bag-of-words sub-representation, the algorithm generates a sixth sub-representation, which represents the lexical fields present in the text, by principal component analysis (PCA) and/or latent semantic analysis (LSA) and/or non-negative matrix factorization (NMF). This is therefore a dimensionality-reduction step that yields lexical fields. When these three analyses produce too many lexical fields, the algorithm performs an additional dimensionality-reduction step. This step pools all the lexical fields and keeps only a limited number of them, so that the ones kept are unique and relevant. In other words, if the components of the vector generated in the preceding step are redundant, this step selects its non-redundant components. Once all the vector sub-representations have been generated, the algorithm concatenates them into a final representation of the text.
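
The description does not state how redundant lexical-field components are detected. The sketch below is one possible reading, under the assumption that two components are redundant when their values across the training texts are strongly correlated; the greedy selection, the 0.95 threshold and all function names are hypothetical. It also shows the final concatenation of sub-representations.

```python
import math


def pearson(u, v):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    norm_u = math.sqrt(sum((a - mean_u) ** 2 for a in u))
    norm_v = math.sqrt(sum((b - mean_v) ** 2 for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return cov / (norm_u * norm_v)


def select_non_redundant(components, threshold=0.95):
    """Greedily keep the indices of components not strongly correlated with any kept one.

    components: list of columns, one per lexical field, each holding that
    field's value for every training text.
    """
    kept = []
    for index, column in enumerate(components):
        if all(abs(pearson(column, components[k])) < threshold for k in kept):
            kept.append(index)
    return kept


def concatenate(sub_representations):
    """Join several vector sub-representations into one final vector."""
    return [value for sub in sub_representations for value in sub]


# Three pooled lexical-field components over four texts; the second is a
# rescaled copy of the first and is therefore treated as redundant.
fields = [
    [0.1, 0.4, 0.9, 0.2],
    [0.2, 0.8, 1.8, 0.4],
    [0.9, 0.1, 0.3, 0.7],
]
kept_indices = select_non_redundant(fields)
final_vector = concatenate([[1.0, 2.0], [3.0], [4.0, 5.0]])
```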

The algorithm then learns the relationship between, on the one hand, the components of the vector representation of the training texts and the vector representations of the reader groups and, on the other hand, the known literary quality of each training text, so as to obtain a predictive model of literary quality as a function of the reader groups (step S2).
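
The description does not name a particular learning algorithm for step S2. Purely as a sketch, the example below assumes a nearest-neighbour regressor over the concatenation of a text vector and a group vector; the function names, the data layout and the choice of learner are all hypothetical stand-ins for whatever model is actually used.

```python
import math


def build_training_set(text_vectors, group_vectors, known_quality):
    """Pair each (text, group) feature vector with its known literary quality.

    known_quality maps (text_id, group_id) to the quality of that text for that group.
    """
    return [
        (text_vectors[text_id] + group_vectors[group_id], quality)
        for (text_id, group_id), quality in known_quality.items()
    ]


def predict_quality(training_set, text_vector, group_vector, k=1):
    """Average the known quality of the k training examples nearest to the query."""
    query = text_vector + group_vector
    nearest = sorted(training_set, key=lambda pair: math.dist(pair[0], query))[:k]
    return sum(quality for _features, quality in nearest) / len(nearest)


# Toy data: two training texts, two reader groups with opposite tastes.
text_vectors = {"t1": [0.0, 1.0], "t2": [1.0, 0.0]}
group_vectors = {"g1": [0.0], "g2": [1.0]}
known_quality = {
    ("t1", "g1"): 0.8, ("t1", "g2"): -0.5,
    ("t2", "g1"): -0.3, ("t2", "g2"): 0.9,
}
training_set = build_training_set(text_vectors, group_vectors, known_quality)

# A new text resembling t2 is scored differently for each group.
new_text = [0.9, 0.1]
score_g1 = predict_quality(training_set, new_text, group_vectors["g1"])
score_g2 = predict_quality(training_set, new_text, group_vectors["g2"])
```

Feeding the group vector in as part of the input is what lets a single model return a different quality for the same text depending on the reader group.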

To obtain the known literary quality of each training text, the population of reading experts (training readers) is considered. Each reader gives a rating to each literary training text. The ratings of a given group obtained by the preceding clustering measure the literary quality of each text for that group, and they are standardized (centered, then scaled) according to the equation x' = (x - m) / s, where: x is the rating between 1 and 10 given by an individual M for a work; m is the mean of the ratings given by M; s is the standard deviation of the ratings given by M; and x' is the corrected rating.
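
The correction x' = (x - m) / s can be sketched as below. The description does not say whether s is the population or the sample standard deviation; the sketch assumes the population form, and the function name and the handling of a reader who gives identical ratings everywhere are likewise assumptions.

```python
from statistics import mean, pstdev


def standardize_ratings(ratings):
    """Center and scale one reader's ratings: x' = (x - m) / s.

    ratings: dict mapping a work identifier to the rating (1 to 10) given by
    a single reader M. Returns a dict of corrected ratings x'.
    """
    values = list(ratings.values())
    m = mean(values)    # mean of the ratings given by M
    s = pstdev(values)  # population standard deviation (an assumption)
    if s == 0:
        # The reader rated every work identically: no preference is expressed.
        return {work: 0.0 for work in ratings}
    return {work: (x - m) / s for work, x in ratings.items()}


# A severe reader whose ratings range from 2 to 6.
corrected = standardize_ratings({"text_a": 2, "text_b": 4, "text_c": 6})
```

Standardizing per reader puts a severe rater and a generous rater on the same scale before their ratings are compared within a group.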

Thus, x' quantifies the known literary quality for a given group of experts or training readers. To predict the literary quality of a new text, the algorithm proceeds as follows. The predictive model built previously is applied to the new text and to a given group of readers, so as to obtain the literary quality of the new text according to that group of readers (step S3). The invention just described thus makes it possible to obtain, accurately and reliably, the literary quality of any text according to a reader profile, the reader profile being determined beforehand by clustering the training readers.

Many variants and improvements can be envisaged without departing from the scope of the invention.

REFERENCES CITED

[1]: Lecluze et al., "DEFT2014, analyse automatique de textes littéraires et scientifiques en langue française", 21ème Traitement Automatique des Langues Naturelles, Marseille, 2014.

[2]: Benamara et al., "Catégorisation sémantique fine des expressions d'opinion pour la détection de consensus", 21ème Traitement Automatique des Langues Naturelles, Marseille, 2014.

[3]: Pitler et al., "Revisiting Readability: A Unified Framework for Predicting Text Quality", in Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP '08), Association for Computational Linguistics, Stroudsburg, PA, USA, 186-195.

Claims

1. A method for analyzing the literary quality of a text according to the reader's profile, implemented by a (micro)computer processor, comprising the following steps:
a/ receiving a plurality of readers, called training readers, and receiving a plurality of texts, called training texts;
b/ extracting, for each training reader and for the training texts, the reader's own indicators, namely the number of texts read, the ratio between the reader's expressions of interest and the number of texts read, and the averages of the ratings given to the texts read;
c/ generating a vector representation of each training reader from that reader's own indicators;
d/ submitting the vector representations of the training readers to a data-partitioning (clustering) classifier so as to obtain groups of training readers;
e/ generating a vector representation for each group of training readers by averaging the vector representations of the readers in the group;
b/ extracting the numerical indicators of each training text;
c/ generating a vector representation of each training text from its numerical indicators;
f/ for each group of readers, learning the relationship between the components of the vector representation of the training texts and the vector representations of the reader groups, on the one hand, and the known literary quality of each training text, on the other, so as to obtain a predictive model of literary quality as a function of the reader groups;
g/ receiving a new text to analyze;
h/ applying to the new text, for a given group of readers, the predictive model built in step f/, so as to obtain the literary quality of the new text according to the given group of readers.

2. The analysis method according to claim 1, comprising, for generating the vector representation of a training text according to step c/, the following steps:
c1/ generating several vector sub-representations of the received text to obtain indicators, called low-level indicators, the sub-representations consisting of:
- a bag-of-words representation, in which the distribution of each word is analyzed, together with the distributions of certain unigrams, bigrams, 3-grams, 4-grams, 5-grams and 6-grams at the word and character level;
- a so-called morphosyntactic structure representation, in which the parameters of the distributions of the grammatical words in the text are calculated and the distributions of each syntactic function in the text, paragraphs, sentences and clauses are analyzed;
- a representation of writing errors, in which the number of times each rule of each category of writing error is broken is calculated;
- a stylometric representation, in which the length of the text, the length of the paragraphs, the length of the sentences, the length of the clauses, the length of the words in characters, the count of each punctuation mark and, finally, the parameters of the distribution of dialogue in the text are calculated;
c2/ generating:
- a meta-description, in which the vocabulary of the text is analyzed through the different levels of word rarity, the lexical fields used and the words suited to young readers, and aggregations and ratios of the low-level indicators obtained in c1/ are calculated;
- a representation of the lexical fields present in the text, from the bag-of-words representation produced in c1/, by principal component analysis (PCA) and/or latent semantic analysis (LSA) and/or non-negative matrix factorization (NMF);
c3/ concatenating the vector sub-representations generated in c1/ and c2/.

3. A computer program comprising program code instructions for executing the steps of the method according to the preceding claim when said program is run on a computer.