ES2320511B1

ES2320511B1 - NEW METHOD TO DETERMINE THE REPRESENTATIVITY OF A CORPUS.

Info

Publication number: ES2320511B1
Application number: ES200603157A
Authority: ES
Inventors: Gloria Corpas Pastor; Miriam Seghiri Dominguez; Romano Maggi
Original assignee: Universidad de Malaga
Current assignee: Universidad de Malaga
Priority date: 2006-12-05
Filing date: 2006-12-05
Publication date: 2010-02-03
Anticipated expiration: 2026-12-05
Also published as: ES2320511A1

Abstract

New method to determine the representativeness of a corpus.

The present invention provides an effective way to determine a posteriori the minimum size of a corpus or textual collection, regardless of the language or text type of that collection. It thereby establishes the minimum threshold of representativeness by means of an algorithm (N-Cor) that analyzes lexical density as a function of the incremental growth of the corpus. On this basis, a computer implementation has been developed as a Java application called ReCor. The system has the following main classes: a) Palabras (computation algorithm, file reading and writing); b) Gui (user interface); and c) VentanaGrafica (adapter for the graphical representation).

Description


Technical sector

The present invention relates to a computer-implemented data processing method, particularly for linguistic data and information.

State of the art

Representativeness remains one of the most controversial issues in corpus linguistics. For specialized corpora, which are usually much smaller than so-called "general" or "reference" corpora, representativeness is truly key; indeed, it is one of their defining characteristics.

Leaving aside the fact that the representativeness of a corpus depends, first of all, on applying appropriate external and internal design criteria, in practice the minimum size that a specialized corpus must have has not yet been quantified objectively. There is no consensus on the minimum number of documents or words that a given corpus must contain in order to be considered valid and representative of the population it is meant to represent. The figures vary dramatically from one author to another.

For Biber (1995. Dimensions of Register Variation: A cross-linguistic comparison. Cambridge University Press), 1,000 words and 10 documents are sufficient to ensure the representativeness of a specialized corpus; according to Friedblicher and Friedblicher (2000. The Argument for Using English Specialized Corpora to Understand Academic and Professional Language. Discourse in the Professions: Perspectives From Corpus Linguistics. John Benjamins), the size ranges from 500,000 to 5,000,000 words; while McEnery and Wilson (2006 [2000]. ICT4LT Module 3.4. Corpus Linguistics. <http://www.ict4lt.org/en/en_mod3-4.htm> [09/11/2006]) set the limit at 1,000,000 words. None of these figures, however, solves the problem of calculating the representativeness of a corpus, since they are established a priori and lack an objective, measurable and quantifiable basis.

Detailed description of the invention

The present invention provides an effective way to determine a posteriori the minimum size of a corpus or textual collection, regardless of the language or text type of that collection. It thereby establishes the minimum threshold of representativeness by means of an algorithm (N-Cor) that analyzes lexical density as a function of the incremental growth of the corpus.

On this basis, a computer implementation has been developed as a Java application called ReCor. The system has the following main classes: a) Palabras (computation algorithm, file reading and writing); b) Gui (user interface); and c) VentanaGrafica (adapter for the graphical representation).

Description of the drawings

Figure 1: Life cycle of system use

Figure 2: Example graphical representations A and B

Figure 3: Implementation of the graphics window

Figure 4: OrdenFrecuencia class, which defines the compareTo method of the Comparable interface

Figure 5: Gui class for creating the graphical user interface

Figure 6: Palabra class for corpus analysis

Figure 7: Controlador class, which specifies the action associated with each event

Figure 8: Ruta class with the main method

Embodiments of the invention

N-Cor algorithm

As set out above, the present method calculates the minimum size of a corpus by analyzing lexical density (d) in relation to the incremental increases of the corpus (C), document by document, as shown in the following equation:

[Equation 1 not reproduced in this text.]

To this end, all the files that make up the corpus are analyzed gradually, extracting information on the frequency of the types (word types) and the tokens (occurrences) of each file in the corpus. Two file selection criteria are used in this operation, namely alphabetical order and random order,

[Equation 2 not reproduced in this text.]

where:

Ty: the types, that is, the number of distinct words seen up to that point.

To: the tokens, that is, the total number of words seen up to that point.

N: the number of documents that make up the corpus.
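A reading that is consistent with the definitions above, and with the Ty/To quotient plotted on the vertical axis of the graphs, can be written as follows. The indexed notation (d_k, Ty_k, To_k) is an assumption introduced here for illustration and is not taken from the original equations.

```latex
% Assumed notation: Ty_k and To_k are the cumulative types and tokens
% after the first k documents have been processed.
d_k = \frac{Ty_k}{To_k}, \qquad k = 1, 2, \dots, N

% Worked example with invented figures:
% document 1 contributes 100 tokens and 60 types:   d_1 = 60 / 100 = 0.6
% document 2 adds 100 tokens and 20 new types:      d_2 = 80 / 200 = 0.4
```

Under this reading, each document adds one point (k, d_k) to the curve, and the curve falls whenever a document adds proportionally fewer new types than tokens.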

The life cycle of system use can be as follows (Figure 1). Each file in the corpus must be uniquely identified by an alphanumeric code name (for example, A001, A002, A003 ... A00n). The algorithm operates first in alphabetical order and then in random order, to ensure that the order in which the files are selected does not affect the result.

When the documents are selected in alphabetical order, the algorithm analyzes the first file (for example, A001) and calculates its tokens (To), its types (Ty) and the corresponding lexical density. This already yields one point of the graphical representation to be produced. Next, following the same selection criterion, the next document in the corpus is taken (for example, A002) and To and Ty are calculated again for it, but the results are added to the Ty and To of the previous iteration (in this case, those of the first document analyzed); the lexical density is then calculated, which yields a second point of the graphical representation. The algorithm continues until all the documents in the corpus under study have been processed, the last being A00n in this example. The second phase of the analysis takes the documents in random order, for example A003 first, then A00n, and so on until all the documents in the corpus have been analyzed.
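The life cycle described above can be sketched as follows. This is a minimal illustration and not the ReCor source: the class name NCorSketch, the method names, and the tokenization (splitting on whitespace and lower-casing) are assumptions, since the text does not specify how words are delimited.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Random;
import java.util.Set;

/** Illustrative sketch only; names are hypothetical and not taken from ReCor. */
final class NCorSketch {

    /** Assumed tokenization: lower-case, split on whitespace, drop empty strings. */
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.trim().toLowerCase().split("\\s+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    /** Returns one cumulative Ty/To point per document, in the order given. */
    static double[] cumulativeDensity(List<String> documents) {
        Set<String> types = new HashSet<>(); // distinct words seen so far (Ty)
        long tokenCount = 0;                 // total words seen so far (To)
        double[] points = new double[documents.size()];
        for (int i = 0; i < documents.size(); i++) {
            for (String token : tokenize(documents.get(i))) {
                types.add(token);
                tokenCount++;
            }
            points[i] = tokenCount == 0 ? 0.0 : (double) types.size() / tokenCount;
        }
        return points;
    }

    public static void main(String[] args) {
        // Documents are assumed to be already sorted by file name (A001, A002, ...).
        List<String> alphabetical = Arrays.asList("a b c d", "a b c e", "a b c d");
        List<String> random = new ArrayList<>(alphabetical);
        Collections.shuffle(random, new Random(42)); // second pass: random order
        System.out.println(Arrays.toString(cumulativeDensity(alphabetical)));
        System.out.println(Arrays.toString(cumulativeDensity(random)));
    }
}
```

Running both passes over the same documents gives the two curves that the method compares; the final point is the same in both because the cumulative totals do not depend on order.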

The same algorithm is used for n-gram analysis, that is, the option of analyzing the frequency of occurrence of word sequences (1-grams, 2-grams, ..., n-grams). The application can count these sequences over a user-defined range of sequence lengths (numbers of words). As with single words (tokens), a graph showing the representativeness of the corpus is displayed both for a random ordering of the files and for an alphabetical ordering by file name. The horizontal axis still shows the number of files processed, and the vertical axis shows the quotient (number of distinct n-grams) / (total number of n-grams). For these purposes, an n-gram is treated as a token. Likewise, the output files generated list the n-grams.
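As a rough illustration of the n-gram variant, the sketch below builds word sequences with a sliding window and computes the distinct/total quotient. The class NGramSketch and its methods are hypothetical names, and joining the words of an n-gram with a single space is an assumption made for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;

/** Illustrative sketch only; names are hypothetical and not taken from ReCor. */
final class NGramSketch {

    /** Returns every run of n consecutive tokens, joined by a single space. */
    static List<String> nGrams(List<String> tokens, int n) {
        if (n < 1) {
            throw new IllegalArgumentException("n must be at least 1");
        }
        List<String> grams = new ArrayList<>();
        // Slide a window of width n one token at a time.
        for (int start = 0; start + n <= tokens.size(); start++) {
            grams.add(String.join(" ", tokens.subList(start, start + n)));
        }
        return grams;
    }

    /** (number of distinct n-grams) / (total number of n-grams); 0 if there are none. */
    static double distinctRatio(List<String> grams) {
        return grams.isEmpty() ? 0.0 : (double) new HashSet<>(grams).size() / grams.size();
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("the", "cat", "sat", "down");
        for (int n = 1; n <= 3; n++) { // user-defined range of lengths
            List<String> grams = nGrams(tokens, n);
            System.out.println(n + "-grams: " + grams + " ratio=" + distinctRatio(grams));
        }
    }
}
```

Treating each n-gram string as a token means the same cumulative types/tokens bookkeeping can be reused unchanged for any n.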

In both the alphabetical and the random analysis there comes a point at which a given document contributes hardly any new types to the corpus. This indicates that an adequate size has been reached, that is, that the analyzed corpus can now be considered a representative sample of the population in statistical terms. In a graphical representation, this is the point at which the types and tokens lines stabilize and approach zero (Figure 2).

If the corpus is truly representative, the graph tends to fall exponentially because the tokens (To) grow much more than the types (Ty) in each iteration: in theory, fewer and fewer new words appear that are not already stored in the data structures used by the program. The corpus can therefore be said to be representative when the graph remains constant at values close to zero. In practice, it is impossible to reach the point where zero types are added to the corpus, since documents will always contain variable items such as numbers, proper names, etc.

If a corpus produces this graphical representation, it can be said to be representative, and X files are sufficient (X being the point on the horizontal axis where the graph stabilizes around zero). In this way the minimum size of the collection is identified, from which it can be considered representative.
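The text leaves the judgment of where the graph stabilizes to visual inspection. Purely as an illustration, one way to approximate that judgment numerically is to look for the document count after which every further step changes the quotient by less than a small tolerance. The class StabilizationSketch, the method stablePoint and the tolerance epsilon are all assumptions; no numeric criterion is defined in the source.

```java
/** Illustrative sketch only; the tolerance-based rule is an assumption, not part of N-Cor. */
final class StabilizationSketch {

    /**
     * Returns the number of documents after which every later step changes the
     * Ty/To quotient by less than epsilon, or -1 if the curve never settles.
     * density[i] is the quotient after i + 1 documents.
     */
    static int stablePoint(double[] density, double epsilon) {
        int candidate = -1;
        for (int i = 1; i < density.length; i++) {
            if (Math.abs(density[i] - density[i - 1]) < epsilon) {
                if (candidate == -1) {
                    candidate = i; // density[i - 1] is the value after i documents
                }
            } else {
                candidate = -1;    // a large step resets the search
            }
        }
        return candidate;
    }

    public static void main(String[] args) {
        double[] curve = {1.0, 0.6, 0.4, 0.39, 0.385, 0.384};
        System.out.println(stablePoint(curve, 0.02));
    }
}
```

A rule like this only flags a candidate value for X; whether the plateau is close enough to zero remains a judgment made on the graph.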


The source code of the N-Cor algorithm is presented below:

[Source code listing not reproduced in this text.]


Implementation Proposal

The N-Cor method and algorithm described above have been implemented on a computer as the ReCor application. It was built with Java 2 SDK, Standard Edition (J2SE), plus JFreeChart, the Java library for graphs and charts. The JCreator Pro environment was used as the Java editor and compiler. As already stated, the system is not organized into packages and has the following main classes:

- Palabras: computation algorithm, file reading and writing.

- Gui: user interface.

- VentanaGrafica: adapter for the graphical representation.

This section covers the UML (Unified Modeling Language) design, along with the main classes created for the application and the development software chosen.


Class Diagram

This section describes the variables and methods of each class created for the application, as well as the interactions between the classes.


1. VentanaGrafica class

This class (Figure 3) is responsible for displaying the graphical representation of the corpus on screen. It is implemented using methods from the JFreeChart library and extends the Java class Frame.


2. OrdenFrecuencia class

This class (Figure 4) is needed to sort the distinct words that appear across the corpus by their total number of occurrences, which is essential for creating one of the output files. It defines the compareTo method of the Comparable interface.
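The ordering role described for this class can be illustrated with a small Comparable implementation. This is a sketch, not the class shown in Figure 4: the name WordCount, its fields, the descending order and the alphabetical tie-break are assumptions, and the generic Comparable<T> form is used here for brevity.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** Hypothetical (word, count) entry ordered by frequency; not the OrdenFrecuencia source. */
final class WordCount implements Comparable<WordCount> {
    final String word;
    final int count;

    WordCount(String word, int count) {
        this.word = word;
        this.count = count;
    }

    /** Higher counts sort first; ties fall back to alphabetical order (an assumption). */
    @Override
    public int compareTo(WordCount other) {
        int byCount = Integer.compare(other.count, this.count);
        return byCount != 0 ? byCount : this.word.compareTo(other.word);
    }

    /** Returns the words of the given entries in frequency order, leaving the input untouched. */
    static List<String> sortedWords(List<WordCount> entries) {
        List<WordCount> copy = new ArrayList<>(entries);
        Collections.sort(copy); // uses compareTo above
        List<String> words = new ArrayList<>();
        for (WordCount entry : copy) {
            words.add(entry.word);
        }
        return words;
    }

    public static void main(String[] args) {
        List<WordCount> entries = Arrays.asList(
            new WordCount("corpus", 2), new WordCount("algorithm", 2), new WordCount("the", 5));
        System.out.println(sortedWords(entries));
    }
}
```

Putting the comparison in compareTo lets the standard library sort routines produce the frequency-ordered word list without any custom sorting code.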


3. Gui class

This class (Figure 5) creates the graphical user interface and defines the buttons, labels, drop-down lists, check boxes, etc. It extends the Java class JFrame.


4. Palabra class

This is the most important class (Figure 6). It analyzes the corpus (for example, it stores the pairs [word, number of occurrences] in a hash table), creates the output files, and calculates the points of the functions for the graphical representation.


5. Controlador class

This class (Figure 7) specifies the action associated with each event that occurs in the graphical user interface. It defines the actionPerformed method of the ActionListener interface.


6. Ruta class

This class (Figure 8), which contains the main method, prepares everything needed to start the application and creates an object of the Gui class (which is affected by the Controlador class).


Specifications

Given the set of documents that make up the corpus, the program extracts information from it into several files, in addition to the two graphs, all of which are used to study the selected corpus.

Input data

a) Corpus files: selection of the set of files that make up the (sub)corpus.

b) Word filter file: an input file listing the words that are to be excluded from the analysis, that is, all the words to be filtered out (separated by a space, comma, period, semicolon, colon or line break).

c) Choice of parameters: choice of the word group size (1, 2, ..., 10), or n-grams. Each group is selected using the sliding window method, without taking into account contexts such as commas, periods, paragraphs, etc. The user can also choose whether or not to filter out numbers for the corpus to be analyzed.
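The filter inputs described above can be sketched as follows. The class FilterSketch and its methods are hypothetical; case-insensitive matching and the pattern used to recognize numbers are assumptions, since the text only says that filter words are separated by a space, comma, period, semicolon, colon or line break and that numbers may optionally be filtered.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Illustrative sketch only; names and matching rules are assumptions. */
final class FilterSketch {

    /** Splits the filter file contents on space, comma, period, semicolon, colon or line break. */
    static Set<String> parseFilter(String filterText) {
        Set<String> words = new HashSet<>();
        for (String w : filterText.split("[ ,.;:\\r\\n]+")) {
            if (!w.isEmpty()) {
                words.add(w.toLowerCase()); // assumed: matching ignores case
            }
        }
        return words;
    }

    /** Drops filtered words and, if requested, tokens that look like numbers. */
    static List<String> apply(List<String> tokens, Set<String> filter, boolean filterNumbers) {
        List<String> kept = new ArrayList<>();
        for (String token : tokens) {
            if (filter.contains(token.toLowerCase())) {
                continue;
            }
            // Assumed number shape: digits, optionally grouped with '.' or ',' (e.g. 1,000 or 3.14).
            if (filterNumbers && token.matches("\\d+([.,]\\d+)*")) {
                continue;
            }
            kept.add(token);
        }
        return kept;
    }

    public static void main(String[] args) {
        Set<String> filter = parseFilter("the, of; and: a.\nan");
        List<String> tokens = Arrays.asList("The", "corpus", "has", "1000", "words");
        System.out.println(apply(tokens, filter, true));
    }
}
```

Filtering happens before counting, so excluded words and numbers contribute to neither types nor tokens in this sketch.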

Output data

The output data comprise a graphical representation of the corpus. The horizontal axis shows the number of files selected up to that point and the vertical axis shows the types/tokens quotient. Two functions are plotted, one for the files ordered by name and another for the files chosen at random. Both functions tend to fall exponentially as more documents are taken. When the functions stabilize, the corpus can be said to be representative, and the approximate number of documents from which this occurs can be determined.

Graphic representation

The graphical representation is what allows the user to decide whether or not a corpus is representative. There are two graphical representations of interest: the first, graphical representation A, places Ty/To on the vertical axis and To on the horizontal axis; the second, graphical representation B, places Ty/To on the vertical axis and the number of files analyzed up to that point on the horizontal axis. The first indicates the minimum number of words the collection should contain, while the second specifies the number of documents or texts.

The corpus can be said to be representative when the graph stabilizes at values close to zero. The graph tends to fall exponentially because the tokens (To) grow much more than the types (Ty) in each iteration: in theory, fewer and fewer new words appear as the lexical density of the incremental subset of documents is analyzed.

Output files

In addition to the graphical representation, information about the corpus is also written to several output files:


a) Output file No. 1

This file contains five columns:

- Ty: the types (distinct words) up to that point.

- To: the tokens (total number of words) up to that point.

- Ty/To: the quotient between the types and the tokens.

- V1: the number of words with only one occurrence up to that point.

- V2: the number of words with only two occurrences up to that point.

It shows the results of two separate analyses, one for the files sorted alphabetically by name and another for the files in random order. Each analysis has as many lines as there are files in the selected corpus, and five columns (those described above).
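The columns of this file can be derived from a word-count table, as in the sketch below. The class OutputRowSketch and the method row are hypothetical names, and the input is assumed to be the list of all tokens accumulated up to the current file.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Illustrative sketch only; names are hypothetical and not taken from ReCor. */
final class OutputRowSketch {

    /** Returns {Ty, To, V1, V2} for the tokens accumulated so far. */
    static long[] row(List<String> tokens) {
        Map<String, Integer> counts = new HashMap<>();
        for (String token : tokens) {
            counts.merge(token, 1, Integer::sum); // [word, number of occurrences]
        }
        long v1 = 0; // words seen exactly once
        long v2 = 0; // words seen exactly twice
        for (int c : counts.values()) {
            if (c == 1) {
                v1++;
            } else if (c == 2) {
                v2++;
            }
        }
        return new long[] {counts.size(), tokens.size(), v1, v2};
    }

    public static void main(String[] args) {
        long[] r = row(Arrays.asList("a", "a", "b", "c", "c", "c", "d"));
        double ratio = r[1] == 0 ? 0.0 : (double) r[0] / r[1]; // the Ty/To column
        System.out.println(Arrays.toString(r) + " Ty/To=" + ratio);
    }
}
```

Calling row once per file on the growing token list would yield one line of the output file per file in the corpus.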


b) Output file No. 2

This file consists of two columns:

- Word: all the distinct words that make up the corpus (sorted alphabetically by document name).

- Occurrences: the number of occurrences of the word in question in the corpus.

c) Output file No. 3

This file consists of two columns:

- Word: all the distinct words that make up the corpus (sorted by number of occurrences).

Claims

1. A computer-implemented method for determining the representativeness of a corpus by executing a program, characterized in that:

- It is independent of the language or text type of the collection of documents analyzed,

- It establishes the minimum threshold of representativeness by means of an algorithm (N-Cor) that analyzes lexical density as a function of the incremental growth of the corpus,

- It comprises input data, output data, a graphical representation, and output files,

- It comprises the gradual analysis of all the files that make up the corpus, extracting information on the frequency of the type words (types) and the occurrences or distinct words (tokens) of each file in the corpus.

- Each file in the corpus must be uniquely identified by an alphanumeric code name; an analysis in alphabetical order is carried out first, followed by an analysis in random order; in each case, and for each document, the tokens, the types and the corresponding lexical density are calculated; which finally yields a graphical representation indicative of the representativeness of the analyzed corpus.

2. A computer-implemented method for determining the representativeness of a corpus by executing a program according to the preceding claim, characterized in that, based on said N-Cor algorithm, the frequency of occurrence of word sequences can be analyzed, and said sequences can be counted over a range of lengths defined by the user.

3. A computer-implemented method for determining the representativeness of a corpus by executing a program according to any of the preceding claims, characterized in that said computer application (ReCor) has been developed in Java 2 SDK, Standard Edition, using the JFreeChart Java library for graphs and charts, and using the JCreator Pro environment as the Java editor and compiler.

4. A computer-based method of determining the representativeness of a corpus by executing a program according to the preceding claim characterized in that it comprises the following classes:

- VentanaGrafica class, whose function is to display the graphical representation of the corpus on screen and which is implemented using methods from the JFreeChart library;

- OrdenFrecuencia class, involved in sorting the words that appear in the corpus by their total number of occurrences;

- Gui class, whose function is to create the graphical user interface;

- Palabra class, involved in analyzing the corpus, creating the output files, and calculating the points of the functions for the graphical representation;

- Controlador class, whose function is to specify the action associated with an event in the graphical user interface;

- Ruta class, which starts the execution of the application and creates an object of the Gui class.

5. Electronic device programmed to determine the representativeness of a corpus according to any of the preceding claims characterized in that it allows establishing the minimum threshold of representativeness through the N-Cor algorithm by means of the execution of the ReCor computer application.