WO2013179317A1 - A method for generating a graphical user interface for the optimization of a research on databases

Info

Publication number
WO2013179317A1
Authority
WO
WIPO (PCT)
Prior art keywords
server
term
function
terms
nodes
Prior art date
Application number
PCT/IT2012/000160
Other languages
French (fr)
Inventor
Giuseppe NADDEO
Original Assignee
Naddeo Giuseppe
Priority date
Filing date
Publication date
Application filed by Naddeo Giuseppe
Priority to PCT/IT2012/000160
Publication of WO2013179317A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/31 Indexing; Data structures therefor; Storage structures
    • G06F16/313 Selection or weighting of terms for indexing
    • G06F16/34 Browsing; Visualisation therefor

Definitions

  • the present invention refers to the technical field relative to the realization and display of databases.
  • the invention refers to an innovative method for producing a graphical user interface that represents a database, so as to allow an easy, targeted search of the data it contains, for example CVs, articles, musical pieces, stories and text contents in general.
  • the result of the search provides a first element of potential interest and a second element, surely of less interest for the person who makes the search.
  • the search is therefore limited, at the moment, to the insertion of one or more key words, and the software extracts the resulting data without any kind of filter. It is therefore clear that current search systems do not allow targeted and specific searches.
  • the search is in fact a simple extrapolation of a pre-determined number of data stored in the database.
  • the person that searches a particular content is therefore obliged to read all the texts extracted and to select, among all of them, those that are the most related to the search according to him.
  • Equivalent search systems are instead structured using a cataloguing of the data according to categories and subcategories. In this case, therefore, the possibilities of extrapolation of data of interest increase. On the other hand, however, the management of the taxonomy of the categories requires huge efforts both in the definition and the management and update.
  • the feed-in of said data by a user is currently static.
  • the user arranges an assembly of data, generally in electronic format, which is loaded in the database.
  • a weight function f (Ti)
  • the method through the nodes, allows an immediate display of a summary of the most relevant data filed.
  • the spatial arrangement of the nodes starting from a central node connected to peripheral nodes, allows to understand which of the nodes is to be selected in order to extrapolate the files of interest exclusively.
  • the connection "government - ship" may indicate a maritime context
  • the connection "government - kitchen", the restaurant industry; the connection "government - nation", the field of politics.
  • the weight function (f (Ti) ) is implemented by the server (10) in such a way as to consider at least one of said characteristics:
  • the pair function ( FC ) (but it will be explained more in detail below) associates a weight or importance to each pair of terms previously processed by the weight function.
  • the system through the implementation of one or more of said functions, is therefore capable of generating a relative spatial graph also in the case in which, for example, no user has highlighted any term or has made any search.
  • the system is in fact capable of realizing a classification of the most recurring terms in the text, through the implementation of the appropriate function, and therefore of realizing a graph simply on the basis of the diffusion of the terms in the text.
  • this diffusion function can then be associated to a further function that considers the frequency of highlighting of a term by the user within a loaded text and/or the frequency of search and selection in a node.
  • both the frequency of highlighting and the frequency of selection of a term in a node when a user makes a search, allow an automatic update of the graph which can therefore be modified automatically by moving a node that was first peripheral towards the centre. For example, if through the pre-set function that counts the selections it is verified that a peripheral node is actually the one that results the most selected by the users to make a search, then such a function allows a modification of the graph that moves the peripheral node, rendering it central.
  • Such a method therefore considers interactions of the user both for the preferences indicated and for the terms searched and, on the basis of this feedback, re-defines the display, updating the graph continuously.
  • FC pair function
  • the user can therefore see a graphic that shows key terms that for him can be of interest in the search.
  • Such key terms, contained in each node, are then arranged in the spatial structure in order of importance from the central one towards the external ones.
  • the user can in this way click the nodes of interest that carry the words searched by him and obtain, like a direct connection, the extrapolation of all the files that contain those terms.
  • the direct connection between the nodes highlights the correlation of context between the contents of the nodes in a visual and immediate manner.
  • FIG. 1 shows a screen for the insertion of personal data to have access to the system
  • FIG. 2 shows the load phase of a text file, for example a CV
  • FIG. 3 shows the highlighting phase of the terms considered important by the user that enters his own file
  • FIG. 7 shows a central server connected to the network that allows the access to said system by external users that want to make use of it;
  • FIG. 8 shows the result of a search as a consequence of the selection of a certain node
  • FIG. 13 is a matrix of the pair function in accordance with the second variant.
  • FIG. 14 shows a screen through which it is possible to filter the search
  • FIG. 15 shows the screen that allows to provide the user with a summary chart
  • FIG. 16 shows a macroscopic flow chart that leads to the generation of the graph in accordance with the invention
  • FIG. 20 shows a simplification example of the graph by reducing the number of connections between the nodes through the application of a known algorithm.
  • a system that allows to file and process data in a dynamic way is foreseen, allowing a display of such data in a graphic and immediate manner through a node-structure graph.
  • a common window 1 which allows a user to make his own registration through the entry of one or more personal data. All the requested fields are therefore filled in, such as name and surname, for the individuation and cataloguing of the subject.
  • the page is generated by a software that can be implemented within a server.
  • the server can be of the central type, or connected to the Internet. In this way, the resulting system is accessible to different users located at a distance from one another.
  • the term "users" means not only those people who want to enter a text file (for example a CV) in order to optimize the highlighting of their credentials, but also those people who want to consult this database.
  • Each user can therefore make a connection to the Internet from his own PC and have access to the system which, in the case of data entry in general, requires an initialization as per figure 1.
  • the system implements the possibility of loading one's own CV (or any other type of electronic text file), either on the same page or on a dedicated page, as shown for example in figure 2.
  • Figure 2 therefore shows a page 2 that highlights a screen through which it is possible to load, in a known way, an electronic file, for example the CV.
  • the window therefore allows the user to access the file system of the personal computer on which the programme runs and to load the desired file.
  • the file loaded is then processed by the system in order to contribute to the construction and/or the update of a graph as the one in figure 5, wherein the display of nodes is included, starting from a central node that is connected to peripheral nodes, and wherein each node highlights a functional term indicative for the person who wants to make a pointed search.
  • the graph generated is not static but dynamic, in the sense that it is modified continuously on the basis of:
  • the first aspect which is described right below, refers precisely to a function that allows a user to highlight, when entering a text, one or more terms considered relevant by him. Said terms have therefore a specific importance in the generation of the graph, as clarified below. For example, in case of loading a CV, terms considered relevant may be highlighted for the purpose of underlining particular features.
  • the system generates a screen that visualizes the content of the file loaded and memorized by the server 10.
  • the user has the possibility of indicating to which terms the server must give greater importance in the process, that is which are the terms that should be memorized.
  • the server 10 memorizes all the terms of the text and, obviously, memorizes also the terms that have been selected by the user. The words highlighted are therefore memorized together with the rest of the text by the server.
  • the system in the process, takes into consideration all the words, but will give a greater importance to the ones selected by the user in the creation and update of the overall graph of figure 5, as described in detail below.
  • the graphical user interface, controllable for example through the mouse or the keyboard of the PC, has for example the shape of a marker or highlighter (see the marker in figure 3), but any other shape could be used.
  • the graphical user interface is therefore in fact a normal input that indicates to the system how certain words must be memorized to be then re-processed in accordance with a specific function described below.
  • Figure 4 shows, just by way of example, a chart in which six users are present, who have entered their files (for example, their CV) in the programme.
  • the number of six users is not at all limiting and is here indicated just as a way of example.
  • the chart of figure 4 for example, extrapolates parts of the text contained in the file in order to make understand the operational logic of the system.
  • the chart shows, for each user, a part of CV loaded by the user and the words highlighted by him.
  • the system is capable of memorizing all the words by distinguishing those highlighted, and therefore marked by the user with the marker.
  • Figure 4 shows, through the chart, how the first user (User 1) has highlighted the word "programmer” and "java".
  • the chart shows in fact, in the relative column "CV", a part of the text just by way of example and, in the adjacent column, the words highlighted. User number two has done the same.
  • user number three and user number four, instead of the word "java", have underlined the word ".net".
  • the user number five and the user number six instead, have highlighted, in addition to the word programmer, the word "javascript”.
  • the system is capable of organizing and creating a visual structure appropriately organized that takes into consideration said highlightings through an appropriate function.
  • the aspect of the highlighting is just one of the elements that are implemented, with an appropriate function, in order to optimize the creation of the graph of figure 5.
  • the system, through other functions, is able to take into consideration also the frequency of search of a term (therefore the selection of a node at the moment of the search) and the diffusion of a term in the loaded texts.
  • the system moreover keeps track of the presence/absence of all the possible pairs of the relevant terms, that is the terms that result from the process of the preceding functions.
  • the definition of such pairs and of their relative importance then allows the graph to be constructed. This allows not only the realization of a functional graph, but also its continuous update on the basis of the searches made (selection of nodes by the users), on the basis of the loading of new text files with any highlighting of terms, and on the basis of the diffusion of terms in the text itself.
  • the system therefore implements these functions, obtaining at the end a graphic structure 5, as per figure 5, which puts at the centre the most recurring and/or searched and/or highlighted word and which is connected to the other more peripheral words.
  • the graph of figure 5 translates what is shown in the chart of figure 4.
  • the central node 6 that is generated therefore carries the term "programmer", because this is the most recurring word and the most highlighted.
  • the calculation functions therefore take into consideration the highlighting and the diffusion of the same word.
  • the word programmer is highlighted in all the texts and present in all the texts. The result of the functions is therefore that by which the term programmer is placed at the centre of the graph.
  • the central node is connected directly to some peripheral nodes, which in this case are "java", “javascript” and ".net”. This is clear from the chart 4, for said terms are always highlighted but less diffused.
  • the other nodes are still more peripheral because they are not highlighted and are present only in some loaded files, not in all of them, and are therefore less diffused. These are therefore connected with the nodes with which they have direct correspondence on the basis of the presence/absence of the terms in the same file.
  • the final graph is therefore a visual representation of the data entered, highlighted and searched.
  • Each node of the graph is a word to which the system has associated a particular importance or weight through the diffusion, the highlighting of it and the number of times that it has been searched.
  • the analytic system that allows this graphic generation is the following.
  • the fundamental concept is precisely that of statistically generating a function, called "Weight" function, which is capable of analyzing all the terms contained in a loaded text, in order to construct the node structure containing an indicative term functional to the search and connected to the texts that contain said term.
  • the pair function instead, defines the direct connections between the nodes, both in the initial creation of the graph and in its subsequent updates, in such a way as to create a correlation of context in a visual way.
  • said "Weight" function is based not only on the frequency with which a term has been underlined, but also on the frequency with which this term has been searched and, above all, on how diffused it is in the text. As a result, the graph of figure 5 can be updated continuously, not only through the introduction of new files by the users, but also through a real-time check of the terms most searched by those looking for a type of profile.
  • $f(T_i) = \mathrm{Tf}(T_i) \times [\log(\mathrm{InvTd}(T_i)) + 1] \times (\mathrm{IndTag}(T_i) + 1) \times (\mathrm{IndSearch}(T_i) + 1)$
  • the first parameter is a "Search index" function, called "IndSearch" in the weight formula "f(Ti)" indicated above, which indicates the frequency with which a user who consults said database searches the texts for a certain term.
  • This parameter in the end, counts and memorizes the number of times a specific node containing a specific term, for example the term "programmer", has been selected for the search.
  • such a "Search index" function allows to update a graph on the basis of the number of times a term in a node is selected, said function contributing to modify the value of the Weight function every time a search is made.
  • the weight function f(Ti), responsible for the generation of the graph of figure 5, is updated continuously on the basis of each selection of the node that is made.
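  • A minimal sketch of this continuous update, assuming the part of the weight that does not depend on user feedback, Tf(Ti) x [log(InvTd(Ti)) + 1], has already been computed and is passed in as `base`. The function name `central_term` and the dictionary layout are hypothetical and not taken from the patent text.

```python
def central_term(base, ind_tag, ind_search):
    """Return the term with the highest weight f(Ti).

    base       -- hypothetical dict: term -> Tf(Ti) * (log(InvTd(Ti)) + 1)
    ind_tag    -- dict: term -> number of times the term was highlighted
    ind_search -- dict: term -> number of times its node was selected
    """
    weights = {
        term: value * (ind_tag.get(term, 0) + 1) * (ind_search.get(term, 0) + 1)
        for term, value in base.items()
    }
    # The heaviest term is the candidate for the central node.
    return max(weights, key=weights.get)


base = {"programmer": 0.5, "java": 0.25}
before = central_term(base, {}, {})
# Two selections of the "java" node raise its weight above "programmer".
after = central_term(base, {}, {"java": 2})
```

  • In this sketch a peripheral term becomes the central one as soon as its selection count outweighs its lower diffusion, which is the behaviour the text describes for the graph update.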
  • a second parameter is the "Frequency of diffusion" of the term within the archive, given by the number of occurrences of the term Ti (Ti is the generic i-th term of the archive) divided by the number of terms present in all the documents of the archive (NT): $\mathrm{Tf}(T_i) = \dfrac{\text{number of occurrences of the term } T_i}{NT}$
  • This second parameter allows to take into consideration, therefore, in the generation of the graph, the number of times a term is present in the assembly of terms of the archive.
  • This function of diffusion Tf(Ti) precisely, memorizes the number of times each term is present in all the documents loaded in the server 10 and memorizes such a result for each term (Ti) .
  • the diffusion of a term is therefore another essential parameter for constructing the graph, since a widely diffused term, for example "programmer", is indicative of an interest for the person who has entered the text file.
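  • A minimal sketch of the diffusion function Tf(Ti), assuming each document has already been split into a list of lowercase terms (the tokenization step and the name `diffusion_frequency` are assumptions, not part of the source).

```python
from collections import Counter


def diffusion_frequency(documents):
    """Tf(Ti): occurrences of Ti in the whole archive divided by NT.

    documents -- list of documents, each a list of terms (assumed pre-tokenized)
    """
    counts = Counter(term for doc in documents for term in doc)
    total_terms = sum(counts.values())  # NT: all terms in all documents
    return {term: n / total_terms for term, n in counts.items()}


tf = diffusion_frequency([["java", "programmer"], ["programmer", "php"]])
```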
  • a third parameter is defined by the function "InvTd(Ti)".
  • This third parameter is conceptually correlated to the preceding one in the sense that the preceding one calculates the diffusion of a term to assign to it a value of importance, while the present one calculates always a diffusion in order to understand if said terms can be discarded because too much diffused and therefore generic.
  • it is necessary to define first the parameter Td(Ti) on which it depends, given by the underlying formula (Ti is always the generic i-th term).
  • the underlying formula represents the frequency of one single term in all the documents loaded (ND is the total number of documents loaded in the server) .
  • InvTd is the inverse of the formula indicated above, that is: $\mathrm{InvTd}(T_i) = 1 / \mathrm{Td}(T_i)$
  • the logarithm of such a value is implemented.
  • This logarithmic function is extremely important in the assembly of the weight function since it serves to determine how much a term is diffused/rare within the entire archive.
  • a term that is extremely diffused across all the loaded files would have a high Td(Ti), therefore an InvTd value close to 1 and the log of such a value close to zero.
  • This means that the overall weight function does not vary on the basis of a word that is substantially common in all the archive files and therefore that is not distinctive of its own text file.
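  • A minimal sketch of the factor log(InvTd(Ti)) + 1. The formula for Td(Ti) is not preserved in this text, so the sketch assumes Td(Ti) is the number of documents containing Ti divided by ND, which is consistent with the statement that a term present in all files gives InvTd close to 1. The natural logarithm and the name `inverse_document_factor` are also assumptions.

```python
import math


def inverse_document_factor(documents):
    """Return log(InvTd(Ti)) + 1 for every term.

    Assumes Td(Ti) = (documents containing Ti) / ND, so InvTd(Ti) = ND / count.
    documents -- list of documents, each a list of terms
    """
    nd = len(documents)  # ND: total number of loaded documents
    doc_counts = {}
    for doc in documents:
        for term in set(doc):  # count each term once per document
            doc_counts[term] = doc_counts.get(term, 0) + 1
    return {term: math.log(nd / n) + 1 for term, n in doc_counts.items()}


factor = inverse_document_factor([["address", "engineer"], ["address", "java"]])
```

  • Under these assumptions a term found in every document, such as "address" here, contributes a neutral factor of 1, while rarer terms receive a larger factor.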
  • a further parameter is "IndTag", which is simply the index that counts the number of times the term has been tagged, that is highlighted by the user. Said index is updated every time the user selects one or more terms and records, for each i-th term, whether it has been highlighted by the user.
  • the weight function introduced above is therefore calculated through the following final formula, as also shown above, and that contains the essential parameters introduced above.
  • the formula therefore takes into consideration the search index (IndSearch), the highlighting index of the term (IndTag), the diffusion of a term in the text (Tf(Ti)) and the diffusion of the same term in all the texts (log(InvTd(Ti))).
  • the same function is not influenced by a common diffusion of the same term in the archive through precisely the logarithmic function of the inverse of Td(Ti) "InvTd(Ti)".
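  • The final formula can be sketched directly as a function of its four parameters. The base of the logarithm is not stated in the text, so the natural logarithm is assumed here, and the name `weight` is illustrative.

```python
import math


def weight(tf, inv_td, ind_tag, ind_search):
    """f(Ti) = Tf(Ti) * [log(InvTd(Ti)) + 1] * (IndTag(Ti) + 1) * (IndSearch(Ti) + 1)."""
    return tf * (math.log(inv_td) + 1) * (ind_tag + 1) * (ind_search + 1)


# A term present in every document (InvTd = 1), never tagged or searched:
untouched = weight(0.5, 1.0, 0, 0)
# The same term after one highlight and two searches:
popular = weight(0.5, 1.0, 1, 2)
```

  • Because both indexes are increased by one, a term that has never been highlighted or searched keeps a non-zero weight determined only by its diffusion.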
  • weight function which creates a classification
  • a minimum threshold of terms for example, the first ten or the first hundred of the classification, with the consequent discard of all the remaining ones. This allows to realize non dispersive graphs.
  • the minimum threshold of terms to take into consideration can be set by "default” or, preferably, set freely by the user through the cursor 60 of figure 14.
  • the user can use such a horizontal cursor on the interface, making it slide in a direction or in the opposite one, and on the basis of this, visualize a graph composed of a pre-determined number of relevant terms, discarding those placed at the end of the classification realized by the weight function.
  • the horizontal cursor gives the density of the graph.
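  • A minimal sketch of the threshold applied to the classification, assuming the weights have already been computed. The name `top_terms` and the `limit` parameter (standing for the value set by default or through the cursor 60) are hypothetical.

```python
def top_terms(weights, limit=10):
    """Keep the first `limit` terms of the classification, discard the rest.

    weights -- dict: term -> value of the weight function f(Ti)
    """
    ranked = sorted(weights.items(), key=lambda item: item[1], reverse=True)
    return [term for term, _ in ranked[:limit]]


visible = top_terms({"programmer": 6.0, "java": 3.0, "address": 0.1}, limit=2)
```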
  • the term engineer is a feature that the user will probably highlight as an aspect of interest.
  • The aim of the Tf(Ti) function is to give a certain importance to the term "engineer", while the logarithm of the inverse function, log(InvTd(Ti)), must "discard" the term "address", since it is absolutely common and not relevant to the search.
  • Tf(Ti) is instead 1 both for the term “engineer” and for the term “address”.
  • the system constructs a graph with the nodes "address” and "engineer” correlated between them because they have the same importance. Subsequently, as the system is used through searches or new loaded files, the highlights and/or the searches of the users will confer greater importance to a term or to another one, allowing a modification of the graph.
  • the final weight is therefore implemented in such a way as to generate and modify through time in a dynamic manner the graph as per figure 5.
  • the weight function can be implemented according to a second formula as indicated below:
  • $f(T_k) = (\mathrm{IndTag}(T_k) + 1) \times (\mathrm{IndSearch}(T_k) + 1)$ wherein, however, the term Tk that is implemented is that which satisfies the following condition:
  • the denominator of the fraction is increased by 1 to avoid the division by zero.
  • temp(Ti) values can be those close to 1 or equal to 1.
  • Other values could be comprised between 0.7 and 1.3 or between 0.8 and 1.2.
  • the values indicated above, both the range and the specific values, are indicative of the calculation of a weight function just for relevant terms.
  • the IndTag and IndSearch indexes are increased by one because initially these values can be equal to zero in the case in which the term examined has not been tagged or searched yet.
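  • A minimal sketch of the second weight formula. The definition of temp(Ti) is not preserved in this text, so it is taken here as an already computed input; the range 0.8 to 1.2 is one of the ranges mentioned above, and the name `second_weight` and the tuple layout are hypothetical.

```python
def second_weight(terms, low=0.8, high=1.2):
    """f(Tk) = (IndTag(Tk) + 1) * (IndSearch(Tk) + 1) for the relevant terms only.

    terms -- dict: term -> (temp, ind_tag, ind_search), where temp is the
             precomputed temp(Ti) value (its formula is not reproduced here)
    """
    return {
        term: (ind_tag + 1) * (ind_search + 1)
        for term, (temp, ind_tag, ind_search) in terms.items()
        if low <= temp <= high  # keep only terms whose temp value is close to 1
    }


scores = second_weight({"java": (1.0, 1, 2), "address": (3.0, 0, 0)})
```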
  • This function is therefore calculated subsequently to when the weight function has been implemented and, thus, has been created a first index of importance of the terms contained in the texts.
  • IndexMAX Index1 > Index2 > Index3 > Index4 > Index5 > ... IndexN and wherein the weight functions are the following:
  • the pair function must now realize the connections between a node and another one in such a way as to generate the complete correlation.
  • the pair function which connects a node to another one, therefore verifies the coexistence of the terms of great importance within the texts.
  • a square matrix is constructed where the value of the cell (i, j) indicates in how many documents both the term Ti and the term Tj (see figure 6) are present.
  • the system therefore implements the construction of the graph on said terms.
  • said square matrix is constructed, wherein the lines and the columns show the terms with a certain importance.
  • the matrix will have a shape as the one shown below:
  • the software places in the cells of the matrix the number of files that contain the line-column connection.
  • just one document contains the pair "programmer-java" and a single document contains the pair "programmer-php".
  • the matrix is calculated for all the terms for which there is a weight value for which it is necessary to understand which to eventually omit.
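  • A minimal sketch of the pair matrix, assuming documents are lists of terms and that only the terms retained by the weight function are passed in. Since the matrix is symmetric, the sketch stores only one cell per unordered pair, keyed by the alphabetically sorted pair; this storage choice and the name `cooccurrence` are assumptions.

```python
from itertools import combinations


def cooccurrence(documents, terms):
    """Count, for each pair (Ti, Tj), the documents that contain both terms.

    documents -- list of documents, each a list of terms
    terms     -- the relevant terms selected by the weight function
    """
    relevant = set(terms)
    counts = {}
    for doc in documents:
        present = sorted(set(doc) & relevant)
        for a, b in combinations(present, 2):
            counts[(a, b)] = counts.get((a, b), 0) + 1
    return counts


pairs = cooccurrence(
    [["programmer", "java"], ["programmer", "php"]],
    ["programmer", "java", "php"],
)
```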
  • the system puts in order the pairs Ti, Tj according to weight values calculated as initially indicated.
  • the pair of terms are redundant.
  • the graph consists of three nodes (a, b, c), as per figure 20, all three connected: a-b, b-c and c-a.
  • the arc a-b has a value of relevance of 50, b-c of 30 and c-a of 20.
  • the pair c-a is redundant and can be eliminated.
  • An algorithm that is well known in computing as "minimum spanning tree" can then be used and can be found, for example, on the following Internet site: http://it.wikipedia.org/wiki/Albero_ricoprente.
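  • A minimal sketch of this simplification using Kruskal's algorithm with a union-find structure. The text only names the spanning-tree family of algorithms, so the choice of Kruskal is an assumption; edges are visited from the most relevant to the least relevant, which is equivalent to a minimum spanning tree on negated relevance values. The name `prune_redundant_edges` is hypothetical.

```python
def prune_redundant_edges(edges):
    """Keep a spanning tree made of the most relevant connections.

    edges -- list of (node_a, node_b, relevance) tuples
    """
    parent = {}

    def find(node):
        # Union-find lookup with path halving.
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]
            node = parent[node]
        return node

    kept = []
    for a, b, relevance in sorted(edges, key=lambda e: e[2], reverse=True):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:  # the edge joins two separate components
            parent[root_a] = root_b
            kept.append((a, b, relevance))
    return kept


tree = prune_redundant_edges([("a", "b", 50), ("b", "c", 30), ("c", "a", 20)])
```

  • With the three-node example above, the arc c-a is dropped because a and c are already connected through b.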
  • the pair function could be calculated instead by constructing a matrix (N+1) x M, as per figure 13, where N+1 is the number of terms and M is the number of documents.
  • the value of the cell i,j indicates the frequency of the term Ti in the document
  • the system puts in order the pairs Term - i, Term - j according to the values f (Ti, Tj ) calculated in the preceding point.
  • the system is capable of obtaining each time a graphic as that of figure 5.
  • the graph is therefore constructed in the following way.
  • the server contains a certain number of documents and, for each document, the user has selected, that is highlighted, a certain number of terms of primary importance.
  • the server executes at this point one of the two weight functions described and obtains the list, that is the classification, of the terms of greater importance.
  • the server also implements one of the two pair functions described above for realizing the connections.
  • a company that searches for a professional figure can select only the headings of interest to it and obtain a number of profiles that are pre-filtered from the start. For example, those who search for a "java" and "portlet" programmer will be able to obtain only the requested profiles (which include the characteristics "programmer" - "java" - "portlet" together) by selecting both the "portlet" icon and the "java" one.
  • the selection of the nodes allows a direct connection to the files loaded and connected to them, therefore filtered.
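  • A minimal sketch of this filtering, assuming the server keeps, for each loaded file, the list of terms it contains. The name `select_files` and the dictionary layout are hypothetical.

```python
def select_files(files, selected_terms):
    """Return the files that contain every term of the selected nodes.

    files          -- dict: file name -> list of terms found in that file
    selected_terms -- terms of the nodes clicked by the user
    """
    wanted = set(selected_terms)
    return [name for name, terms in files.items() if wanted <= set(terms)]


files = {
    "cv1": ["programmer", "java", "portlet"],
    "cv2": ["programmer", "java"],
}
matches = select_files(files, ["java", "portlet"])
```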
  • the graphic scheme allows to those who want to use the database to visualize immediately the data that could be of their interest.
  • the system can without problems be centralized through the Internet, as shown in the scheme of figure 7, allowing the user to enter his own data comfortably from his personal computer and, on the other hand, allowing the consulting of them to any interested person.
  • a central server 10 can be foreseen, which implements the functionality through an appropriate software.
  • the server is connected to the Internet in such a way as to be accessible by one or more users provided with an appropriate PC or electronic device that can have an Internet connection. Once the connection to the server is made, the pages described in figures 1, 2, 3 and 5 are opened in succession, thus allowing to enter data or to consult a database on the basis of one's own needs.
  • figure 7 therefore also shows, on the one hand, a user that enters his own data through his own electronic processor 20 and, at the same time, a third user (for example a company) who views the system while searching for a pre-determined profile, both being connected to the server 10.
  • Figure 8 therefore highlights a possible screen in which all the results obtained by the selection of a certain node are highlighted and therefore where it is possible to download the files requested.
  • the first line highlights the texts that contain the term "jQuery", that is those texts in which the term "jQuery” is present and the same term has been contextually highlighted by the users.
  • the system suggests such texts among the first results because the user searches "jQuery" and the candidate has underlined precisely "jQuery".
  • the second line shows texts that contain "jQuery” which, however, has not been highlighted by the candidate that entered the CV.
  • the system considers such results of second level because the user searches "jQuery" but the candidate has not indicated it as preference.
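  • A minimal sketch of this two-level ordering of results, assuming the server stores for each file both its terms and the subset highlighted by the person who loaded it. The name `rank_results` and the tuple layout are hypothetical.

```python
def rank_results(files, term):
    """Split matching files into first-level and second-level results.

    files -- dict: file name -> (terms in the file, terms highlighted by its author)
    term  -- the term of the selected node
    """
    first, second = [], []
    for name, (terms, highlighted) in files.items():
        if term not in terms:
            continue  # the file does not contain the term at all
        if term in highlighted:
            first.append(name)   # present and highlighted
        else:
            second.append(name)  # present but not highlighted
    return first, second


cvs = {
    "cv1": (["jquery", "java"], ["jquery"]),
    "cv2": (["jquery"], []),
    "cv3": (["php"], []),
}
first_level, second_level = rank_results(cvs, "jquery")
```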
  • the system allows then to browse the map of the nodes and to widen the view of an area of interest, as per figure 9.
  • the system also allows to re-centre a word with respect to the entire graph.
  • the system uses the same principle described, but creating a tree with the new centred word.
  • figure 10 shows a diagram with the central node of "programmer". The user can request to centre "java” .
  • the new node is only considered as starting point of the process.
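  • A minimal sketch of re-centring, assuming the connections produced by the pair function are available as a list of node pairs. The text does not say how the new tree is laid out, so a breadth-first visit from the new centre is used here as an assumption: the resulting depth of each node can be read as its distance from the centre. The name `recentre` is hypothetical.

```python
from collections import deque


def recentre(edges, centre):
    """Return the distance of every reachable node from the new central node.

    edges  -- list of (node_a, node_b) connections of the graph
    centre -- the term the user asked to place at the centre
    """
    neighbours = {}
    for a, b in edges:
        neighbours.setdefault(a, []).append(b)
        neighbours.setdefault(b, []).append(a)

    depth = {centre: 0}
    queue = deque([centre])
    while queue:
        node = queue.popleft()
        for nxt in neighbours.get(node, []):
            if nxt not in depth:
                depth[nxt] = depth[node] + 1
                queue.append(nxt)
    return depth


layout = recentre(
    [("programmer", "java"), ("programmer", "php"), ("java", "portlet")],
    "java",
)
```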
  • the graph has processed an assembly of technical CV in which the central node is "engineering” and the peripheral ones “computing", “building”, “electronics”, “civil”, “spatial”, etc.
  • the user that searches can be interested only in civil engineers, and not in the others. In that case, the user can select the node "civil” and request the system to exploit it.
  • the system can exploit the graph "underlying” and show: “building", “hydraulics", "structures",
  • the user can create a search filter, as shown in figure 14, which displays the part of the graph that satisfies the condition of filter-threshold set.
  • the filter represents the threshold indicated above in the text in order to create the graph, that is to consider a threshold both in the weight function (for example the first 10 classified) and in the pair function.
  • Figure 14 shows, as a way of example, a sliding bar 60 and a window 50 which allows the entry of key words to make the filters.
  • the filter can be made on the basis of the number of words to be displayed.
  • the graph can also be filtered by using the attributes of the personal data that the user fills in when he enters the text.
  • the system can associate to the single text different types of personal data. As a way of example, but not limiting, we can consider personal data that shows an address (city, postal code, street, etc.); the system can display the graph generated considering just and only the texts for which the attribute city is "rome”.
  • Figure 15 shows a further implementation of the system that allows the person who has entered a file, or who has consulted the database, to receive a summary chart in return.
  • the summary chart shows the graph generated with the functions described before but in this case the system has generated the graph by processing the single text taken, for which to generate the summary chart, and not the entire archive.
  • the system can implement other functions such as those of contextual help.
  • the system interacts with the user at the moment of the search and at the moment of the selection of the key words to highlight.
  • the system having already processed a graph and therefore having already created a classification of importance and pairs as described above, can suggest that the user highlights certain key words (obviously if present in the text in the load phase) that have contributed to the construction of the graph.
  • the system, through an e-mail or a notice on the graphical user interface, can confirm to the user the validity of the choice T1, but also suggest the highlighting of T11 and T12 because they are present in the text and frequently searched by the users.
  • the system can interact with the user to help him with the search. If the user has selected T1 and the system has returned too many results, the system can suggest the further selection of T12 and T11, which are directly connected to T1 but are more specific and will therefore lead to a more targeted search.
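The contextual help described in the last item can be sketched as follows. This is only an illustration under stated assumptions: the function name `suggest_refinements`, the `neighbours` mapping and the `max_results` cut-off are hypothetical and do not appear in the patent, which does not say how "too many results" is decided.

```python
def suggest_refinements(selected, neighbours, result_count, max_results=20):
    """Return more specific terms to propose when a search is too broad.

    `neighbours` maps a term to the terms directly connected to it in the graph.
    `max_results` is an assumed cut-off for "too many results".
    """
    if result_count <= max_results:
        return []  # the search is already narrow enough, nothing to suggest
    return sorted(neighbours.get(selected, []))


# T1 is directly connected to the more specific terms T11 and T12.
neighbours = {"T1": ["T12", "T11"]}
broad = suggest_refinements("T1", neighbours, result_count=150)
narrow = suggest_refinements("T1", neighbours, result_count=5)
```

With a broad search the sketch proposes the connected terms; with a narrow one it proposes nothing.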

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention concerns an innovative method for the creation of an electronic archive that can be consulted, comprising the operations of: load and memorization of one or more text files (FT) in a server (10); generation and display by the server (10) of a diagram (5) formed by one or more nodes (6) placed in the space and each graphically containing a term (Ti) present in one or more of the loaded text files (FT) and indicative for the search to be carried out in the archive, the nodes being connected between them starting from a central node (6) until one or more peripheral nodes; and wherein said spatial generation of the diagram (5) is obtained through the implementation by the server (10) of a weight function (f(Ti)) that extrapolates an index of importance of the terms (Ti) contained in each text (FT) loaded in the server (10) to generate the nodes (6), and of a pair function (FC) that connects the generated nodes between them.

Description

A METHOD FOR GENERATING A GRAPHICAL USER INTERFACE FOR THE OPTIMIZATION OF A RESEARCH ON DATABASES
Technical field
The present invention refers to the technical field relative to the realization and display of databases.
In particular, the invention refers to an innovative method for the realization of a graphical user interface that identifies a database, in such a way as to allow to make an easy search aimed at the data contained in it, as for example CV, articles, musical pieces, stories and text contents in general.
Background art
Computerized databases that allow the collection and search of filed data, such as text files, have long been known. Such data is organized and made available to organizations that use them in order to be able to consult them and give support to their clients. Nevertheless, the current data collection system is absolutely static and flat in the sense that it is limited to a collection, in electronic format, of data received and catalogued. This implies a number of limits that will be described in an exemplifying manner right below.
The search of said files always takes place by key word. If a search is carried out by inserting the term "java", the result of the search could provide:
- A CV of a "java" expert, therefore a long and continuing experience;
- A CV of a person that sat an exam of "java" in the past, therefore a brief and distant experience;
It is therefore clear that in this case, in accordance with the traditional search system, the result of the search provides a first element of potential interest and a second element, surely of less interest for the person who makes the search.
Similarly, the insertion of the word "aeronautical" could give as a result the CV of a person graduated in aeronautical engineering (therefore experience related to the terms searched) or the CV of a person that lives in Aeronautics Street (therefore a CV not related at all to what has been searched).
The search is limited at the moment, therefore, to the insertion of one or more key words, and the software extracts the resulting data without any kind of filter. It is therefore clear that the current search systems do not in any way allow targeted and specific searches.
The search is in fact a simple extrapolation of a pre-determined number of data memorized in the database. The person that searches for a particular content is therefore obliged to read all the texts extracted and to select, among all of them, those that in his opinion are the most related to the search.
The result is therefore not guaranteed, and it is by no means certain that, after a careful and detailed selection, the person that makes the search finds among the extracted CVs the one that suits him best, although he has been obliged to spend time reading and studying them.
Equivalent search systems are instead structured using a cataloguing of the data according to categories and subcategories. In this case, therefore, the possibilities of extrapolation of data of interest increase. On the other hand, however, the management of the taxonomy of the categories requires huge efforts both in the definition and the management and update.
In this case, therefore, the efficiency of the system is directly proportional to the quality of the cataloguing defined.
Also the feed-in of said data by a user is currently static. The user arranges an assembly of data, generally in electronic format, which is loaded in the database. There is no way of entering data that allows to give a greater importance to certain terms, nor a system that is capable, at least visibly, of making these terms emerge.
Disclosure of invention
It is therefore the aim of the present invention to provide a data management method that solves at least in part said inconveniences.
It is therefore the aim of the present invention to provide a method that allows a search of the data through the generation of a graphical user interface which displays a summary of the data filed and, according to a graph structure, highlights in a visual way the most relevant data contained in the archive and the relations between such data.
In particular, it is the aim of the present invention to provide a realization method of a data archive in such a way as to generate a graphical user interface that allows an immediate display of a summary of the data arranged according to a tree hierarchy, highlighting in a visual way the distinctive features of the text/s contained in the archive.
It is therefore the aim of the present invention to provide as a consequence a search method of data filed that is quick, intuitive and efficient.
These and other aims are therefore obtained with the present method for the creation of an electronic archive that can be consulted and comprising the operations of:
- Load and memorization of one or more text files (FT) on a server (10);
- Generation and display by the server (10) of a diagram (5) formed by one or more nodes (6) placed in the space and containing graphically, each one of them, a term indicative of some text file present in the archive, therefore functional for the search to make in the archive, the nodes being connected between them starting from a central node (6) to one or more peripheral nodes;
- And wherein said spatial generation of the diagram (5) is obtained through the implementation by the server (10) of a weight function (f(Ti)) that extrapolates an index of importance of the terms (Ti) contained in each text (FT) loaded in the server (10) to realize the nodes, and of a pair function (FC) that extrapolates a connection between the nodes generated.
It is therefore clear that all the aims pre-established by the present invention have been reached.
In particular, the method, through the nodes, allows an immediate display of a summary of the most relevant data filed. In particular, the spatial arrangement of the nodes, starting from a central node connected to peripheral nodes, allows to understand which of the nodes is to be selected in order to extrapolate the files of interest exclusively.
The selection of one or more nodes, therefore, extrapolates the related text files, while on the basis of the connections shown, the user will be able to understand the field of the contents of the data filed.
For example:
the connection "government - ship" may indicate a maritime context;
the connection "government - kitchen", the restaurant industry;
the connection "government - nation", the field of politics.
In this way, it is the system that offers the user the possibility of having at a glance both the most relevant terms in the archive and the relations between them. Moreover, the system offers the possibility of selecting these terms and carrying out a quick and efficient search by selecting only what is of interest.
This solution is therefore very versatile in its operation.
Advantageously, the weight function (f(Ti)) is implemented by the server (10) in such a way as to consider at least one of said characteristics:
- the diffusion of a term within the loaded texts;
- the frequency (IndSearch(Ti)) of selection of a node during a search carried out by a user;
- the highlighting (IndTag(Ti)) of one or more terms within a text loaded in the server (10), said operation of highlighting of the term including a display of the text file loaded and the possibility of highlighting one or more terms within it, which are memorized by the server (10).
Likewise, the pair function (FC) (but it will be explained more in detail below) associates a weight or importance to each pair of terms previously processed by the weight function.
In this way, the system, through the implementation of one or more of said functions, is therefore capable of generating a relative spatial graph also in the case in which, for example, no user has highlighted any term or has made any search. The system is in fact capable of realizing a classification of the most recurring terms in the text, through the implementation of the appropriate function, and therefore of realizing a graph simply on the basis of the diffusion of the terms in the text.
To render the system much more efficient and functional, this diffusion function can then be associated to a further function that considers the frequency of highlighting of a term by the user within a loaded text and/or the frequency of search and selection in a node.
In particular, both the frequency of highlighting and the frequency of selection of a term in a node, when a user makes a search, allow an automatic update of the graph which can therefore be modified automatically by moving a node that was first peripheral towards the centre. For example, if through the pre-set function that counts the selections it is verified that a peripheral node is actually the one that results the most selected by the users to make a search, then such a function allows a modification of the graph that moves the peripheral node, rendering it central.
The same thing occurs through the highlighting of the term of a text in the loaded text file. As new files are loaded, which contain one or more highlighted terms, these terms are computed, allowing a current update of the graph.
Such a method therefore considers interactions of the user both for the preferences indicated and for the terms searched and, on the basis of this feedback, re-defines the display, updating the graph continuously.
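The effect of user feedback on the position of a node can be illustrated with a minimal sketch. It assumes, as in the weight formula given later in the description, that each selection multiplies a term's weight by (IndSearch(Ti) + 1); the names `central_term` and `base_weights` and the numeric values are hypothetical.

```python
def central_term(base_weights, ind_search):
    """Return the term that would occupy the central node.

    `base_weights` holds each term's weight before any search feedback;
    `ind_search` holds how many times each node has been selected.
    """
    def score(term):
        # Each selection scales the weight by (IndSearch + 1).
        return base_weights[term] * (ind_search.get(term, 0) + 1)

    return max(base_weights, key=score)


base_weights = {"programmer": 0.5, "java": 0.2}  # illustrative values only
before = central_term(base_weights, {})            # no searches yet
after = central_term(base_weights, {"java": 3})    # "java" selected three times
```

Before any search, "programmer" is central; after three selections of the "java" node its score (0.2 x 4 = 0.8) exceeds 0.5 and the peripheral node becomes central.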
Obviously, the system could be implemented without problems by considering even only one of the functions specified above, that is just the diffusion function of a term in the text, or the highlighting of a term in the text by the user who inserts the text, or the frequency of search of a term by selecting the appropriate node.
The whole is completed by the process of the pair function (FC) that determines the most relevant associations between the terms of greater importance.
In an immediate way, the user can therefore see a graphic that shows key terms that for him can be of interest in the search. Such key terms, contained in each node, are then arranged in the spatial structure in order of importance from the central one towards the external ones. The user can in this way click the nodes of interest that carry the words searched by him and obtain, like a direct connection, the extrapolation of all the files that contain those terms.
The direct connection between the nodes highlights the correlation of context between the contents of the nodes in a visual and immediate manner.
It is clear that the present invention lends itself well not only to a search of CV, indicated here just for exemplificative and not limiting purposes, but also to the search of any other type of document or file in general.
Further advantages can be deduced from the remaining dependent claims.
Brief description of drawings
Further characteristics and advantages of the present method, according to the invention, will result clearer with the description that follows of one of its embodiments, made to illustrate but not to limit, with reference to the annexed drawings, wherein:
- Figure 1 shows a screen for the insertion of personal data to have access to the system;
- Figure 2 shows the load phase of a text file, for example a CV;
- Figure 3 shows the highlighting phase of the terms considered important by the user that enters his own file;
- Figure 4 shows a chart to explain the way in which the scheme is generated;
- Figure 5 shows the node-like scheme that is generated;
- Figure 6 shows the calculation matrix for the graph;
- Figure 7 shows a central server connected to the network that allows the access to said system by external users that want to make use of it;
- Figure 8 shows the result of a search as a consequence of the selection of a certain node;
- Figure 9 shows the possibility of zoom of the programme;
- Figure 10 and figure 11 show the possibility of being able to regenerate the scheme 5 starting not from the central node suggested by the system, but instead from a node chosen by the user.
- Figure 12 is an example of construction of the graph;
- Figure 13 is a matrix of the pair function in accordance with the second variant.
- Figure 14 shows a screen through which it is possible to filter the search;
- Figure 15 shows the screen that allows to provide the user with a summary chart;
- Figure 16 shows a macroscopic flow chart that leads to the generation of the graph in accordance with the invention;
- Figures from 17 to 19 show other flow charts relative to the weight function, the pair function and the selection of a node from the generated graph.
- Figure 20 shows a simplification example of the graph by reducing the number of connections between the nodes through the application of a known algorithm.
Description of some preferred embodiments
A structural and a functional description of the invention is indicated hereinafter.
In accordance with the invention, a system that allows to file and process data in a dynamic way is foreseen, allowing a display of such data in a graphic and immediate manner through a node-structure graph.
With particular reference to figure 1, the generation of a common window 1 is shown, which allows a user to make his own registration through the entry of one or more personal data. All the requested fields are therefore filled in, such as name and surname, for the individuation and cataloguing of the subject.
The page is generated by a software that can be implemented within a server. The server can be of the central type, or connected to the Internet. In this way, the system generated results accessible by different users placed at a distance one from the other. In the present description, the term "users" means not only those people who want to enter a text file (for example a CV) in order to optimize the highlighting of their credentials, but also those people who want to consult this database.
Each user can therefore make a connection to the Internet from his own PC and have access to the system which, in the case of data entry in general, requires an initialization as per figure 1.
The system implements the possibility of loading one's own CV (or any other type of electronic text file) either on the same page or on a dedicated page, as for example shown in figure 2.
Figure 2 therefore shows a page 2 that highlights a screen through which it is possible to load, in a known way, an electronic file, for example the CV. The window therefore allows access to the file system of one's own personal computer on which the programme runs, in order to load the desired file.
The file loaded is then processed by the system in order to contribute to the construction and/or the update of a graph as the one in figure 5, wherein the display of nodes is included, starting from a central node that is connected to peripheral nodes, and wherein each node highlights a functional term indicative for the person who wants to make a pointed search.
The graph generated is not static but dynamic, in the sense that it is modified continuously on the basis of:
1) New files loaded;
2) The highlights made by the users, that is the choice of the users to highlight or give importance to some terms;
3) The frequency of search of a term by the external users that consult the database and;
4) The diffusion itself of a single term, that is its recurrence, within the loaded texts.
A specific function, as better explained in detail below, will deal with the creation of connections between a node and another one.
To that aim, specific functions are implemented, described below in detail, and of which each one is suitable for considering said parameters in order to create a classification of importance of the terms and of their connections (pairs) and thus construct or update continuously a graph, as the one in the example of figure 5.
The first aspect, which is described right below, refers precisely to a function that allows a user to highlight, when entering a text, one or more terms considered relevant by him. Said terms have therefore a specific importance in the generation of the graph, as clarified below. For example, in case of loading a CV, terms considered relevant may be highlighted for the purpose of underlining particular features.
This allows to optimize in a visual manner some aspects considered relevant to the user that enters a text and, at the same time, such particular features become visually available for the person that consults the database. Moreover, this allows the user to indicate to the system to which terms he wants to give greater importance when the system processes the graph.
As shown in figure 3, the system generates a screen that visualizes the content of the file loaded and memorized by the server 10. At this point, the user has the possibility of indicating to which terms the server must give greater importance in the process, that is which are the terms that should be memorized. The server 10 memorizes all the terms of the text and, obviously, memorizes also the terms that have been selected by the user. The words highlighted are therefore memorized together with the rest of the text by the server. The system, in the process, takes into consideration all the words, but will give a greater importance to the ones selected by the user in the creation and update of the overall graph of figure 5, as described in detail below.
The graphical user interface, controllable, for example, through the mouse or the keyboard of the PC, has the shape of a marker or highlighter, for example (see the marker in figure 3), but any other shape could be used. The graphical user interface is therefore in fact a normal input that indicates to the system how certain words must be memorized to be then re-processed in accordance with a specific function described below.
Figure 4 shows, just by way of example, a chart in which six users are present, who have entered their files (for example, their CVs) in the programme. The number of six users is not at all limiting and is indicated here just by way of example. The chart of figure 4, for example, extracts parts of the text contained in the file in order to explain the operational logic of the system.
The chart shows, for each user, a part of CV loaded by the user and the words highlighted by him. The system is capable of memorizing all the words by distinguishing those highlighted, and therefore marked by the user with the marker.
Figure 4 shows, through the chart, how the first user (User 1) has highlighted the words "programmer" and "java". The chart in fact shows in the relative column "CV" a part of the text, just by way of example, and, in the adjacent column, the words highlighted. User number two has done the same. User number three and user number four, instead of the word "java", have highlighted the word ".net". User number five and user number six, instead, have highlighted, in addition to the word "programmer", the word "javascript".
In this way, as shown in figure 5, the system is capable of organizing and creating a visual structure appropriately organized that takes into consideration said highlightings through an appropriate function.
The aspect of the highlighting is just one of the elements that are implemented, with an appropriate function, in order to optimize the creation of the graph of figure 5. The system, through other functions, is able to take into consideration also the frequency of search of a term (therefore selection of a node at the moment of the search) and of the diffusion of one term in the loaded texts.
The system, moreover, keeps track of the presence/absence of all the possible pairs of the relevant terms, that is the terms that result from the process of the preceding functions. The definition of such pairs and of the relative importance allows then to construct the graph. This allows not only the realization of a functional graph, but also a continuous update of it on the basis of the searches made (selection of nodes by the users), on the basis of the load of new text files with relative eventual highlighting of terms and on the basis of the diffusion of terms in the text itself.
In brief, in fact these functions, which take into consideration precisely the diffusion of a term, highlighting and search, create a classification that is updated continuously at the moment of entering new files and/or during the consultation of the database.
It is clear that, obviously, a user does not have any obligation to highlight a term since, if not highlighted, the functions will anyway give a result that in some way will contribute to the eventual modification of update or creation of the graph.
The system therefore implements these functions, obtaining at the end a graphic structure 5, as per figure 5, which puts at the centre the most recurring and/or searched and/or highlighted word and which is connected to the other more peripheral words.
In the graph of figure 5, therefore, the term "programmer" is placed at the centre because this is the word that is not only highlighted, but is also the most searched and the most diffused.
The implementation of the functions described below therefore allows a continuous update of the graph.
Just to fix ideas, the graph of figure 5 translates what is shown in the chart of figure 4. The central node 6 that is generated therefore contains the term "programmer", because this is the most recurring and the most highlighted word. The calculation functions therefore take into consideration the highlighting and the diffusion of the same word. The word "programmer" is highlighted in all the texts and present in all the texts. The result of the functions is therefore that the term "programmer" is placed at the centre of the graph. Again as shown in figure 5, the central node is connected directly to some peripheral nodes, which in this case are "java", "javascript" and ".net". This is clear from the chart of figure 4, since said terms are always highlighted but less diffused.
The other nodes (in this case "portlet", "servlet", "ajax", "jQuery", "asp" and "basic") are still more peripheral because they are not highlighted and are present only in some loaded files, not in all of them, and are therefore less diffused. These are therefore connected with the nodes with which they have direct correspondence on the basis of the presence/absence of the terms in the same file.
Other terms are discarded by the system because they are articles, simple prepositions, articulated prepositions, or because they are too widely diffused and therefore generic. Think, for example, of the terms "resident", "address", "mobile phone", "street", "driving licence", "military obligation fulfilled" in a CV file.
The final graph is therefore a visual representation of the data entered, highlighted and searched. Each node of the graph is a word to which the system has associated a particular importance or weight through the diffusion, the highlighting of it and the number of times that it has been searched.
At the centre, therefore, is the word 6 with the greatest weight and importance ("programmer"). On a first level are the words of lesser importance ("javascript", etc.) and on a second level the words of secondary importance ("jQuery", etc.). The system creates a graphic structure which is obviously connected directly with the files from which the graph itself has been generated, as described more clearly in the part related to use.
The analytic system that allows this graphic generation is the following.
The fundamental concept is precisely that of statistically generating a function, called "Weight" function, which is capable of analyzing all the terms contained in a loaded text, in order to construct the node structure containing an indicative term functional to the search and connected to the texts that contain said term.
The pair function, instead, defines the direct connections between the nodes, both in the initial creation of the graph and in its subsequent updates, in such a way as to create a correlation of context in a visual way.
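One way to picture the presence/absence bookkeeping behind the pair function is the sketch below. It is an assumption, not the patent's definition of FC: it simply counts, for each pair of relevant terms, the number of files in which both are present. The names `pair_counts` and `relevant_terms` are hypothetical.

```python
from collections import Counter
from itertools import combinations


def pair_counts(documents, relevant_terms):
    """Count in how many documents each pair of relevant terms co-occurs.

    Each document is a collection of terms; pairs are stored with their
    two terms in sorted order so that (a, b) and (b, a) are the same key.
    """
    counts = Counter()
    for doc in documents:
        present = sorted(set(doc) & relevant_terms)
        for a, b in combinations(present, 2):
            counts[(a, b)] += 1
    return counts


documents = [
    {"programmer", "java", "servlet"},
    {"programmer", "java", "portlet"},
    {"programmer", ".net", "asp"},
]
relevant_terms = {"programmer", "java", ".net", "servlet"}
pairs = pair_counts(documents, relevant_terms)
```

In this toy archive "java" and "programmer" appear together in two files, while ".net" and "java" never share a file, so no direct connection between them would be drawn.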
In the preferred embodiment of the invention (first embodiment), said "Weight" function is based not only on the frequency with which a term has been highlighted, but also on the frequency with which this term has been searched and, above all, on how diffused it is in the texts. As a result, the graph of figure 5 can be updated continuously not only through the introduction of new files by the users, but also through a real-time check of the terms most searched by those who look for a type of profile.
Analytically, in the implementation of the software, the so-called weight function is defined by this formula:
f(Ti) = Tf(Ti) × [log(InvTd(Ti)) + 1] × (IndTag(Ti) + 1) × (IndSearch(Ti) + 1)
This function depends on different parameters, which are described below:
The first parameter is a "Search index" function, called "IndSearch" in the weight formula f(Ti) indicated above, which indicates the frequency with which a user who consults said database searches the texts for a certain term. This parameter, in the end, counts and memorizes the number of times a specific node containing a specific term, for example the term "programmer", has been selected for the search.
In such a way, such a "Search index" function allows to update a graph on the basis of the number of times a term in a node is selected, said function contributing to modify the value of the Weight function every time a search is made.
Consequently, the weight function f(Ti), responsible for the generation of the graph of figure 5, is updated continuously on the basis of each selection of the node that is made.
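A minimal sketch of the "Search index" bookkeeping follows. The in-memory `Counter` and the function name `record_node_selection` are assumptions made for illustration; the patent only states that the server counts and memorizes the selections.

```python
from collections import Counter

# Hypothetical in-memory store of IndSearch(Ti); a real server would persist it.
ind_search = Counter()


def record_node_selection(term: str) -> int:
    """Increment and return the search index of the term in the selected node."""
    ind_search[term] += 1
    return ind_search[term]


record_node_selection("programmer")
record_node_selection("programmer")
record_node_selection("java")
```

Terms whose node has never been selected keep an index of zero, which is why the weight formula uses the factor (IndSearch(Ti) + 1).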
A second parameter is the "Frequency of diffusion" of the term within the archive, given by the number of occurrences of the term Ti (Ti is the generic ith term of the archive) divided by the number of terms present in all the documents of the archive (NT):
Tf(Ti) = (number of occurrences of the term Ti) / NT
This second parameter allows to take into consideration, therefore, in the generation of the graph, the number of times a term is present in the assembly of terms of the archive.
This diffusion function Tf(Ti) counts the number of times each term is present in all the documents loaded in the server 10 and memorizes the result for each term (Ti). The diffusion of a term is therefore another essential parameter for constructing the graph, since a widely diffused term, for example "programmer", is indicative of an interest for the person who has entered the text file.
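The diffusion parameter can be computed as in the sketch below. It assumes that NT is the total count of term occurrences in the archive (the text could also be read as the number of distinct terms) and that each document has already been split into terms; the name `term_frequency` is hypothetical.

```python
from collections import Counter


def term_frequency(documents: list[list[str]]) -> dict[str, float]:
    """Return Tf(Ti) = occurrences of Ti / NT for every term in the archive.

    NT is taken here as the total number of term occurrences (an assumption).
    """
    counts = Counter(term for doc in documents for term in doc)
    nt = sum(counts.values())
    return {term: n / nt for term, n in counts.items()}


documents = [
    ["programmer", "java", "servlet"],
    ["programmer", ".net", "asp"],
]
tf = term_frequency(documents)
```

Here "programmer" accounts for two of the six occurrences in the archive, so its Tf is twice that of any other term.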
A third parameter is defined by the function "InvTd(Ti)". This third parameter is conceptually correlated to the preceding one, in the sense that the preceding one calculates the diffusion of a term in order to assign it a value of importance, while the present one also calculates a diffusion, but in order to understand whether said terms can be discarded because they are too widely diffused and therefore generic. To determine said third parameter, it is necessary to define first the parameter Td(Ti), on which it depends and which is given by the formula below (Ti is always the generic ith term).
The formula below represents the frequency of one single term in all the documents loaded (ND is the total number of documents loaded in the server):
Td(Ti) = (number of documents that contain the term Ti) / ND
InvTd is the inverse of the formula indicated above, that is:
InvTd(Ti) = 1 / Td(Ti) = ND / (number of documents that contain the term Ti)
In the weight function the logarithm of such a value is used. This logarithmic function is extremely important within the weight function since it serves to determine how diffused or rare a term is within the entire archive. A term that is present in nearly all the loaded files (for example, "resident" in a CV file) would have a high Td(Ti), therefore an InvTd value close to 1 and the log of such a value close to zero. This means that the overall weight function does not vary on the basis of a word that is substantially common to all the archive files and that is therefore not distinctive of its own text file.
A further parameter is "IndTag", which is simply the index that counts the number of times the term has been tagged, that is highlighted by the user. Said index is updated every time the user selects one or more terms, and memorizes for each ith term whether it has been highlighted by the user.
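The behaviour of the logarithmic factor can be checked with a short sketch. The base of the logarithm is not stated in the text, so base 10 is an assumption here, as are the names `inv_td` and `rarity_factor` and the sample documents.

```python
import math


def inv_td(term, documents):
    """InvTd(Ti) = ND / number of documents that contain the term."""
    containing = sum(1 for doc in documents if term in doc)
    if containing == 0:
        raise ValueError(f"term {term!r} is not present in the archive")
    return len(documents) / containing


def rarity_factor(term, documents):
    """The bracketed factor of the weight formula: log(InvTd(Ti)) + 1."""
    return math.log10(inv_td(term, documents)) + 1  # base 10 is an assumption


documents = [
    {"resident", "programmer", "java"},
    {"resident", "programmer", ".net"},
    {"resident", "designer"},
    {"resident", "java"},
]
common = rarity_factor("resident", documents)   # present in every file
medium = rarity_factor("java", documents)       # present in half of the files
rare = rarity_factor("designer", documents)     # present in a single file
```

A term present in every file gets a factor of exactly 1, so it receives no boost, while rarer terms get progressively larger factors.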
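Using the highlights of the six users described for the chart of figure 4, the IndTag index can be tallied as in this sketch. The dictionary layout and variable names are hypothetical; only the highlighted words come from the description.

```python
from collections import Counter

# Highlights per user, as described for the chart of figure 4.
highlights_by_user = {
    "User 1": ["programmer", "java"],
    "User 2": ["programmer", "java"],
    "User 3": ["programmer", ".net"],
    "User 4": ["programmer", ".net"],
    "User 5": ["programmer", "javascript"],
    "User 6": ["programmer", "javascript"],
}

# IndTag(Ti): number of times each term has been highlighted.
ind_tag = Counter(
    term for terms in highlights_by_user.values() for term in terms
)
most_tagged = ind_tag.most_common(1)[0][0]
```

The tally gives "programmer" six highlights and two each to "java", ".net" and "javascript", which matches the central position of "programmer" in figure 5.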
The weight function introduced above is therefore calculated through the following final formula, as also shown above, and that contains the essential parameters introduced above.
f(Ti) = Tf(Ti) × [log(InvTd(Ti)) + 1] × (IndTag(Ti) + 1) × (IndSearch(Ti) + 1)
The formula, therefore, takes into consideration the search index (IndSearch), the highlighting index of the term (IndTag), the diffusion of a term in the text (Tf(Ti)) and the diffusion of the same term in all the texts (log(InvTd(Ti))). The function is not influenced by a term that is common throughout the archive, precisely because of the logarithm of the inverse of Td(Ti), "InvTd(Ti)".
As a consequence, the more a term in a text has been tagged and/or searched (or selected in the node) and/or diffused, the greater its importance will be.
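Putting the four parameters together, the weight formula can be sketched as below. Several points are assumptions rather than statements of the patent: NT is taken as the total number of term occurrences, the logarithm is base 10, an absent term gets weight 0, and the function name `weight` and the sample data are hypothetical.

```python
import math


def weight(term, documents, ind_tag, ind_search):
    """f(Ti) = Tf(Ti) x [log(InvTd(Ti)) + 1] x (IndTag(Ti) + 1) x (IndSearch(Ti) + 1)."""
    all_terms = [t for doc in documents for t in doc]
    containing = sum(1 for doc in documents if term in doc)
    if containing == 0:
        return 0.0  # assumption: a term absent from the archive has no weight
    tf = all_terms.count(term) / len(all_terms)   # Tf(Ti), NT = total occurrences
    inv_td = len(documents) / containing          # InvTd(Ti) = ND / documents with Ti
    return (
        tf
        * (math.log10(inv_td) + 1)                # base 10 is an assumption
        * (ind_tag.get(term, 0) + 1)
        * (ind_search.get(term, 0) + 1)
    )


documents = [["programmer", "java"], ["programmer", ".net"]]
ind_tag = {"programmer": 2, "java": 1}   # illustrative highlight counts
ind_search = {"programmer": 3}           # illustrative selection counts

w_programmer = weight("programmer", documents, ind_tag, ind_search)
w_java = weight("java", documents, ind_tag, ind_search)
w_net = weight(".net", documents, ind_tag, ind_search)
```

With these illustrative counts "programmer" scores 0.5 x 1 x 3 x 4 = 6.0, well above "java" and ".net", so it would be placed at the centre.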
It is clear that the weight function, which creates a classification, will take into consideration a certain number of relevant terms. A minimum threshold of terms to be considered is therefore included (for example, the first ten or the first hundred of the classification, with the consequent discarding of all the remaining ones). This makes it possible to produce non-dispersive graphs.
The minimum threshold of terms to take into consideration can be set by "default" or, preferably, set freely by the user through the cursor 60 of figure 14.
In this case, the user can use such a horizontal cursor on the interface, making it slide in a direction or in the opposite one, and on the basis of this, visualize a graph composed of a pre-determined number of relevant terms, discarding those placed at the end of the classification realized by the weight function. Actually, the horizontal cursor gives the density of the graph.
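The effect of the cursor can be sketched as a simple cut on the classification produced by the weight function. The function name `top_terms`, the `limit` parameter standing for the cursor position, and the weights are hypothetical.

```python
def top_terms(weights: dict[str, float], limit: int) -> list[str]:
    """Keep the `limit` highest-weighted terms and discard the rest."""
    ranked = sorted(weights, key=weights.get, reverse=True)
    return ranked[: max(limit, 0)]


weights = {"programmer": 6.0, "java": 0.65, ".net": 0.33, "asp": 0.1}
dense = top_terms(weights, limit=4)    # cursor moved towards "more terms"
sparse = top_terms(weights, limit=2)   # cursor moved towards "fewer terms"
```

Sliding the cursor only changes `limit`; the classification itself is left untouched, so the graph can be redrawn at a different density without recomputing the weights.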
In a practical example, to clarify the implementation of the functions described above, the following takes place:
Let us suppose there are just two text files, for example two CVs, both of which contain the terms "address" and "engineer".
The term address is present in both texts since, generally, all CVs have a heading that includes the name, surname and address.
The term engineer is a feature that the user will probably highlight as an aspect of interest.
The aim of the Tf(Ti) function is to give a certain importance to the term "engineer", while the logarithmic function of the inverse, log(InvTd(Ti)), must "discard" the term "address", since it is absolutely common and not relevant to the search.
In this sense, on the basis of the formulas indicated above, the system obtains a value of Td(Ti) = 1 (number of documents that contain the term divided by the total number of documents) both for the term "engineer" and for the term "address"; therefore, for both, the logarithmic function of the inverse assumes the value 0.
To such a value, 1 is added to avoid annulling the entire formula.
The value of Tf(Ti) is instead 1 both for the term "engineer" and for the term "address".
Therefore, in an initialization phase, the system constructs a graph with the nodes "address" and "engineer" correlated between them because they have the same importance. Subsequently, as the system is used through searches or newly loaded files, the highlights and/or the searches of the users will confer greater importance to one term or another, allowing a modification of the graph. That is, if a new file is entered in which the term engineer is underlined (or the node with the term engineer is selected), the newly calculated weight function will give the term engineer greater relevance than the term address, consequently producing a new graph with the term engineer at the centre and with the node "address" possibly discarded if, as said above, it has a weight value below a pre-established minimum threshold.
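The arithmetic of this example can be written out as follows. It is assumed here, for illustration only, that Tf and Td remain equal to 1 after the new file is loaded and that the single highlight raises IndTag of "engineer" from 0 to 1.

```latex
\begin{align*}
\text{initially:}\quad
f(\text{engineer}) &= 1 \times [\log(1) + 1] \times (0 + 1) \times (0 + 1) = 1 \\
f(\text{address})  &= 1 \times [\log(1) + 1] \times (0 + 1) \times (0 + 1) = 1 \\
\text{after the highlight:}\quad
f(\text{engineer}) &= 1 \times [\log(1) + 1] \times (1 + 1) \times (0 + 1) = 2 \\
f(\text{address})  &= 1
\end{align*}
```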
The final weight is therefore implemented in such a way as to generate and modify through time in a dynamic manner the graph as per figure 5.
It is clear that the implementation of an overall weight function that takes into consideration the various parameters for the creation of a graph can be different from the one described above.
For example, in a variant of the invention, the weight function can be implemented according to a second formula as indicated below:
f(Tk) = (IndTag(Tk)+1) x (IndSearch(Tk)+1)
wherein, however, the term Tk that is used is that which satisfies the following condition:
Temp(Ti) = Tf(Ti) / (log(InvTd(Ti)) + 1)
In this case, it is a relation between the Tf(Ti) function, which represents the diffusion of the term in the single document, and the known inverse logarithmic value of Td(Ti) that takes into consideration the diffusion of a single term in the archive.
The denominator of the fraction is increased by 1 to avoid the division by zero.
For example, all CV files have the term "residence", but a "java" expert will have the term "java" more frequently than a junior "java" programmer. The "equilibrium" between the two values Tf and InvTd gives the general importance of the term.
At this point, all the terms whose function Temp(Ti) falls within the range shown below are examined:
0,5 < Temp(Ti) < 1,5
Other Temp(Ti) values can be those close to 1 or equal to 1. Other values could be comprised between 0,7 and 1,3 or between 0,8 and 1,2. The values indicated above, both the range and the specific values, are indicative of the calculation of a weight function just for relevant terms.
In the weight function formula in accordance with said variant, the IndTag and IndSearch indexes are increased by one because initially these values can be equal to zero in the case in which the term examined has not been tagged or searched yet.
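A possible sketch of this variant is shown below. The function names, the dictionary layout of `stats`, the sample values and the use of the natural logarithm are assumptions made only for illustration.

```python
import math


def temp(tf: float, td: float) -> float:
    """Temp(Ti) = Tf(Ti) / (log(InvTd(Ti)) + 1); the +1 avoids a division by zero."""
    return tf / (math.log(1.0 / td) + 1.0)


def variant_weights(stats: dict[str, dict], low: float = 0.5, high: float = 1.5) -> dict[str, int]:
    """Keep the terms with low < Temp < high and score them as (IndTag+1) x (IndSearch+1)."""
    result = {}
    for term, s in stats.items():
        if low < temp(s["tf"], s["td"]) < high:
            result[term] = (s["ind_tag"] + 1) * (s["ind_search"] + 1)
    return result


# Hypothetical statistics for two terms.
stats = {
    "java": {"tf": 1.2, "td": 0.5, "ind_tag": 2, "ind_search": 3},
    "residence": {"tf": 0.2, "td": 1.0, "ind_tag": 0, "ind_search": 0},
}
print(variant_weights(stats))  # {'java': 12}
```

Here "residence" falls outside the band (Temp = 0.2) and is not scored, while "java" (Temp of about 0.71) is retained.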
Another important function in order to construct the graph, in particular in order to realize the connections between the nodes, is the pair function, as already said.
This function is therefore calculated after the weight function has been applied and, thus, after a first index of importance of the terms contained in the texts has been created.
For example, below is shown a sequence of terms T considered important by the weight function and with a classification of their importance.
TMAX, T1, T2, T3, T4, T5, ... Tn
Wherein:
IndexMAX > Index1 > Index2 > Index3 > Index4 > Index5 > ... IndexN
and wherein the weight functions result in the following:
f(TMAX) = IndexMAX
f(T1) = Index1
f(T2) = Index2
f(T3) = Index3
f(T4) = Index4
f(Tn) = IndexN
Physically, therefore, the pair function must now realize the connections between a node and another one in such a way as to generate the complete correlation.
The pair function, which connects a node to another one, therefore verifies the coexistence of the terms of great importance within the texts.
To that aim, in a first embodiment, a square matrix is constructed where the value of the cell (i, j) indicates in how many documents both the term Ti and the term Tj (see figure 6) are present.
Suppose, for example, that only two CVs are present, and that the preceding step of calculating the weight functions has produced a classification in which the three terms "programmer", "java" and "php" have the maximum weight values.
The system therefore constructs the graph on said terms. In particular, said square matrix is constructed, wherein the rows and the columns show the terms with a certain importance.
The matrix will have a shape as the one shown below:
[Figure imgf000024_0001: square matrix with the terms "programmer", "java" and "php" on the rows and columns]
The software places in each cell of the matrix the number of files that contain the row-column pair. In the example, just one document contains the pair "programmer-java" and a single document contains the pair "programmer-php".
The result of the matrix suggests that a connection should be drawn between "programmer" and "java", a connection between "programmer" and "php", and no connection between "php" and "java".
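The counting of the matrix cells can be sketched as follows for the two CVs of the example. The function name `cooccurrence` and the representation of each document as a set of terms are assumptions for illustration; only the cells above the diagonal are computed.

```python
from itertools import combinations


def cooccurrence(docs: list[set[str]], terms: list[str]) -> dict[tuple[str, str], int]:
    """c(i, j): number of documents that contain both the term Ti and the term Tj."""
    counts = {pair: 0 for pair in combinations(terms, 2)}
    for doc in docs:
        for first, second in counts:
            if first in doc and second in doc:
                counts[(first, second)] += 1
    return counts


# The two CVs of the example, reduced to their relevant terms.
docs = [{"programmer", "java"}, {"programmer", "php"}]
c = cooccurrence(docs, ["programmer", "java", "php"])
print(c[("programmer", "java")])  # 1
print(c[("programmer", "php")])   # 1
print(c[("java", "php")])         # 0
```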
In practice, the matrix is calculated for all the terms that have a weight value, in order to understand which pairs to eventually omit.
The system therefore calculates the product of the cell (i,j) with the indexes of tag (that is, of highlighting) and of search:
f(Ti, Tj) = c(i,j) x (IndSearch(Ti)+1) x (IndSearch(Tj)+1) x (IndTag(Ti)+1) x (IndTag(Tj)+1)
The system puts in order the pairs Ti, Tj according to weight values calculated as initially indicated.
Let us suppose that the function above gives the following results:
f(programmer, resident) = 1
f(programmer, java) = 5
f(programmer, servlet) = 3
etc.
The order of the pairs will be:
programmer-java
programmer-servlet
programmer-resident.
Therefore, values below an established minimum threshold are discarded.
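The pair function and the subsequent ordering can be sketched as below. The names `pair_weight` and `rank_pairs`, and the threshold value of 2, are hypothetical; the pair values are those of the example above.

```python
def pair_weight(c_ij: int, ind_search_i: int, ind_search_j: int,
                ind_tag_i: int, ind_tag_j: int) -> int:
    """f(Ti, Tj) = c(i,j) x (IndSearch(Ti)+1) x (IndSearch(Tj)+1) x (IndTag(Ti)+1) x (IndTag(Tj)+1)."""
    return (c_ij * (ind_search_i + 1) * (ind_search_j + 1)
            * (ind_tag_i + 1) * (ind_tag_j + 1))


def rank_pairs(weights: dict[tuple[str, str], float], threshold: float) -> list[tuple[str, str]]:
    """Discard the pairs below the threshold and order the rest by decreasing value."""
    kept = [pair for pair, value in weights.items() if value >= threshold]
    return sorted(kept, key=lambda pair: weights[pair], reverse=True)


weights = {
    ("programmer", "resident"): 1,
    ("programmer", "java"): 5,
    ("programmer", "servlet"): 3,
}
# The threshold of 2 is a hypothetical value chosen for the illustration.
print(rank_pairs(weights, threshold=2))
# [('programmer', 'java'), ('programmer', 'servlet')]
```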
The case can also arise in which pairs of terms are redundant. For example, let us suppose that the graph is constituted of three nodes (a, b, c), as per figure 20, and that all three are connected: a-b, b-c and c-a. Let us suppose that the arc a-b has a value of relevance of 50, b-c of 30 and c-a of 20. There would be a triangle connection, like the one in the left part of figure 20. Nevertheless, the pair c-a is redundant and can be eliminated. An algorithm that is well known in computing as "minimum spanning tree" can then be used; it can be found, for example, on the following Internet site: http://it.wikipedia.org/wiki/Albero_ricoprente.
The algorithm above suggests that the arcs a-b and b-c be visualized and that the arc c-a be "discarded", because it has the minimum value and because the connection between c and a is guaranteed by the other two arcs. This has the technical advantage of not overloading the drawing with arcs, visualizing instead "the minimum spanning graph".
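A sketch of the pruning on the three-node example follows. It uses Kruskal's algorithm with the arcs taken in order of decreasing relevance, which is equivalent to a minimum spanning tree when the cost of an arc is taken as the negative of its relevance; that reading of "cost", like the function name `spanning_tree`, is an assumption made for illustration.

```python
def spanning_tree(edges: dict[tuple[str, str], float]) -> list[tuple[str, str]]:
    """Keep the most relevant arcs that connect all the nodes without closing a cycle."""
    parent: dict[str, str] = {}

    def find(node: str) -> str:
        # Union-find lookup with path halving.
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]
            node = parent[node]
        return node

    kept = []
    for (a, b), _relevance in sorted(edges.items(), key=lambda item: item[1], reverse=True):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:  # the arc joins two separate components
            parent[root_a] = root_b
            kept.append((a, b))
    return kept


edges = {("a", "b"): 50, ("b", "c"): 30, ("c", "a"): 20}
print(spanning_tree(edges))  # [('a', 'b'), ('b', 'c')]
```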
In a variant of the invention, the pair function could instead be calculated by constructing a matrix (N+1) x M, as per figure 13, where N+1 is the number of terms and M is the number of documents. The value of the cell i,j indicates the frequency of the term Ti in the document Dj.
Subsequently, for each pair of terms Ti, Tj, the cosine of similarity between the vectors of the terms Ti and Tj in the constructed matrix is calculated.
[Figure imgf000026_0001: formula of the cosine of similarity between the terms Ti and Tj]
Therefore, the system calculates the product of the cosine with the indexes of tag and of search:
f(Ti, Tj) = cosine(Ti,Tj) x (IndSearch(Ti)+1) x (IndSearch(Tj)+1) x (IndTag(Ti)+1) x (IndTag(Tj)+1)
The system puts in order the pairs Term - i, Term - j according to the values f (Ti, Tj ) calculated in the preceding point.
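The cosine variant can be sketched as follows, using the standard definition of cosine similarity (dot product divided by the product of the norms). The function names and the sample frequency vectors are hypothetical; each vector is assumed to hold the frequency of one term in each document.

```python
import math


def cosine(u: list[float], v: list[float]) -> float:
    """Cosine of similarity between two term-frequency vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def pair_weight_cosine(u: list[float], v: list[float], ind_search_i: int,
                       ind_search_j: int, ind_tag_i: int, ind_tag_j: int) -> float:
    """f(Ti, Tj) = cosine(Ti,Tj) x (IndSearch(Ti)+1) x (IndSearch(Tj)+1) x (IndTag(Ti)+1) x (IndTag(Tj)+1)."""
    return (cosine(u, v) * (ind_search_i + 1) * (ind_search_j + 1)
            * (ind_tag_i + 1) * (ind_tag_j + 1))


# Hypothetical frequencies of two terms in three documents.
java = [2, 0, 1]
servlet = [1, 0, 0]
print(round(cosine(java, servlet), 3))  # 0.894
```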
In a further variant of the invention, the same implementation could be provided in a simplified form, that is, based essentially on the index of diffusion and without taking into consideration the index of search and/or the terms highlighted.
In brief, whatever the implementation used might be, the system is capable of obtaining each time a graph like that of figure 5.
The graph is therefore constructed in the following way. The server contains a certain number of documents and, for each document, the user has selected, that is highlighted, a certain number of terms of primary importance.
The server executes at this point one of the two weight functions described and obtains the list, that is the classification, of the terms of greater importance.
Subsequently, the server implements also one of the two pair functions described above for realizing the connections .
Once the list of pairs put in order according to the index associated to the pair itself has been obtained, the system is ready to design the graph of the terms.
Let's suppose, for example, that the system has determined the following list of ordered pairs:
(TMAX, T1), (TMAX, T11), (TMAX, T22), ... (TMAX, Tn)
Then the system puts at the centre the term with the greatest index, TMAX, and the other terms are connected directly to it (see figure 12).
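A minimal sketch of this last step is given below: the term with the greatest weight is chosen as the central node and the ordered pairs become the connections. The function name `build_graph`, the adjacency representation and the sample values are assumptions for illustration only.

```python
def build_graph(weights: dict[str, float],
                ordered_pairs: list[tuple[str, str]]) -> tuple[str, dict[str, set[str]]]:
    """Return the central term (maximum weight) and the adjacency sets of the graph."""
    center = max(weights, key=weights.get)
    adjacency: dict[str, set[str]] = {term: set() for term in weights}
    for a, b in ordered_pairs:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    return center, adjacency


# Hypothetical weights and ordered pairs.
weights = {"programmer": 9.0, "java": 5.0, "servlet": 3.0}
pairs = [("programmer", "java"), ("programmer", "servlet")]
center, adjacency = build_graph(weights, pairs)
print(center)                     # programmer
print(sorted(adjacency[center]))  # ['java', 'servlet']
```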
In use, therefore, a company that is searching for a professional figure can select only the headings of interest to it and obtain a number of profiles already filtered from the start. For example, those who search for a "java" programmer with "portlet" skills will be able to obtain only the requested profiles (which include the characteristics "programmer", "java" and "portlet" together) by selecting both the "portlet" icon and the "java" one. The selection of the nodes gives direct access to the loaded files connected to them, which are therefore filtered.
Those who are interested in just a "java" programmer profile will instead select only the corresponding node.
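The filtering obtained by selecting nodes can be sketched as follows. The file names and the representation of each file as a set of terms are hypothetical.

```python
def matching_documents(docs: dict[str, set[str]], selected: set[str]) -> list[str]:
    """Return the names of the files that contain every selected term."""
    return sorted(name for name, terms in docs.items() if selected <= terms)


# Hypothetical archive of three CVs.
docs = {
    "cv_anna.txt": {"programmer", "java", "portlet"},
    "cv_bruno.txt": {"programmer", "java"},
    "cv_carla.txt": {"programmer", "php"},
}
print(matching_documents(docs, {"java", "portlet"}))  # ['cv_anna.txt']
print(matching_documents(docs, {"java"}))             # ['cv_anna.txt', 'cv_bruno.txt']
```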
The graphic scheme allows those who want to use the database to visualize immediately the data that could be of interest to them.
Statistically, with the system of calculation thus implemented, almost all or all the documents extrapolated will contain data relative to the terms directly connected between them.
Likewise, those who enter their own profiles in the database will be able to do so in such a way that these profiles highlight particular features they consider important.
The system, as indicated at the beginning of the description, can without problems be centralized through the Internet, as shown in the scheme of figure 7, allowing the user to enter his own data comfortably from his personal computer and, on the other hand, allowing the consulting of them to any interested person.
In this sense, a central server 10 can be provided, which implements the functionality through appropriate software. The server is connected to the Internet in such a way as to be accessible by one or more users provided with an appropriate PC or electronic device that can have an Internet connection. Once the connection to the server is made, the pages described in figures 1, 2, 3 and 5 are opened in succession, thus allowing the user to enter data or to consult a database on the basis of his own needs.
Figure 7 also shows, on the one hand, a user who enters his own data through his own electronic processor 20 and, at the same time, a third user (for example a company) who views the system while searching for a pre-determined profile, both being connected to the server 10.
Figure 8 therefore highlights a possible screen in which all the results obtained by the selection of a certain node are highlighted and therefore where it is possible to download the files requested.
Therefore, by referring to figure 5, in case the node "jQuery" is selected, a screen will appear as that of figure 8 that highlights the results, for example by line, and with the results thus organized:
The first line highlights the texts that contain the term "jQuery", that is those texts in which the term "jQuery" is present and the same term has been contextually highlighted by the users. The system suggests such texts among the first results because the user searches "jQuery" and the candidate has underlined precisely "jQuery".
The second line shows texts that contain "jQuery" which, however, has not been highlighted by the candidate that entered the CV. The system considers such results of second level because the user searches "jQuery" but the candidate has not indicated it as preference.
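The two levels of results can be sketched as follows. The dictionary layout (a set of contained terms and a set of highlighted terms per file) and the file names are assumptions made for illustration.

```python
def tiered_results(docs: dict[str, dict[str, set[str]]], term: str) -> tuple[list[str], list[str]]:
    """Split the files containing `term` into those where it is highlighted and those where it is not."""
    first, second = [], []
    for name in sorted(docs):
        entry = docs[name]
        if term not in entry["terms"]:
            continue  # the file does not contain the term at all
        (first if term in entry["highlighted"] else second).append(name)
    return first, second


# Hypothetical archive.
docs = {
    "cv_anna.txt": {"terms": {"jQuery", "java"}, "highlighted": {"jQuery"}},
    "cv_bruno.txt": {"terms": {"jQuery", "php"}, "highlighted": {"php"}},
    "cv_carla.txt": {"terms": {"java"}, "highlighted": {"java"}},
}
print(tiered_results(docs, "jQuery"))  # (['cv_anna.txt'], ['cv_bruno.txt'])
```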
The selection of a line, as per figure 8, gives access to the texts which can be downloaded and/or printed.
The system, with known software technologies, then allows the user to browse the map of the nodes and to widen the view of an area of interest, as per figure 9.
As per figures 10 and 11, the system also allows a word to be re-centred with respect to the entire graph.
In this case, the system uses the same principle described, but creating a tree with the new centred word.
In particular, figure 10 shows a diagram with the central node of "programmer". The user can request to centre "java" .
The new node is only considered as starting point of the process.
For example, the graph has been processed from a set of technical CVs in which the central node is "engineering" and the peripheral ones are "computing", "building", "electronics", "civil", "spatial", etc. A user who is searching may be interested only in civil engineers, and not in the others. In that case, the user can select the node "civil" and request the system to expand it. At this point, the system can expand the "underlying" graph and show: "building", "hydraulics", "structures", "architecture", etc.
In that case, the formulas indicated above are applied again for the generation of the sub-graph.
With known computing technologies, the user can create a search filter, as shown in figure 14, which displays the part of the graph that satisfies the condition of filter-threshold set.
The filter represents the threshold indicated above in the text in order to create the graph, that is to consider a threshold both in the weight function (for example the first 10 classified) and in the pair function.
Figure 14 shows, as a way of example, a sliding bar 60 and a window 50 which allows the entry of key words to make the filters.
In particular, the filter can be made on the basis of the number of words to be visualized.
The graph can also be filtered by using the attributes of the personal data that the user fills in when he enters the text. The system can associate different types of personal data with each single text. By way of a non-limiting example, we can consider personal data that shows an address (city, postal code, street, etc.); the system can display the graph generated considering only the texts for which the attribute city is "rome".
Figure 15, moreover, shows a further implementation of the system that allows the person who has entered a file or who has consulted the database to receive in return a summary chart.
The summary chart shows the graph generated with the functions described before, but in this case the system has generated the graph by processing only the single text for which the summary chart is requested, and not the entire archive.
The system can implement other functions such as those of contextual help. In this case, the system interacts with the user at the moment of the search and at the moment of the selection of the key words to highlight.
For example, at the moment of the entry of a new text file, the system, having already processed a graph and therefore having already created a classification of importance and pairs as described above, can suggest that the user highlights certain key words (obviously if present in the text in the load phase) that have contributed to the construction of the graph.
Let us suppose that the user has selected the term T1 (e.g. "java") and that this term is connected on the graph to two terms T11 (e.g. "servlet") and T12 (e.g. "portlet") which, however, have not been selected by the user.
In this case, the system, through an e-mail or a notice on the graphical user interface, can confirm to the user the validity of the choice T1, but also suggest the highlighting of T11 and T12 because they are present in the text and much searched by the users.
In the search phase, instead, with reference to figure 14, the system can interact with the user to help him with the search. If the user has selected T1 and the system has given too many results, the system can suggest the further selection of T11 and T12, which are directly connected to T1 but are more specific and will therefore give rise to a more targeted search.
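The contextual help in the load phase can be sketched as follows. The function name `suggest_terms`, the adjacency representation of the graph and the sample data are hypothetical; the sketch simply proposes the neighbours of the selected terms that appear in the text and have not yet been selected.

```python
def suggest_terms(selected: set[str], adjacency: dict[str, list[str]],
                  text_terms: set[str]) -> list[str]:
    """Suggest the terms connected to a selected term, present in the text and not yet selected."""
    suggestions: list[str] = []
    for term in sorted(selected):
        for neighbour in adjacency.get(term, []):
            if (neighbour in text_terms and neighbour not in selected
                    and neighbour not in suggestions):
                suggestions.append(neighbour)
    return suggestions


# Hypothetical graph and text: "java" is connected to three terms, two of which are in the text.
adjacency = {"java": ["servlet", "portlet", "programmer"]}
text_terms = {"java", "servlet", "portlet"}
print(suggest_terms({"java"}, adjacency, text_terms))  # ['servlet', 'portlet']
```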
In the present description, just as a way of example, reference was made to the entry and search of CV.
The example is not limiting for the system that is anyway intended as useful for every text search.
For greater clarity it is here stated that the examples are based on the terms "programmer", ".net", "javascript", "java", "servlet", "portlet", "basic", "asp", "ajax".
The relation between the terms is the following:
".net", "javascript" and "java" are programming languages. "Servlet" and "portlet" are two software modules that can be developed with the programming language "java". "Basic" and "Asp" are two programming languages of the ".net" world. "Ajax" and "jQuery" are two "javascript" libraries.

Claims

1. A method for creating an electronic archive that can be consulted and comprising the operations of:
- Load and memorization of one or more text files (FT) in a server (10);
- Generation and display by the server (10) of a diagram (5) formed by one or more nodes (6) placed in the space and containing graphically, each one of them, a term (Ti) present in one or more of the loaded text files (FT) and indicative for the search to be carried out in the archive, the nodes being connected between them starting from a central node (6) until one or more peripheral nodes;
- And wherein said spatial generation of the diagram (5) is obtained through the implementation by the server (10) of a weight function (f(Ti)) that extrapolates an index of importance of the terms (Ti) contained in each text (FT) loaded in the server (10) to generate the nodes (6) and of a pair function (FC) that connects the generated nodes between them.
2. A method, according to claim 1, wherein the central node highlights the term whose value of the weight function (f(Ti)) is maximum.
3. A method, according to claim 1, wherein the peripheral nodes each highlight a term whose value of the weight function (f(Ti)) is lower than that of the central node.
4. A method, according to claim 1, wherein the peripheral ramification of the nodes is a function of the decreasing value attributed by the weight function (f(Ti)) to each term contained in the node.
5. A method, according to one or more of the preceding claims, wherein the weight function (f(Ti)) is implemented by the server (10) in such a way as to take into consideration at least one of the following characteristics:
- the diffusion of a term within the loaded texts;
- the frequency (IndSearch(Ti)) of selection of a node during a search carried out by a user;
- the highlighting (IndTag(Ti)) of one or more terms within a loaded text in the server (10), said operation of highlighting of the term including a display of the loaded text file and the possibility of highlighting one or more terms within it, which are memorized by the server (10).
6. A method, according to claim 5, wherein the diffusion of a term (Ti) depends on a logarithmic function (log(InvTd(Ti))) that discards the common terms contained in a text and on the function Tf(Ti) that highlights the most recurring terms.
7. A method, according to one or more of the preceding claims, wherein the weight function is implemented by the server (10) in accordance with said formula:
f(Ti) = Tf(Ti) x [log(InvTd(Ti))+1] x (IndTag(Ti)+1) x (IndSearch(Ti)+1)
8. A method, according to one or more of the preceding claims from 1 to 6, wherein the weight function is implemented by the server (10) in accordance with said formula:
f(Tk) = (IndTag(Tk)+1) x (IndSearch(Tk)+1)
and of which the term Tk is the i-th term that responds to the following relation:
Temp(Ti) = Tf(Ti) / (log(InvTd(Ti)) + 1)
and of which:
0,5 < Temp(Ti) < 1,5
9. A method, according to claim 1, wherein the pair function is implemented by the server (10) on the terms selected by the weight function.
10. A method, according to one or more of the preceding claims, wherein the pair function is implemented by the server (10) in such a way as to verify the coexistence in each text of two or more terms selected by the weight function.
11. A method, according to one or more of the preceding claims, wherein the pair function is implemented by the server (10) in accordance with the following formula:
f(Ti, Tj) = c(i,j) x (IndSearch(Ti)+1) x (IndSearch(Tj)+1) x (IndTag(Ti)+1) x (IndTag(Tj)+1)
12. A method, according to one or more of the preceding claims, wherein the pair function is implemented by the server (10) in accordance with the following formula:
f(Ti, Tj) = cosine(Ti,Tj) x (IndSearch(Ti)+1) x (IndSearch(Tj)+1) x (IndTag(Ti)+1) x (IndTag(Tj)+1)
cosine(Ti,Tj): [Figure imgf000036_0001: formula of the cosine of similarity between the terms Ti and Tj]
13. A method, according to one or more of the preceding claims, wherein the server applies to the pair function a further algorithm of the "minimum spanning tree" to reduce the number of nodes.
14. A method, according to one or more of the preceding claims, wherein the selection of one or more nodes allows the extrapolation of one or more texts loaded in the server and containing the terms of the selected nodes .
15. A method, according to claim 1, wherein an operation of filtering for the search is included.
16. A method, according to claim 1, wherein an operation of creation of a summary chart of the selected operation is included.
17. A method, according to one or more of the preceding claims, wherein a phase of contextual help to the user is included, both in the search phase for redefining the result, and in the data load phase for better highlighting the text.
PCT/IT2012/000160 2012-05-30 2012-05-30 A method for generating a graphical user interface for the optimization of a research on databases WO2013179317A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/IT2012/000160 WO2013179317A1 (en) 2012-05-30 2012-05-30 A method for generating a graphical user interface for the optimization of a research on databases

Publications (1)

Publication Number Publication Date
WO2013179317A1 true WO2013179317A1 (en) 2013-12-05

Family

ID=46545434

Country Status (1)

Country Link
WO (1) WO2013179317A1 (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20040059736A1 (en) * 2002-09-23 2004-03-25 Willse Alan R. Text analysis techniques
EP2045734A2 (en) * 2007-10-05 2009-04-08 Fujitsu Ltd. Automatically generating a hierarchy of terms
WO2009158586A1 (en) * 2008-06-27 2009-12-30 Cbs Interactive, Inc. Personalization engine for classifying unstructured documents

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
SALTON G ET AL: "TERM-WEIGHTING APPROACHES IN AUTOMATIC TEXT RETRIEVAL", INFORMATION PROCESSING & MANAGEMENT, ELSEVIER, BARKING, GB, vol. 24, no. 5, 1 January 1988 (1988-01-01), pages 513 - 523, XP002035959, ISSN: 0306-4573, DOI: 10.1016/0306-4573(88)90021-0 *

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 12737356

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 12737356

Country of ref document: EP

Kind code of ref document: A1