EP4133384A1 - Method and computer system for determining the relevance of a text - Google Patents

Method and computer system for determining the relevance of a text

Info

Publication number
EP4133384A1
EP4133384A1 EP21717412.7A EP21717412A EP4133384A1 EP 4133384 A1 EP4133384 A1 EP 4133384A1 EP 21717412 A EP21717412 A EP 21717412A EP 4133384 A1 EP4133384 A1 EP 4133384A1
Authority
EP
European Patent Office
Prior art keywords
text
texts
value
similarity
relevance
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
EP21717412.7A
Other languages
German (de)
English (en)
Inventor
Thomas Nitsche
Oxana NITSCHE
Antonia DÜKER
Raphael NITSCHE HAHN
Maxim NITSCHE HAHN
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Contexo GmbH
Original Assignee
Contexo GmbH
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Contexo GmbH
Publication of EP4133384A1
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G06F16/31 Indexing; Data structures therefor; Storage structures
    • G06F16/38 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking

Definitions

  • the invention relates to a method and a computer system for determining the relevance of a text.
  • the present invention is based on the object of specifying a method and a computer system which enable the relevance of a text to be determined in an efficient manner by comparing it with other texts. This object is achieved by a method with the features of claim 1 and a computer system with the features of claim 23. Refinements of the invention are given in the dependent claims.
  • the present invention then considers, in a first aspect of the invention, a method for determining the relevance of a text.
  • the procedure provides that the similarity of the viewed text with texts from an inventory is first determined.
  • the text is compared with each of the texts in the inventory as part of an individual comparison, with the determination of a similarity value, the similarity value indicating the similarity between two texts in each case.
  • the similarity value of the respective individual comparison carried out is assigned at least to that of the two texts that was published at an earlier point in time or that was recorded for the first time by an acquisition system at an earlier point in time.
  • the similarity values assigned in this way to the viewed text in the individual comparisons form the basis for calculating a relevance value. For example, the similarity values are added or multiplied to form a relevance value, the size of the relevance value indicating the relevance of the text.
  • the calculated relevance value is stored and / or transmitted to a communication end system via a computer network.
  • the relevance value is stored together with the associated text and / or in a profile of the text.
  • the relevance of a text or document is thus calculated on the basis of similarity values, which are determined in individual comparisons between the text under consideration and the texts of an inventory, whereby the time of publication or the first time the respective texts were recorded are taken into account when determining the similarity values.
  • the solution according to the invention makes it possible to provide the texts of an inventory with a ranking, the text with the highest relevance value within the inventory under consideration being at the top of the hierarchy. This allows similar texts to be transparently weighted with regard to their relevance.
  • the solution according to the invention also makes it possible to recognize similarity relationships within a set of texts. It is pointed out that the feature that the relevance value is determined from the similarity values determined in the individual comparisons can include a large number of mathematical operations. In the simplest case, the similarity values are added to a relevance value. However, other ways of deriving the relevance value from the similarity values can also be provided. For example, the relevance value can alternatively be formed from a multiplication of the similarity values, or from a combination of addition and multiplication, or from any formula that has the similarity values as parameters.
  • a variant of the invention provides that the similarity value of an individual comparison is only assigned to the text that was published or recorded at an earlier point in time. In this way, the similarity value takes into account the development over time in the use of a text and similar texts and thus increases the relevance of texts that are ahead of other similar or identical texts.
  • alternatively, the similarity value of an individual comparison is assigned to both texts of the respective individual comparison, the similarity value being given a lower weighting for the text that was published or recorded at a later point in time.
  • the texts of an inventory can, but do not have to, have matching keywords.
  • the relevance value of each document can be zero.
  • refinements provide that the texts of an inventory are linked to one another via at least one matching keyword or the fulfillment of another predefined degree of similarity. This can be the case, for example, if the inventory is filtered from a larger inventory (for example by means of a search engine) in order to carry out the relevance check on a smaller number of texts.
  • the inventory can contain any documents. If there is no similarity between two documents, for example if they do not share two keywords, the similarity is zero.
  • the similarity value of two documents is symmetrical, i.e. the similarity value between two texts determined in an individual comparison is the same for both texts. According to the invention, however, the similarity value is then only assigned to one of the texts when calculating the relevance value, or is more strongly assigned to one of the texts.
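  • The assignment rule described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the `Text` class, the `word_overlap` stand-in similarity, the `later_weight` parameter and the treatment of identical time stamps (handled like the later text) are all assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Callable, Iterable


@dataclass(frozen=True)
class Text:
    ident: str
    first_seen: datetime  # first publication or first acquisition
    body: str


def word_overlap(a: Text, b: Text) -> float:
    """Toy symmetric similarity (Jaccard overlap of word sets); a stand-in only."""
    wa, wb = set(a.body.split()), set(b.body.split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0


def relevance(text: Text, inventory: Iterable[Text],
              similarity: Callable[[Text, Text], float] = word_overlap,
              later_weight: float = 0.0) -> float:
    """Add up the similarity values assigned to `text` in the individual comparisons."""
    total = 0.0
    for other in inventory:
        if other.ident == text.ident:
            continue  # a text is not compared with itself
        s = similarity(text, other)  # same value for both texts of the pair
        if text.first_seen < other.first_seen:
            total += s                 # earlier text receives the full value
        else:
            total += later_weight * s  # later (or same-time) text: reduced or no credit
    return total


inventory = [
    Text("a", datetime(2021, 1, 1), "x y z"),
    Text("b", datetime(2021, 1, 2), "x y z"),
    Text("c", datetime(2021, 1, 3), "x y w"),
]
```

  • With `later_weight=0.0` only the earlier text of each pair is credited; a value between 0 and 1 corresponds to the variant in which the later text is credited with a lower weighting.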
  • Text within the meaning of the present invention is understood to mean any sequence of words that are separated by one or more separators (blank, period, comma, etc.) or whose separation (e.g. Chinese) results from the sense of the text.
  • An example is: “Everything will be fine”.
  • a text within the meaning of the present invention can be a document or a part of a document.
  • the point in time at which a text was first published or recorded results, for example, from corresponding metadata stored with the text or assigned to it.
  • the texts displayed in the context of an RSS feed contain the date and time as a time stamp.
  • the date and time can be determined, for example, by determining when a text was first recorded using a preferably periodic recording system.
  • a document has a date of creation or, in the case of a change, a date of change, for example on a computer or on the Internet.
  • the creation date of a document or text is used as an assigned time stamp, which is then included in the calculation of the relevance value according to the invention.
  • a document does not have a generic time stamp
  • the point in time when the document was first recorded by a periodically crawling web crawler can be used as the relevant point in time or as a time stamp.
  • documents are recorded via RSS feed, for example, a time stamp is assigned to a text, as mentioned.
  • the earliest acquisition date can again be used as a time stamp.
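  • The choice of time stamp described above might be sketched as follows. The function name `effective_timestamp` and its parameters are hypothetical; the order of preference (creation date first, otherwise the earliest acquisition time) is one reading of the text, not a confirmed rule.

```python
from datetime import datetime
from typing import Optional, Sequence


def effective_timestamp(created: Optional[datetime],
                        acquisitions: Sequence[datetime]) -> Optional[datetime]:
    """Return the time stamp to use for a text.

    created:      creation date from the text's metadata, if any
    acquisitions: times at which a crawler or feed reader recorded the text
    """
    if created is not None:
        return created  # metadata time stamp is available
    # No generic time stamp: fall back to the earliest acquisition time.
    return min(acquisitions, default=None)
```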
  • the method according to the invention is used in one embodiment variant in order to better rank search results of a search engine or to rank the results of a search query.
  • search results are ranked according to their relevance to a search text.
  • the solution according to the invention makes it possible to additionally rank according to the originality of a text which is indicated by the relevance value determined according to the invention.
  • the cumulative similarity values or the summation of all determined similarities of a document or text to all other documents or text of an inventory are considered as a relevance value.
  • This relevance value (SRank, “Similarity Rank”), assigned to the document at a point in time, is stored in the document and used when ranking the hit set of a search process. The higher the SRank of a document, the higher up it is placed in the hit list.
  • a ranking list or a ranking is provided by a search engine in response to a search query which contains at least one keyword as a search term that is contained in the text. It can be provided that the relevance value is only one of the criteria of the search engine for the order in the hit list, so that in addition to the relevance value, further parameters or individual criteria are included in the determination of the order of the hit list.
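  • A hit list that uses the SRank as one criterion among others could be ordered as in the sketch below. The linear combination of a query score with the SRank and the `srank_weight` parameter are assumptions for illustration; the text does not specify how the criteria are combined.

```python
from typing import List, Tuple

Hit = Tuple[str, float, float]  # (document id, query score, SRank)


def rank_hits(hits: List[Hit], srank_weight: float = 1.0) -> List[Hit]:
    """Order hits by query score plus weighted SRank, highest first."""
    return sorted(hits, key=lambda h: h[1] + srank_weight * h[2], reverse=True)


hits = [("a", 0.5, 2.0), ("b", 0.9, 0.5), ("c", 0.5, 0.0)]
```

  • Setting `srank_weight=0.0` reproduces a ranking by query score alone, which makes the effect of the SRank on the order easy to compare.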
  • Further applications relate to a ranking in the classification of messages that are presented to a reader, this optionally being able to take place in the context of a search query, that is to say representing an application of a ranking with a search engine.
  • the first message in a message chain is relevant because it presents the oldest text or the text with the greatest originality.
  • news articles are ranked based on their news.
  • the method according to the invention makes it possible to identify articles which influence the relevance value and which form the origin of a message chain.
  • Another application example concerns the ranking of working papers within an organization.
  • One embodiment of the invention provides that if the similarity of the text to a text in the inventory, which is determined in the context of an individual comparison, exceeds a threshold value, the similarity value is incremented by an additional value.
  • the similarity value incremented by the additional value is assigned at least to that of the two texts of the individual comparison carried out that was published at an earlier point in time. Variants are provided that only the similarity value of the earlier published document is incremented or that the similarity value of the earlier published document is incremented more strongly than the other similarity value. It can further be provided that a higher similarity results in a higher increment.
  • This configuration increases the relevance value of texts that have a strong similarity to other texts, while texts with a low similarity are less important. This creates a cluster of relevant texts, and it becomes easier to determine the relevant texts in the case of a large number of texts and to rank them among each other.
  • a similar result can be achieved if the similarity value is set to zero whenever the similarity of the text with a text in the inventory, which is determined in the context of an individual comparison, falls below a threshold value, i.e. the result of this individual comparison is not included in the determination of the relevance value.
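  • Both threshold variants can be expressed as a small adjustment applied to each similarity value before it is assigned. The function and parameter names below are hypothetical, and the fixed additional value is only one option (the text also allows an increment that grows with the similarity).

```python
from typing import Optional


def adjust_similarity(s: float,
                      bonus_threshold: Optional[float] = None,
                      bonus: float = 0.0,
                      cutoff: Optional[float] = None) -> float:
    """Apply the threshold rules to one similarity value.

    cutoff:          values below it are set to zero (comparison is ignored)
    bonus_threshold: values above it are incremented by `bonus`
    """
    if cutoff is not None and s < cutoff:
        return 0.0
    if bonus_threshold is not None and s > bonus_threshold:
        return s + bonus
    return s
```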
  • One possible application is, for example, the tracking of messages, in particular so-called “fake news” on the Internet.
  • a detected cluster formation can make it clear that “fake news” may have been disseminated by a group of users in related texts within a short period of time.
  • Another exemplary embodiment provides that the determined relevance value of a text is divided by the number of texts in the inventory or by the number of texts in the inventory for which the respective individual comparison resulted in a similarity value other than zero, or that the relevance value is modified by this number in some other way.
  • a relevance value modified in this way is essentially independent of the number of texts in the inventory.
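  • A sketch of this normalisation, assuming the relevance value is the plain sum of the assigned similarity values. The function name and the two-list interface (`raw` for the similarity value of each individual comparison, `assigned` for the share credited to the text under consideration) are choices made for the example.

```python
from typing import Sequence


def normalised_relevance(raw: Sequence[float],
                         assigned: Sequence[float],
                         nonzero_only: bool = False) -> float:
    """Divide the summed relevance by the number of compared texts.

    With nonzero_only=True, only comparisons whose similarity value is
    other than zero are counted in the divisor.
    """
    count = sum(1 for s in raw if s != 0) if nonzero_only else len(raw)
    return sum(assigned) / count if count else 0.0
```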
  • the method according to the invention is fundamentally transparent with regard to the method with which the similarity of two texts is determined. In principle, any method can be used for this purpose.
  • Determining a keyword relevance value for each of the identified keywords which indicates the relevance of the keyword in the respective text considered, and deriving a similarity value from the number of keywords that match in the two texts and the keyword relevance values assigned to these keywords.
  • Keywords are thus determined in the two texts, these are weighted to form a keyword relevance value and a similarity value is derived from the matching keywords and their weightings.
  • a keyword in the sense of the present invention can be a large number of entities.
  • keywords of a text are the names and / or the nouns that are contained in the text.
  • the names and nouns of a text are determined as keywords.
  • other characterizing words of a text can also be used. It can be provided that the stems of the respective names and nouns or other keywords are considered as keywords, which are also referred to as “lemmas” in the following. Examples are:
  • Example Lemma 1: Lemma(went) = go
  • Example Lemma 2: Lemma(houses) = house
  • Example Lemma 3: Lemma(went) = go
  • the lemmas are used as keywords.
  • keywords of a text are n-grams of the respective text.
  • An n-gram is a partial sequence of letters of a word or of several consecutive words.
  • An example is: The German word 'Schach' (chess) contains the 3-grams 'sch', 'cha', 'hac' and 'ach'. N-grams can also run across word boundaries.
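  • Character n-grams of this kind can be extracted with a sliding window. The sketch below lower-cases the input (an assumption) and uses the German word 'Schach' (chess) as input; because the window simply moves over the whole string, n-grams also run across word boundaries.

```python
from typing import List


def char_ngrams(text: str, n: int = 3) -> List[str]:
    """Return all consecutive character sequences of length n."""
    s = text.lower()
    return [s[i:i + n] for i in range(len(s) - n + 1)]


print(char_ngrams("Schach"))  # ['sch', 'cha', 'hac', 'ach']
```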
  • keywords do not necessarily have to appear in the text itself. If this is not the case, a keyword is a characterizing feature of the text that can be derived from or assigned to it.
  • An example of a characterizing feature in this sense is as follows:
  • Example characteristic 1 A text about the company Apple can be characterized with the characteristic "Personal Computer", although this term does not appear in the text.
  • n-grams are also features in the sense mentioned.
  • Characterizing features are stored in a database, for example, together with keywords or text parts or complete texts, so that they can be clearly assigned to a text.
  • One embodiment of the invention provides that the frequency of the keywords in the respective text is determined as the keyword relevance, a frequency value being assigned to each keyword as the keyword relevance value. Accordingly, for this case, the similarity value is determined from the number of keywords that match in the two texts and the frequency values assigned to the respective keywords.
  • the tf-idf measure is used as the keyword relevance value, the keyword relevance value being equal to the product of a frequency value assigned to the respective keyword with the inverse text frequency in the texts of the inventory.
  • the tf-idf measure is basically known.
  • the component "tf” indicates the search word density or frequency of occurrence in the text under consideration.
  • the “idf” component denotes the inverse document frequency, which indicates the specificity of a keyword for the total amount of texts in the inventory considered. This is based on the consideration that a matching occurrence of rare terms is more meaningful for the relevance and accordingly increases the similarity value and thus the relevance value more.
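  • A tf-idf keyword relevance value can be computed as in the following sketch. The raw count as "tf" and the logarithmic form of "idf" are common conventions assumed here; the text only states that the value is the product of a frequency value and the inverse text frequency. The function name and the list-based inputs are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List


def tf_idf_profile(keywords: List[str],
                   inventory_keywords: List[List[str]]) -> Dict[str, float]:
    """Map each keyword of a text to frequency * inverse text frequency.

    keywords:           keywords of the text, with repetitions
    inventory_keywords: keyword lists of all texts in the inventory
    """
    n_texts = len(inventory_keywords)
    profile = {}
    for keyword, freq in Counter(keywords).items():
        # Number of inventory texts that contain the keyword at least once.
        df = sum(1 for doc in inventory_keywords if keyword in doc)
        idf = math.log(n_texts / df) if df else 0.0
        profile[keyword] = freq * idf
    return profile
```

  • A keyword that occurs in every text of the inventory receives the value zero under this convention, which reflects the idea that only rare terms are specific enough to raise the similarity value.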
  • a standardized profile of the respective text is formed by using standardized keyword relevance values as keyword relevance values, which are generated from the keyword relevance values by dividing with a normalization factor. It is provided, for example, that the normalization factor is equal to the maximum keyword relevance value that occurs in a text under consideration (e.g. equal to the highest frequency value of the text). There are numerous methods of normalization that can be used.
  • one embodiment provides that a similarity value for the two texts of an individual comparison is derived from the number of matching keywords and the standardized keyword relevance values assigned to the respective keywords by determining the sum of the mean values of the standardized keyword relevance values of the matching keywords. If the keyword relevance values are frequency values, for example, the similarity value is determined from the sum of the mean values of the normalized frequency values of the matching keywords.
  • a further embodiment variant provides that a filtered profile of the respective text is formed from the standardized profile of the respective text by using filtered keyword relevance values as keyword relevance values, which are formed by filtering the normalized keyword relevance values with a threshold value.
  • the normalized keyword relevance value is only retained if it is above the threshold value and is otherwise set to zero.
  • one embodiment provides that a similarity value for the two texts of an individual comparison is derived from the number of matching keywords and the filtered keyword relevance values assigned to the respective keywords by determining the sum of the mean values of the filtered keyword relevance values of the matching keywords. Since the filtered keyword relevance values are set to zero if they are below the threshold, only those keywords that are present in both texts with high relevance are included in the similarity value.
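  • The standardized profile, the filtered profile and the similarity value derived from them can be sketched as follows. The function names are hypothetical, and treating a keyword whose filtered value is zero in either text as non-matching is an interpretation of the statement that only keywords with high relevance in both texts are included.

```python
from typing import Dict

Profile = Dict[str, float]  # keyword -> keyword relevance value


def normalise(profile: Profile) -> Profile:
    """Divide every value by the highest value that occurs in the text."""
    peak = max(profile.values(), default=0.0)
    return {k: v / peak for k, v in profile.items()} if peak else dict(profile)


def filter_profile(profile: Profile, threshold: float) -> Profile:
    """Keep a value only if it is above the threshold, otherwise set it to zero."""
    return {k: (v if v > threshold else 0.0) for k, v in profile.items()}


def profile_similarity(p: Profile, q: Profile) -> float:
    """Sum of the mean values of the matching keywords' relevance values."""
    matching = [k for k in p.keys() & q.keys() if p[k] > 0 and q[k] > 0]
    return sum((p[k] + q[k]) / 2 for k in matching)


p = normalise({"apple": 4, "mac": 2, "pear": 1})   # apple 1.0, mac 0.5, pear 0.25
q = normalise({"apple": 2, "mac": 2, "tree": 1})   # apple 1.0, mac 1.0, tree 0.5
```

  • For these two profiles the unfiltered similarity is 1.75 (apple and mac match); after filtering both profiles with a threshold of 0.5 only apple remains, giving 1.0.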
  • if a keyword is a characterizing feature of a text that can be derived from or assigned to it without appearing in the text itself, it can be provided that the relevance of this keyword, i.e. its keyword relevance value, is or has been determined externally. The relevance can be based, for example, on the importance of the keyword. If, for example, the name of its author is assigned as a keyword as the characterizing feature of a text, then it can be provided that this keyword always has a high keyword relevance value of, for example, 1.0 or 0.75 after normalization.
  • a further embodiment of the invention provides that the texts of the inventory are stored in a database, with at least the following being stored for each text: the time at which the text was first published and a profile of the text.
  • the profile of the text was created by capturing keywords of the respective text and by determining a keyword relevance value for each of the identified keywords, which indicates the relevance of the keyword in the text. This can be done in the manner described.
  • standardized keyword relevance values and / or filtered keyword relevance values can be stored in the database as keyword relevance values.
  • the profile is retrieved from the inventory and a similarity value is formed from the number of matching keywords and the keyword relevance values assigned to the respective keywords.
  • the present invention includes both embodiments in which the text whose relevance is to be determined is part of the stock of texts, and configurations in which the text whose relevance is to be determined is not part of the stock of texts.
  • the profile of the text whose relevance is to be determined is already stored in the database so that it can be called up from the database, just like the profiles of the other texts in the inventory.
  • a profile of this text is generated and stored in the database together with the time at which the text was first published.
  • Another embodiment provides that the method is applied to all texts in the inventory, a relevance value being determined for each text in the inventory.
  • the relevance values can be saved together with the texts in the database so that they can be called up immediately.
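  • When the method is applied to all texts of an inventory, each pair only needs to be compared once, because the similarity value is symmetrical and is credited to the earlier text. The sketch below assumes the earlier-only variant; `srank_all`, its dictionary inputs and the rule that texts with identical time stamps credit neither side are assumptions.

```python
from itertools import combinations
from typing import Callable, Dict, Hashable


def srank_all(times: Dict[Hashable, float],
              similarity: Callable[[Hashable, Hashable], float]) -> Dict[Hashable, float]:
    """Compute a relevance value for every text id in `times`.

    times:      text id -> time stamp (any comparable value)
    similarity: symmetric similarity of two texts, looked up by id
    """
    sranks = {text_id: 0.0 for text_id in times}
    for a, b in combinations(times, 2):  # each pair exactly once
        s = similarity(a, b)
        if times[a] < times[b]:
            sranks[a] += s
        elif times[b] < times[a]:
            sranks[b] += s
        # identical time stamps: neither text is credited (assumption)
    return sranks
```

  • The resulting dictionary can then be written back next to the stored profiles so that the values can be called up immediately.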
  • text in the sense of the present invention is understood to mean any sequence of words that are separated by one or more separators (blank, period, comma, etc.) or the separation of which results from the sense of the text.
  • a text within the meaning of the present invention can thus also be a text part (a paragraph or a chapter) of a more extensive document, for example an article or a book.
  • Such a text part is also a text within the meaning of the present invention.
  • a relevance value can be determined in advance on the basis of the entire document, which comprises several text parts. Then the entire document is the text in the sense of the invention.
  • the point in time at which a text was published for the first time is defined by a time stamp of the text. This is, for example, the date the text was created.
  • the point in time at which a text was first captured by an acquisition system is defined, for example, by the point in time at which the text was first captured by a web crawler.
  • the invention relates to a computer program with program code for performing the method steps according to claim 1 when the computer program is executed in a computer.
  • the computer software can be developed in such a way that it realizes all variants of the invention according to claims 1 to 22 in connection with a processor or computer.
  • the invention relates to a method for determining the relevance of a text, which has the following steps:
  • Another aspect of the invention relates to a computer system for determining the relevance of a text, which has:
  • Means for determining the similarity of the text to texts in a collection, the text being compared with each of the texts in the collection as part of an individual comparison, determining a similarity value which indicates the similarity between the two texts in each case,
  • Means for assigning the similarity value to at least that of the two texts of the respective individual comparison carried out that was published at an earlier point in time, the means being designed so that either the similarity value of an individual comparison is only assigned to the text that was published or recorded at an earlier point in time, or the similarity value of an individual comparison is assigned to both texts of the respective individual comparison, with a lower weighting for the text that was published or recorded at a later point in time
  • the means comprise, for example, a non-transitory computer-readable storage medium which stores instructions for the operation of the computer system, wherein the instructions, when executed by one or more processors of the computer system, cause the one or more processors to perform operations in the computer system that fulfill the functions provided for the means mentioned.
  • the computer system interacts with a database which holds a stock of texts, for each of which at least the point in time at which the text was first published and a profile are stored.
  • the profile was created on the basis of keywords of the respective text and keyword relevance values assigned to them.
  • the means for determining the similarity of the text to texts in the inventory determine the similarity on the basis of the profiles stored in the database.
  • FIG. 1 shows a communication infrastructure which is suitable for carrying out a method for determining the relevance of a text
  • FIG. 2 shows a flow chart of a method for determining the relevance of a text
  • FIG. 3 shows an exemplary embodiment of a method for creating a standardized profile of a text in which keywords of the text are recorded and a standardized and filtered keyword relevance value is assigned to each keyword;
  • FIG. 4 shows an exemplary embodiment of a method for determining the similarity of two texts, each of which has been assigned a profile according to FIG matching keywords is determined;
  • FIG. 5 shows a flowchart of an exemplary embodiment of a search with a profile and a ranking of the hit list using a relevance value determined according to the invention
  • FIG. 6 shows a flowchart of an exemplary embodiment of the initial generation of the relevance values for all documents of an existing set of documents.
  • FIG. 7 shows a flow diagram of an exemplary embodiment of the introduction of a new document into an existing set of documents, including an update of the relevance values of all documents.
  • FIG. 1 shows a communication infrastructure which has a plurality of communication end systems Ni, Nj and a computing unit Z1.
  • the communication end systems Ni, Nj can be operated by users (not shown) and / or act autonomously. They can be connected to the processing unit Z1 via at least one communication connection, such as a telecommunication connection and / or a computer connection, for example via the Internet or an intranet.
  • the communication end systems Ni, Nj are designed, for example, as a PC, laptop, tablet computer or smartphone.
  • the computing unit Z1 can communicate with a large number of users or communication end systems Ni, Nj. It is formed, for example, by a server on the Internet.
  • the computing unit Z1 is assigned a memory unit S1 which comprises a non-volatile memory.
  • the users or communication terminals Ni, Nj used by them create or identify texts or documents D1 and send them to the processing unit Z1.
  • the terms “text” and “document” are used as synonyms in the following (although situations are also conceivable in which a text is only part of a document).
  • the computing unit Z1 creates a profile for each of the received texts D1 and stores this together with the texts D1 in the storage unit S1. Alternatively, only the profiles are saved.
  • the processing unit Z1 acts as a web crawler and automatically searches or crawls the Internet or an intranet to search for and identify texts. Depending on the application, the search can be limited to a certain type of text, e.g. news texts or texts on a specific technical, scientific or political topic.
  • the information stored in relation to a text in the storage unit S1 includes at least the following information: the point in time at which the text was first published or recorded for the first time by an acquisition system, as well as a profile of the text.
  • the documents D1 each have a time stamp which indicates when the documents were first published or recorded by a recording system.
  • the time stamp can be assigned directly to the documents, for example in the form of metadata of the document, so that this information can easily be recorded in this case and is entered in the storage unit S1.
  • the computing unit Z1 automatically searches the Internet and evaluates data from which the point in time results.
  • the point in time is entered via a communication interface by a user via a communication end system Ni, Nj.
  • the point in time can include the date and time of day on the date the text was first published or captured. If the time of day cannot be determined, the time contains at least the date.
  • the profile of the text includes keywords of the respective text, as well as keyword relevance values for the keywords of the text, the keyword relevance value indicating the relevance of the respective keyword in the respective text, as will be explained in more detail.
  • the profile can also contain further information on the respective text, for example author, publisher, etc.
  • the method for determining the relevance of a text D1 proceeds in such a way that a specific text D1 is compared with further texts Di which, or whose profiles, are stored in the memory unit S1.
  • the text D1 can, for example, have been transmitted to the computing unit Z1 by a user via a communication end system Ni, Nj and a data transmission method.
  • the text is only identified by the user without being sent, the text including its profile already being contained in the storage unit S1. It is also conceivable that the method is carried out automatically by the computing unit Z1 for each text that it detects or crawls.
  • the arithmetic unit Z1 determines a relevance value of the text D1, which is also referred to below as the SRank value or simply as the SRank, by individual comparisons with texts Di of a stock that is stored in the storage unit S1.
  • the method used for this purpose is explained schematically below with reference to FIG.
  • In a first step 201, the similarity of the text to texts in an inventory is determined.
  • the text is compared with each of the texts in the inventory as part of an individual comparison, determining a similarity value that indicates the similarity between the two texts in each case.
  • the texts of the inventory are stored in the storage unit S1.
  • the method for determining a similarity value can in principle be carried out in any way. An example of such a method is explained with reference to FIG.
  • the determined similarity value is assigned to at least that of the two texts of the respective individual comparison carried out which was published at an earlier point in time or which was recorded for the first time by an acquisition system at an earlier point in time.
  • embodiment variants can provide that the similarity value of an individual comparison is only assigned to the text that was published or recorded at an earlier point in time. This can mean that if the viewed text was published or first recorded later than the comparison text, no similarity value or the similarity value zero is assigned to it.
  • An alternative provides in step 203 that the similarity value of an individual comparison is assigned with greater weight to the text that was published or recorded at an earlier point in time.
  • the similarity values determined in the individual comparisons, which were assigned to the text under consideration, are added to form a relevance value or SRank.
  • the size of the relevance value indicates the relevance of the text. Adding the similarity values to a relevance value SRank is only to be understood as an example of deriving the relevance value from the similarity values.
  • The ascertained SRank of the checked text D1 is stored together with or as a component of the ascertained profile of the text D1 in the storage unit S1. If the profile of the text D1 was already contained in the storage unit S1, only the SRank is additionally stored as part of the profile. Furthermore, the SRank of the document D1 under consideration can, if necessary, be transmitted to a communication end system Ni, Nj, as shown in FIG. 1. This can be done with or without the document D1.
  • one embodiment variant provides that a text D1 is first sent from a communication end system Ni, Nj to the computing unit Z1, the computing unit Z1 managing a set of profiled texts that are stored in the storage unit S1.
  • the triggered request causes the processing unit Z1 to check whether the text D1 received is contained in the inventory. If this is not the case, a profile is created for the text D1 and stored in the storage unit S1 together with the point in time at which the text was first published or made accessible. Otherwise, the information already stored can be used.
  • The arithmetic unit Z1 is now prompted (by the query that has been made) to determine the similarity of the text D1 to the texts in the inventory. Using the method explained with reference to FIG. 2, the text D1 is compared with each of the texts Di in the inventory in an individual comparison, each of which yields a similarity value that indicates the similarity between the two texts.
  • the determined or stored profile is retrieved for two texts in each case and a similarity value is formed from the number of matching keywords and the keyword relevance values assigned to these keywords.
  • The similarity values that were determined in the individual comparisons and assigned to the text D1 are added to form a relevance value (the SRank), the size of which indicates the relevance of the text D1.
  • The determined relevance value can be stored and/or provided to the requesting or another communication end system Ni, Nj.
  • a first modification provides that it is not the text D1, but rather information that uniquely identifies this text, that is transmitted to the processing unit Z1.
  • A further modification provides that the relevance values (SRanks) for all documents are already available in the storage unit S1 or have already been calculated by the computing unit Z1, so that, in response to a request for the relevance of a document, only the SRank stored in the storage unit S1 has to be reported.
  • FIG. 3 shows by way of example how a profile is created for a given text, which profile serves as the basis for determining a similarity value when comparing the text with another text.
  • the relevant point in time for the example text D2 is 04/10/2019.
  • keywords of the text D2 are identified and extracted.
  • all names and nouns are considered as keywords in the text:
  • In a second step 302, the frequencies of the keywords contained in the text D2 are determined and assigned to the keywords as relevance. This creates a raw profile E2 with frequencies that represent keyword relevance values:
  • In a third step 303, the keyword relevance values are normalized.
  • A standardized profile E3 is created with standardized frequencies that represent standardized keyword relevance values.
  • In a fourth step 304, the standardized keyword relevance values are filtered.
  • A filtered profile E4 is created. The filtering takes place through a comparison with a threshold value, which in the example under consideration is 0.6. Normalized keyword relevance values that are below the threshold value are omitted.
  • keywords of the text are assigned the frequency with which they occur in the text as relevance.
  • the corresponding keyword relevance values are normalized and filtered with a threshold value.
  • the threshold value of 0.6 given in the above example is only to be understood as an example. In principle, the threshold value can be anywhere in the range above 0 and below 1.
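The profile creation of steps 302 to 304 can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function name `build_profile` is hypothetical, the keyword extraction of step 301 (names and nouns) is assumed to have happened already, normalization by the highest frequency is inferred from the document's worked examples (frequency 2 becomes 1, frequency 1 becomes 0.5), and values below the threshold are dropped as in those examples.

```python
from collections import Counter


def build_profile(keywords, threshold=0.6):
    """Build a normalized, filtered keyword profile.

    `keywords` is the list of keyword occurrences already extracted from
    the text (step 301); the extraction itself is not shown here.
    """
    # Step 302: raw profile with the frequency of each keyword.
    raw = Counter(keywords)
    if not raw:
        return {}
    # Step 303: normalize by the highest frequency (assumption inferred
    # from the worked examples in the description).
    top = max(raw.values())
    normalized = {word: count / top for word, count in raw.items()}
    # Step 304: drop values below the threshold (assumption: a value equal
    # to the threshold is kept).
    return {word: value for word, value in normalized.items() if value >= threshold}


# Keyword occurrences of the example text D5:
# "Microsoft is a company. Microsoft is based in Seattle."
profile_d5 = build_profile(["Microsoft", "company", "Microsoft", "Seattle"])
print(profile_d5)  # {'Microsoft': 1.0}
```

With the default threshold of 0.6, only "Microsoft" survives the filter, which matches the filtered profile given for D5 in the description.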
  • A first text D6 is then provided, from which a standardized and filtered profile A was formed in the manner described in FIG. 3, which has the following keywords and keyword relevance values: Keywords: iPad, Apple, house, table. Assigned keyword relevance values: 1.0, 0.8, 0.8, 0.6.
  • A second text D7 is provided, from which a standardized and filtered profile B was formed in the manner described in FIG. 3, which has the following keywords and keyword relevance values: Keywords: house, bed, door, iPad. Assigned keyword relevance values: 1.0, 0.8, 0.6, 0.4.
  • the similarity or the similarity value is determined from the two profiles A, B from the matching keywords and the filtered keyword relevance values assigned to the respective keywords by determining the sum of the mean values of the filtered keyword relevance values of the matching keywords.
  • In a first step 401, the keywords that the two profiles A, B have in common are determined.
  • These are the keyword “iPad”, which is contained in text D6 with a filtered keyword relevance value of 1.0 and in text D7 with a filtered keyword relevance value of 0.4, and the keyword “house”, which is contained in text D6 with a filtered keyword relevance value of 0.8 and in text D7 with a filtered keyword relevance value of 1.0; see intermediate result M1 in FIG. 4, which shows the profile matches.
  • In step 402, the similarity value S is determined as the sum, over the matching keywords, of the mean values of their keyword relevance values; see calculation M2 in FIG. 4. In the example under consideration, this gives (1.0 + 0.4) / 2 + (0.8 + 1.0) / 2 = 0.7 + 0.9 = 1.6 as the similarity value S.
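The calculation of steps 401 and 402 can be reproduced with a few lines of code. The sketch below assumes profiles are stored as dictionaries mapping keywords to filtered relevance values; the function name `similarity` and the consistent lowercase spelling of "house" are illustrative choices, not taken from the source.

```python
def similarity(profile_a, profile_b):
    """Sum of the mean keyword relevance values over the matching keywords."""
    # Step 401: keywords contained in both profiles.
    matches = profile_a.keys() & profile_b.keys()
    # Step 402: add up the mean of the two relevance values per match.
    return sum((profile_a[k] + profile_b[k]) / 2 for k in matches)


profile_a = {"iPad": 1.0, "Apple": 0.8, "house": 0.8, "table": 0.6}  # text D6
profile_b = {"house": 1.0, "bed": 0.8, "door": 0.6, "iPad": 0.4}     # text D7
print(round(similarity(profile_a, profile_b), 2))  # 1.6
```

Profiles without any common keyword yield a similarity value of zero, so such pairs never contribute to a relevance value.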
  • a similarity value is only assigned to the document that was published at an earlier point in time or was recorded for the first time by an acquisition system at an earlier point in time. If, for example, the document D7 is younger than the document D6 in this sense, then only the document D6 is assigned a similarity value, which is included in the subsequent calculation of a relevance value. If, on the other hand, the document D6 is younger than the document D7, it is only assigned the similarity value zero.
  • A similarity value is assigned with a stronger weighting to the text of the individual comparison that was published at an earlier point in time or was recorded for the first time by an acquisition system at an earlier point in time.
  • the determined similarity value of the individual comparison is weighted with a factor of 2 for the older document and a factor of 0.5 for the more recent document.
  • a further variant of this embodiment provides that the similarity value determined in an individual comparison is compared with a threshold value and, in the event that the similarity value is above the threshold value, the similarity value is incremented by an additional value.
  • This incremented similarity value is then assigned at least to the older of the two texts.
  • the incrementation can take place by a factor or by a summand.
  • A similarity value of 1.6 + 1.5 = 3.1 results for document D6.
  • Another embodiment variant of this provides that the similarity value determined in an individual comparison is compared with a threshold value and, in the event that the similarity value is below the threshold value, the similarity value is set to zero. This again filters out documents for which the determined similarity value is below a predefined threshold of, for example, 0.5.
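The weighting and threshold variants described above can be sketched together as follows. This is an illustration under assumptions: the function names `adjust_similarity` and `assign` and all parameter names are hypothetical, and the value 1.0 used as the increment threshold in the usage example is invented, since the description does not state it. The factors 2 and 0.5, the summand 1.5 and the cutoff 0.5 are the example values from the text.

```python
def adjust_similarity(s, boost_threshold=None, boost=1.5, cutoff=None):
    """Apply the optional threshold variants to a raw similarity value."""
    if cutoff is not None and s < cutoff:
        # Variant: values below the threshold are set to zero.
        return 0.0
    if boost_threshold is not None and s > boost_threshold:
        # Variant: values above the threshold are incremented by a summand.
        return s + boost
    return s


def assign(s, older_weight=2.0, newer_weight=0.5):
    """Return (share for the older text, share for the more recent text)."""
    return s * older_weight, s * newer_weight


print(round(adjust_similarity(1.6, boost_threshold=1.0), 2))  # 3.1
print(adjust_similarity(0.3, cutoff=0.5))                      # 0.0
print(assign(1.6))                                             # (3.2, 0.8)
```

Keeping the adjustments separate from the raw similarity calculation makes it possible to combine or switch the variants without touching the profile comparison itself.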
  • a profile is created for each of the other texts D3, D4, D5 which contains the keywords and the standardized and filtered keyword relevance values.
  • the iPad is an innovative product.” This results in a profile with frequencies: Apple - 1; iPad - 2; Product - 1.
  • The standardized profile is: iPad - 1; Apple - 0.5; Product - 0.5.
  • the filtered profile with the threshold value 0.6 results in: iPad - 1.
  • The text D5, first published on April 1st, 2019, reads: “Microsoft is a company. Microsoft is based in Seattle.” This results in a profile with frequencies: Microsoft - 2; Seattle - 1; Company - 1.
  • the standardized profile is: Microsoft - 1; Company - 0.5; Seattle - 0.5.
  • the filtered profile is: Microsoft - 1.
  • SRank (D2) = 2.83
  • Documents D3 and D4 are not given a similarity value from the similarity to document D2, since they were published later.
  • the determinations of the similarity values can be varied as explained with reference to FIG. and / or by incrementing a similarity value if it exceeds a threshold value.
  • document D2 has a significantly higher relevance value than documents D3, D4 and D5.
  • Document D3 is still somewhat more relevant than documents D4 and D5.
  • Document D4 has no relevance value because it appeared later than the other similar documents.
  • Document D5 has no relevance value because it is not similar to any other document.
  • the stems or lemmas of the names and nouns can also be extracted instead of the names and nouns.
  • The keywords can also be determined in a way other than as names and nouns, for example as n-grams of the text.
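As a small illustration of the n-gram alternative, the sketch below derives word n-grams from a text. It is an assumption that word n-grams (rather than character n-grams) are meant, since the description does not say; the function name `word_ngrams` and the lowercasing are likewise illustrative.

```python
def word_ngrams(text, n=2):
    """Return the word n-grams of a text as candidate keywords."""
    words = text.lower().split()
    # Slide a window of n words across the text.
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


print(word_ngrams("Microsoft is based in Seattle", n=2))
# ['microsoft is', 'is based', 'based in', 'in seattle']
```

The resulting list can take the place of the extracted names and nouns when the keyword frequencies are counted.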
  • the procedural variant considers the classification of a new document in a given inventory with successive determination of the SRank.
  • An index and an inverse index exist for the documents in the inventory, the index assigning certain keywords to a document and the inverse index making it possible to identify the documents that contain a certain keyword.
  • the storage unit S1 in FIG. 1 has an index and an inverse index of the documents it contains. It is also assumed that every document in the inventory already has a SRank. The document to be re-classified, on the other hand, does not yet have a SRank.
  • When determining the relevance value SRank of the newly classified document N, embodiment variants can provide that only those similarity values from the individual comparisons are added, or contribute to determining the relevance value, for which the similarity value between the newly classified document N and the document of the inventory exceeds a specified threshold value, for example a threshold value of 0.5. This yields a set of hits on the basis of which the SRank is determined.
  • Said threshold value can represent an additional threshold value which is used in addition to the threshold value with which the standardized keyword relevance values are filtered when determining the similarity value.
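The classification of a new document N with the help of an inverse index and a similarity threshold can be sketched as follows. All names (`build_inverse_index`, `srank_of_new_document`, `min_similarity`) are hypothetical, the profile of N and the publication dates are invented for the example, and the summation over hits for which N is the older text is one possible reading of the description rather than a confirmed implementation.

```python
from collections import defaultdict
from datetime import date


def build_inverse_index(index):
    """Invert a document -> keywords index into keyword -> documents."""
    inverse = defaultdict(set)
    for doc_id, profile in index.items():
        for keyword in profile:
            inverse[keyword].add(doc_id)
    return inverse


def similarity(a, b):
    """Sum of the mean relevance values of the matching keywords."""
    return sum((a[k] + b[k]) / 2 for k in a.keys() & b.keys())


def srank_of_new_document(profile_n, time_n, index, times, inverse, min_similarity=0.5):
    """Add up the similarity values of all hits for which N is the older text."""
    # Only documents sharing at least one keyword with N can be similar.
    candidates = set()
    for keyword in profile_n:
        candidates |= inverse.get(keyword, set())
    srank = 0.0
    for doc_id in candidates:
        s = similarity(profile_n, index[doc_id])
        # A comparison counts only if it exceeds the threshold and N is older.
        if s > min_similarity and time_n < times[doc_id]:
            srank += s
    return srank


index = {
    "D6": {"iPad": 1.0, "Apple": 0.8, "house": 0.8, "table": 0.6},
    "D7": {"house": 1.0, "bed": 0.8, "door": 0.6, "iPad": 0.4},
}
times = {"D6": date(2019, 5, 1), "D7": date(2019, 6, 1)}  # hypothetical dates
inverse = build_inverse_index(index)
profile_n = {"iPad": 1.0, "house": 0.6}                   # hypothetical profile of N
print(round(srank_of_new_document(profile_n, date(2019, 4, 1), index, times, inverse), 2))  # 3.2
```

If N is more recent than every document in the inventory, this function returns zero, and the comparisons would instead raise the relevance values of the older inventory documents.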
  • the method explained on the basis of exemplary embodiments enables the processing and comparison of normal-language texts in an effective manner, with paraphrased texts also being able to be compared with one another for similarity.
  • a similarity determination between two texts can also be carried out in texts of different lengths. For example, short search phrases can be compared with extensive texts. Since many texts contain names and technical terms, the described determination of similarity values and a relevance value can also be carried out across languages.
  • a foreign-language text is first translated by means of a computer translation into the language of the text to which a similarity is to be determined.
  • FIG. 5 shows a flow chart of a search with a profile and a ranking of the hit list using a relevance value determined according to the invention.
  • the procedure consists of the following steps:
  • Step 501 A search document / text D1 is given.
  • Step 502 The profile of D1 is generated, for example in accordance with the method in FIG. 3.
  • Step 503 D1_Profil is used to search for documents Di in the inventory. This takes place, for example, according to the method of FIGS. 2 and 4. According to the method according to FIG. 4, a similarity value S is determined for two documents in each case. On the basis of the similarity values and time stamps of the documents, the relevance value SRank for document D1 is calculated according to FIG. It can be provided that in the inventory with the documents Di such documents are searched which have at least one keyword matching the profile of D1. For this purpose, for example, the matching documents are identified in an inverse document index for each keyword in the profile of D1. If there is no matching keyword, the similarity value between document D1 and the respective further document is zero, so that these documents do not need to be taken into account.
  • All matching documents (which therefore have at least one matching keyword with the profile of D1) result in a hit list T_List.
  • The relevance value SRank is calculated in the manner mentioned for all documents in the hit list.
  • Step 504 The documents in the hit list are sorted according to the SRank.
  • a sorting takes place taking into account further criteria. For example, a document can initially be ranked higher in the hit list, the more similar it is to the search document.
  • the SRank can then be taken into account in this ranking as a further criterion for the ranking list, for example by additionally weighting the search results with the SRank.
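The two ranking options described above can be sketched as follows. The function names and the combined score `similarity * (1 + SRank)` are assumptions made for illustration; the description does not specify how the search results are weighted with the SRank. Apart from SRank (D2) = 2.83, all numbers in the example are invented.

```python
def rank_by_srank(hits, sranks):
    """Step 504: sort the hit list by SRank, highest first."""
    return sorted(hits, key=lambda doc_id: sranks.get(doc_id, 0.0), reverse=True)


def rank_combined(similarities, sranks):
    """Rank by similarity to the search document, weighted with the SRank.

    `similarities` maps each document of the hit list to its similarity
    value with respect to the search document.
    """
    def score(doc_id):
        # Hypothetical weighting: a higher SRank amplifies the similarity.
        return similarities[doc_id] * (1.0 + sranks.get(doc_id, 0.0))

    return sorted(similarities, key=score, reverse=True)


similarities = {"D2": 1.0, "D3": 3.0, "D4": 0.5}  # hypothetical values
sranks = {"D2": 2.83, "D3": 0.5, "D4": 0.0}       # D3 and D4 values are hypothetical
print(rank_by_srank(similarities, sranks))  # ['D2', 'D3', 'D4']
print(rank_combined(similarities, sranks))  # ['D3', 'D2', 'D4']
```

The example shows that the two orderings can differ: a document that is very similar to the search document can overtake a document with a higher SRank once both criteria are combined.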
  • FIG. 6 shows a flow diagram of the initial generation of the relevance values for all documents in an existing set of documents.
  • the procedure includes steps: Steps 601-602: initialization of the method.
  • Steps 603-608 These steps show the actual (incremental) determination of the SRank of two documents.
  • the similarity value S is determined in accordance with the exemplary embodiment in FIG. 4, for example.
  • the similarity value S can, however, also be determined in another way.
  • the similarity value is only assigned to the respectively older document which has an earlier time stamp or was recorded at an earlier point in time.
  • Step 608 applies exclusively to the special case, which does not occur in practice or occurs only extremely rarely, in which both documents under consideration are exactly the same age.
  • In steps 606-608, the terms “SRank (Di)” or “SRank (Dj)” are used in a simplified manner.
  • The relevance value SRank results from the sum of the assigned similarity values of the individual comparisons.
  • In steps 606-608, only an intermediate value of the relevance value SRank is thus specified, namely an intermediate value which takes into account the similarity values of the individual comparisons up to the document Dj.
  • Steps 609-611 These steps concern the organization of the loop. When a document Di has been compared with all the other documents, i and j are incremented by the value “1” in step 611.
  • Steps 602 and 609-611 can alternatively be replaced by the following construct: Carry out steps 603-608 for all i and j from {1, ..., n}, with the condition i < j.
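The initial generation of the relevance values for a whole inventory can be sketched as a loop over document pairs. The sketch assumes that each unordered pair is compared exactly once and that the similarity value goes only to the older document; how the equal-age case of step 608 is handled is not stated in this excerpt, so crediting both documents is an assumption. The function names and the dates of D6 and D7 are hypothetical.

```python
from datetime import date
from itertools import combinations


def similarity(a, b):
    """Sum of the mean relevance values of the matching keywords."""
    return sum((a[k] + b[k]) / 2 for k in a.keys() & b.keys())


def initial_sranks(profiles, times):
    """Compute the SRank of every document by pairwise comparison."""
    sranks = {doc_id: 0.0 for doc_id in profiles}
    # combinations() yields every unordered pair of documents exactly once.
    for di, dj in combinations(profiles, 2):
        s = similarity(profiles[di], profiles[dj])
        if times[di] < times[dj]:
            sranks[di] += s          # Di is older
        elif times[dj] < times[di]:
            sranks[dj] += s          # Dj is older
        else:
            # Equal age: assumption, credit both documents.
            sranks[di] += s
            sranks[dj] += s
    return sranks


profiles = {
    "D6": {"iPad": 1.0, "Apple": 0.8, "house": 0.8, "table": 0.6},
    "D7": {"house": 1.0, "bed": 0.8, "door": 0.6, "iPad": 0.4},
    "D5": {"Microsoft": 1.0},
}
# The dates of D6 and D7 are hypothetical; D5 uses the date from the text.
times = {"D6": date(2019, 4, 1), "D7": date(2019, 4, 10), "D5": date(2019, 4, 1)}
sranks = initial_sranks(profiles, times)
print({doc_id: round(value, 2) for doc_id, value in sranks.items()})
# {'D6': 1.6, 'D7': 0.0, 'D5': 0.0}
```

Each intermediate value in `sranks` corresponds to the partial sum mentioned in the description; the final SRank is available only after all pairs have been processed.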
  • FIG. 7 shows a flow chart of the introduction of a new document into an existing stock of documents including an update of the relevance values of all documents.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Library & Information Science (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to a method and a computer system for determining the relevance of a text. The method comprises the following steps: determining (201) the similarity of the text (D1, D2) to texts (Di) of an inventory, the text (D1, D2) being compared with each of the texts (Di) of the inventory in an individual comparison by determining a similarity value (S) that indicates the similarity between the two respective texts; assigning (202) the similarity value (S) to at least that one of the two texts of the individual comparison carried out which was published at an earlier point in time or which was first recorded at an earlier point in time by an acquisition system; calculating a relevance value (SRank) from the similarity values (S) that were determined in the individual comparisons and assigned to the text (D1, D2); and storing the calculated relevance value (SRank) and/or transmitting the calculated relevance value (SRank) via a computer network to a communication end system (Ni, Nj).
EP21717412.7A 2020-04-09 2021-04-07 Procédé et système informatique pour déterminer la pertinence d'un texte Pending EP4133384A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
DE102020109953.3A DE102020109953A1 (de) 2020-04-09 2020-04-09 Verfahren und Computersystem zur Bestimmung der Relevanz eines Textes
PCT/EP2021/059021 WO2021204849A1 (fr) 2020-04-09 2021-04-07 Procédé et système informatique pour déterminer la pertinence d'un texte

Publications (1)

Publication Number Publication Date
EP4133384A1 (fr) 2023-02-15

Family

ID=75438780

Family Applications (1)

Application Number Title Priority Date Filing Date
EP21717412.7A Pending EP4133384A1 (fr) 2020-04-09 2021-04-07 Procédé et système informatique pour déterminer la pertinence d'un texte

Country Status (4)

Country Link
US (1) US20230185837A1 (fr)
EP (1) EP4133384A1 (fr)
DE (1) DE102020109953A1 (fr)
WO (1) WO2021204849A1 (fr)

Family Cites Families (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8631006B1 (en) * 2005-04-14 2014-01-14 Google Inc. System and method for personalized snippet generation
US9292589B2 (en) * 2012-09-04 2016-03-22 Salesforce.Com, Inc. Identifying a topic for text using a database system
US9467409B2 (en) * 2013-06-04 2016-10-11 Yahoo! Inc. System and method for contextual mail recommendations
JP6099592B2 (ja) * 2014-03-27 2017-03-22 富士フイルム株式会社 類似症例検索装置及び類似症例検索プログラム
US10235783B2 (en) * 2016-12-22 2019-03-19 Huawei Technologies Co., Ltd. System and method for visualization of a compute workflow
CN108170684B (zh) * 2018-01-22 2020-06-05 京东方科技集团股份有限公司 文本相似度计算方法及系统、数据查询系统和计算机产品
CN109871428B (zh) * 2019-01-30 2022-02-18 北京百度网讯科技有限公司 用于确定文本相关度的方法、装置、设备和介质
US10997469B2 (en) * 2019-09-24 2021-05-04 Motorola Solutions, Inc. Method and system for facilitating improved training of a supervised machine learning process
US11429603B2 (en) * 2020-01-07 2022-08-30 Dell Products L.P. Using artificial intelligence and natural language processing for data collection in message oriented middleware frameworks

Also Published As

Publication number Publication date
WO2021204849A1 (fr) 2021-10-14
DE102020109953A1 (de) 2021-10-14
US20230185837A1 (en) 2023-06-15

Legal Events

Date Code Title Description
STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: UNKNOWN

STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: THE INTERNATIONAL PUBLICATION HAS BEEN MADE

PUAI Public reference made under article 153(3) epc to a published international application that has entered the european phase

Free format text: ORIGINAL CODE: 0009012

STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: REQUEST FOR EXAMINATION WAS MADE

17P Request for examination filed

Effective date: 20220920

AK Designated contracting states

Kind code of ref document: A1

Designated state(s): AL AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HR HU IE IS IT LI LT LU LV MC MK MT NL NO PL PT RO RS SE SI SK SM TR

DAV Request for validation of the european patent (deleted)
DAX Request for extension of the european patent (deleted)