WO2022165634A1 - Apparatus and method for type matching of a text sample - Google Patents

Apparatus and method for type matching of a text sample

Info

Publication number
WO2022165634A1
Authority
WO
WIPO (PCT)
Prior art keywords
scores
grams
gram
tokens
vector
Prior art date
Application number
PCT/CN2021/074859
Other languages
French (fr)
Inventor
Georgios Stoilos
Nikos PAPASARANTOPOULOS
Pavlos VOUGIOUKLIS
Patrik BANSKY
Yantao Jia
Original Assignee
Huawei Technologies Co., Ltd.
Priority date
Filing date
Publication date
Application filed by Huawei Technologies Co., Ltd. filed Critical Huawei Technologies Co., Ltd.
Priority to PCT/CN2021/074859
Publication of WO2022165634A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/216 Parsing using statistical methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/237 Lexical tools
    • G06F40/242 Dictionaries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis

Definitions

  • the present disclosure relates to information processing technology. More specifically, the present disclosure relates to an apparatus and method for type matching of a text sample.
  • Annotating text or text samples, such as documents, queries and the like with Knowledge Base resources like entities, relations, and types is an important step for many downstream tasks like semantic search and question answering.
  • Although entity and relation linking have been extensively studied in the past, less attention has been spent on type linking and matching even though the problem is challenging in many domains.
  • an apparatus for processing, in particular type matching a text sample is provided.
  • the text sample may, for instance, be a user search query and the apparatus based on the type matching may return one or more query results in response to the search query.
  • the apparatus is configured to parse a plurality of tokens (also referred to as terms) from the text sample, i.e. tokenize the text sample, and to generate a plurality of n-grams based on the plurality of tokens, wherein each n-gram comprises one or more of the plurality of tokens.
  • the apparatus is further configured to determine for each of the plurality of n-grams, using a term-based matching scheme, one or more first scores and one or more first candidate types from a pre-defined vocabulary of candidate types, wherein the one or more first scores are indicative of a semantic similarity between the n-gram and the one or more first candidate types.
  • the apparatus is further configured to determine for each of the plurality of n-grams, using a vector-based matching scheme, one or more second scores and one or more second candidate types from the pre-defined vocabulary of candidate types, wherein (like the one or more first scores) the one or more second scores are indicative of a semantic similarity between the n-gram and the one or more second candidate types.
  • the apparatus is further configured to determine for each of the plurality of n-grams, using a sequence tagging model, a third score, wherein the third score is indicative of the degree to which each n-gram represents a type.
  • the apparatus is configured to determine for each of the plurality of n-grams one or more composite scores based on the one or more first scores, the one or more second scores and the third score and to rank the plurality of n-grams based on the one or more composite scores.
  • One or more of the highest scoring n-grams and a list of the associated composite scores and candidate types may be provided as the output of the apparatus.
  • the sequence tagging model defines for each of the plurality of tokens of the text sample a probability that the respective token is a type, i.e. belongs to the category TYPE.
  • the apparatus is configured to determine for each of the plurality of n-grams, using the sequence tagging model, the third score as a weighted average of the probabilities that each of the one or more tokens of the n-gram is a type, i.e. belongs to the category TYPE.
  • the apparatus is configured to determine the weighted average of the probabilities that each of the one or more tokens of the n-gram is a type, i.e. belongs to the category TYPE by using a negative weight for those of the one or more tokens of the n-gram that are a non-type, i.e. belong to a category other than TYPE.
  • the apparatus is configured to determine for each of the plurality of n-grams, using the sequence tagging model, the third score based on those probabilities that each of the one or more tokens of the n-gram is a type, i.e. belongs to the category TYPE that exceed a threshold probability. For instance, the apparatus may be configured to determine the third score based on the probabilities of the three tokens having the largest probabilities.
  • the one or more composite scores are one or more weighted sums of the one or more first scores, the one or more second scores and the third score of each of the plurality of n-grams.
  • the apparatus is further configured to determine for each of the plurality of n-grams, using a learning-based matching scheme, one or more fourth scores and one or more fourth candidate types from the pre-defined vocabulary of candidate types, and to determine the one or more composite scores based on the one or more first scores, the one or more second scores, the third score and the one or more fourth scores, in particular as a weighted sum of these scores.
  • the apparatus, for determining the one or more fourth scores for each of the plurality of n-grams using the learning-based matching scheme, is configured to: determine an embedding vector for the text sample from the probabilities of each of its tokens using a first neural network; determine an embedding vector for the n-gram and a plurality of further embedding vectors for the plurality of candidate types from the pre-defined vocabulary of candidate types; and feed these embedding vectors into a second feed-forward neural network for determining the one or more fourth scores.
  • for determining the one or more first scores for each of the plurality of n-grams using the term-based matching scheme, the apparatus is further configured to normalize the one or more first scores for each of the n-grams based on the number of tokens of the respective n-gram.
  • for determining the one or more second scores for each of the plurality of n-grams using the vector-based matching scheme, the apparatus is configured to: determine an embedding vector for the n-gram; determine a plurality of similarity scores, e.g. cosine similarity scores, between the embedding vector for the n-gram and a plurality of further embedding vectors, wherein the plurality of further embedding vectors are based on the plurality of candidate types from the pre-defined dictionary of candidate types; and determine one or more of the largest similarity scores as the one or more second scores for the respective n-gram.
  • for each n-gram, the apparatus is configured to determine the embedding vector for the n-gram as a weighted sum of a plurality of token vectors, wherein each token vector is associated with one of the one or more tokens of the n-gram.
  • the text sample is a non-segmented text sample, i.e. in a non-segmented language, such as Chinese, Japanese, Thai and the like, wherein the apparatus is further configured to segment the text sample into the plurality of tokens.
  • the apparatus is further configured to: concatenate the tokens of the n-gram for obtaining a composite token comprising a continuous string of characters; retrieve a composite vector associated with the composite token from a dictionary of vectors; and determine the embedding vector as the average vector of the composite vector and the weighted sum of the plurality of token vectors.
  • where the text sample is a user search query, the apparatus is further configured to provide one or more query results based on one or more of the plurality of n-grams ranked based on the one or more composite scores.
  • a method for processing, in particular type matching, a text sample comprises the steps of: parsing a plurality of tokens from the text sample; generating a plurality of n-grams based on the plurality of tokens, wherein each n-gram comprises one or more of the plurality of tokens; determining the one or more first, second and third scores for each of the plurality of n-grams; determining the one or more composite scores; and ranking the plurality of n-grams based on the one or more composite scores.
  • the method according to the second aspect may further comprise the step of returning as output one or more of the highest scoring n-grams and a list of the associated composite scores and candidate types.
  • the type matching method according to the second aspect can be performed by the type matching apparatus according to the first aspect.
  • further features of the type matching method according to the second aspect result directly from the functionality of the type matching apparatus according to the first aspect as well as its different implementation forms and embodiments described above and below.
  • a computer program product comprising a non-transitory computer-readable storage medium for storing program code which causes a computer or a processor to perform the type matching method according to the second aspect, when the program code is executed by the computer or the processor.
  • Fig. 1 shows a schematic diagram of an apparatus for type matching of a text sample according to an embodiment
  • Fig. 2 shows a schematic diagram illustrating processing blocks implemented by an embedding engine of a type matching apparatus according to an embodiment
  • Fig. 3 shows a schematic diagram illustrating processing blocks implemented by a type matching apparatus according to an embodiment for determining one or more composite scores
  • Fig. 4 shows a schematic diagram illustrating in more detail processing blocks implemented by a type matching apparatus according to an embodiment for determining a plurality of first scores for the composite scores of figure 3 using a term-based matching scheme;
  • Fig. 5a and 5b show schematic diagrams illustrating in more detail processing blocks implemented by a type matching apparatus according to an embodiment for determining a plurality of second scores for the composite scores of figure 3 using a vector-based matching scheme;
  • Fig. 6a and 6b show schematic diagrams illustrating in more detail processing blocks implemented by a type matching apparatus according to two different embodiments for determining a plurality of third scores for the composite scores of figure 3 using a sequence tagging model;
  • Fig. 7 shows a schematic diagram illustrating processing blocks implemented by a type matching apparatus according to a further embodiment for determining one or more composite scores
  • Fig. 8a shows a schematic diagram illustrating in more detail processing blocks implemented by a type matching apparatus according to an embodiment for determining a plurality of fourth scores for the composite scores of figure 7 using a learning-based matching scheme;
  • Fig. 8b shows a schematic diagram illustrating the architecture of a neural network for providing a vector representation that is used by the learning-based matching scheme of figure 8a as implemented by a type matching apparatus according to an embodiment
  • Fig. 8c shows a schematic diagram illustrating the architecture of an alternative neural network for providing a vector representation that is used by the learning-based matching scheme of figure 8a as implemented by a type matching apparatus according to an embodiment
  • Fig. 9 is a flow diagram illustrating a type matching method according to an embodiment.
  • a disclosure in connection with a described method may also hold true for a corresponding device or system configured to perform the method and vice versa.
  • a corresponding device may include one or a plurality of units, e.g. functional units, to perform the described one or plurality of method steps (e.g. one unit performing the one or plurality of steps, or a plurality of units each performing one or more of the plurality of steps) , even if such one or more units are not explicitly described or illustrated in the figures.
  • if a specific apparatus is described based on one or a plurality of units, e.g. functional units, a corresponding method may include one step to perform the functionality of the one or plurality of units (e.g. one step performing the functionality of the one or plurality of units, or a plurality of steps each performing the functionality of one or more of the plurality of units) , even if such one or plurality of steps are not explicitly described or illustrated in the figures. Further, it is understood that the features of the various exemplary embodiments and/or aspects described herein may be combined with each other, unless specifically noted otherwise.
  • Figure 1 shows a schematic diagram of an apparatus 100 for type matching of a text sample according to an embodiment.
  • the apparatus 100 may be implemented as a cloud server 100 configured to receive a text sample 101 via the internet.
  • the text sample 101 is a search query 101, namely the exemplary search query “London Bridge Spa” 101 indicating that a user is searching for a “Spa” near “London Bridge” .
  • the type matching apparatus 100 may implement in software and/or hardware a tokenizer engine 103, an embedding interface 105, a term-based matching scheme or engine 107 (referred to as “termTypeMatcher” 107 in figure 1) , a vector-based matching scheme or engine 109 (referred to as “vecTypeMatcher” 109 in figure 1) , a sequence tagging model or engine 111 (referred to as “QueryTagger” in figure 1) , a learning-based matching scheme or engine 115 (referred to as “learningTypeMatcher” 115 in figure 1) and/or a ranking engine 113 (referred to as “Ranker” in figure 1) .
  • the apparatus 100 may comprise, for instance, one or more processors for processing data, a communication interface for receiving and transmitting data, and a memory for storing data.
  • the one or more processors of the apparatus 100 may comprise digital circuitry, or both analog and digital circuitry.
  • Digital circuitry may comprise components such as application-specific integrated circuits (ASICs) , field-programmable gate arrays (FPGAs) , digital signal processors (DSPs) , or general-purpose processors.
  • the memory may be configured to store executable program code which, when executed by the one or more processors, causes the type matching apparatus 100 to perform the functions and methods described herein.
  • Figure 2 shows a schematic diagram illustrating processing blocks implemented by the embedding engine 105 of the apparatus 100 of figure 1 according to an embodiment.
  • the embedding engine 105 is configured to generate, based on an input such as an n-gram, an embedding vector of the n-gram.
  • the n-gram consists of two tokens.
  • the upper processing branch in figure 2 may implement a conventional approach for determining sentence embeddings which consists of retrieving weights, in particular IDFs for each token (processing block 201) , vectors for each token (processing block 207) and computing the weighted average (processing block 209) .
  • a processing block 203 is provided that is configured to further penalize some words that are frequent in the pre-defined dictionary.
  • the lower processing branch in figure 2 allows handling of an n-gram from a text sample 101 from a language that requires segmentation, such as Chinese, Japanese, Thai, Korean and the like.
  • In a processing block 205 the n-grams are concatenated again and, if a vector for the concatenation exists, then in a processing block 211 the mean of this vector with the vector computed in the upper processing branch of figure 2 may be computed.
  • This approach improves the accuracy in many languages that require segmentation. Moreover, it reduces the space of queries to be evaluated and results to be considered and hence also the execution time.
  • the apparatus 100 for instance, by means of the tokenizer engine 103 is configured to parse a plurality of tokens (also referred to as terms) from the text sample 101, i.e. to tokenize the text sample 101 (see also processing step 305 in figure 3) , and to generate a plurality of n-grams based on the plurality of tokens (see also processing step 307 in figure 3) , wherein each n-gram comprises one or more of the plurality of tokens.
  • the tokenizer engine 103 may implement standard segmentation techniques (such as PyThaiNLP, Jieba, CoreNLP, JapaneseTokenizer) and/or the known “TokTok” tokenizer.
  • the apparatus 100 is further configured to determine, using the term-based matching scheme or engine 107, for each of the plurality of n-grams provided by the tokenizer engine 103 a plurality of first scores and an associated plurality of first candidate types from a pre-defined vocabulary of candidate types, as illustrated by processing step 309 in figure 3, which will be described in more detail in the context of figure 4 below.
  • Each first score of the plurality of first scores is indicative of the respective semantic similarity between the n-gram and the respective first candidate type of the plurality of first candidate types.
  • the term-based matching scheme or engine 107 is configured to link each type in the given pre-defined vocabulary of candidate types to an article in a Knowledge Base, such as Wikipedia that closely describes its meaning.
  • For example, for the type “FitnessClub” the Wikipedia article https://en.wikipedia.org/wiki/Health_club may be used.
  • the Wikipedia article brings synonyms and other related words relevant to the type like “aerobics” , “cycling (spinning) ” , “boxing” , “step yoga” , “regular yoga and hot (Bikram) yoga” , “pilates” that a user can use at query time to refer to that particular type.
  • the term-based matching scheme or engine 107 may score these n-grams according to known ranking models like BM25 or MLM for providing the plurality of first scores.
  • the term-based matching scheme or engine 107 may be further configured to normalize the plurality of first scores.
  • the term-based matching scheme or engine 107 may implement a normalization of the plurality of first scores on the basis of the following equations:
  • Q_n is an n-gram and t is a candidate type;
  • len () is a function returning the length of an n-gram; and
  • ngr () is a function returning the set of all n-grams of a query.
  • the apparatus 100 is further configured to determine, using the vector-based matching scheme or engine 109, for each of the plurality of n-grams provided by the tokenizer engine 103 a plurality of second scores and an associated plurality of second candidate types from the pre-defined vocabulary of candidate types, as illustrated by processing step 311 in figure 3, which will be described in more detail in the context of figures 5a and 5b below.
  • each score of the plurality of second scores is indicative of the respective semantic similarity between the n-gram and the respective second candidate type of the plurality of second candidate types.
  • the apparatus 100 is further configured to determine, using the sequence tagging model or engine 111, a probability distribution over the given categories for each of the plurality of tokens of the input text sample 101 provided by the tokenizer engine 103.
  • the apparatus 100 may implement a conventional sequence tagging model or engine 111, such as Conditional Random Fields (CRFs) , RNNs (LSTMs or GRUs) or Transformer-based sequence tagging, for providing the probability distribution for each token.
  • the sequence tagging model or engine 111 is configured to determine for a given set of categories and the text sample Q 101 the probability distribution on the basis of the following (soft-max) equation:
  • t_i is a token in the tokenized text sample;
  • Q is the text sample;
  • c_k is a particular category;
  • h_i is a vector computed by any of the conventional sequence tagging models mentioned above; and
  • exp is the natural exponential function.
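  • As a sketch only, and under the assumption that the vector h_i is projected onto the m categories by a weight matrix W and bias b (assumed names for the tagger's output-layer parameters), such a soft-max can be written as P (c_k | t_i, Q) = exp (W_k · h_i + b_k) / Σ_{j=1..m} exp (W_j · h_i + b_j) .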
  • the apparatus 100 is configured to determine for each of the plurality of n-grams a respective third score based on the probability distribution for all the tokens provided by the sequence tagging model or engine 111, wherein the third score is indicative of the degree to which each n-gram represents a type, and then a plurality of composite scores based on the plurality of first scores, the plurality of second scores and the third score, as illustrated in processing step 315 of figure 3.
  • the apparatus 100 is configured to determine the plurality of composite scores as a respective weighted sum of a respective first score, a respective second score and a respective third score of each of the plurality of n-grams.
  • the plurality of composite scores, i.e. final scores, may be computed by the ranker engine 113.
  • the apparatus 100 is configured, using the ranker engine 113, to rank the plurality of n-grams based on the plurality of composite scores (or final scores) , as illustrated in processing block 317 of figure 3.
  • the ranker engine 113 is configured to perform a ranking based on a combination of the probability scores, i.e. third scores provided by the sequence tagging model or engine 111 and the first and second scores provided by the term-based matching scheme or engine 107 and the vector-based matching scheme or engine 109, respectively.
  • the ranker engine 113 may be configured to use a mixture model such that given an n-gram [n1 n2 ...nm] the ranker engine 113 sums the probabilities of the tokens that are annotated as [TYPE] , i.e. belong to the category [TYPE] and subtracts the probabilities of tokens that are annotated using a negative set of tags.
  • the ranker engine 113 may be configured to check the probabilities of the annotations for each token in the 2nd and 3rd ranks of the probability distributions provided by the sequence tagging model or engine 111.
  • one or more of the highest scoring n-grams and a list of the associated composite scores, i.e. final scores and candidate types may be provided as the output of the apparatus 100, as illustrated in processing block 319 of figure 3.
  • figure 3 shows a schematic diagram illustrating processing blocks implemented by the type matching apparatus 100 according to an embodiment.
  • the input text sample 101 may be a user question or query and the output may be a ranked list of triples of candidate mentions and types with a confidence score of the form <ni, ti, si>.
  • Figure 3 depicts a hybrid pipeline for type detection and linking, i.e. type matching which is configured to detect which part (n-gram) of the input text sample 101 corresponds to a type and also match that part to the most related dictionary type.
  • the processing chain illustrated in figure 3 first determines the language of the input text sample (processing block 301) .
  • If the language is one from the list of languages requiring segmentation, the processing chain calls an external segmentation engine (processing block 303) suitable for the particular language such as PyThaiNLP (Thai) , CoreNLP/Jieba (Chinese) , or JapaneseTokenizer (Japanese) .
  • the processing chain then proceeds to further tokenize the sentence (processing block 305) using tokenizers known in the art such as TokTok and computes the n-grams of the tokenized text sample (processing block 307) .
  • the processing chain then makes use of the term-based matching scheme or engine 107 (an implementation of which will be described in more detail in the context of figure 4) and the vector-based matching scheme or engine 109 (an implementation of which will be described in more detail in the context of figures 5a and 5b) for detection and matching which assign first and second scores s1 and s2 (respectively) to each candidate mention and type.
  • the sequence tagging model or engine 111 (processing block 313) is used to compute a probability distribution for each of the tokens.
  • the processing chain illustrated in figure 3 computes using the probability distribution the third score, i.e. the tagging score s3 for each n-gram, and a composite score from the first, second, and third score (processing block 315) .
  • figure 4 shows a schematic diagram illustrating in more detail processing blocks implemented by the type matching apparatus 100 according to an embodiment for determining the plurality of first scores using the term-based matching scheme or engine 107.
  • the input in figure 4 is a list of n-grams 101a computed during the execution of the processing blocks of figure 3 and the output is a list of triples of candidate mentions and types with a confidence, i.e. first score of the form <ni, ti, si>.
  • the term-based matching scheme or engine 107 is configured to use an inverted index (processing block 405) which associates a type ti to at least one external document (e.g., a Wikipedia document) that best describes the meaning of the type.
  • the association may be performed using a semi-automated mapping or alignment algorithm (processing block 407) between the external document corpus (block 409) and the type dictionary.
  • Each n-gram ni is used as a free text query in order to retrieve candidate types ti for ni (processing block 402) .
  • Each retrieved candidate is associated with a ranking score computed by the full-text engine (processing block 403) (common ranking models include, but are not limited to BM25, MLM, and more) .
  • the first scores returned by the full text engine may be normalized according to the maximum score returned and the length of each n-gram (see processing blocks 411 and 413 of figure 4) .
  • the candidate triples are ranked and returned to the further processing blocks of figure 3 (see processing blocks 415 and 417 of figure 4) .
  • figures 5a and 5b show schematic diagrams illustrating in more detail processing blocks implemented by the type matching apparatus 100 according to an embodiment for determining the plurality of second scores using the vector-based matching scheme or engine 109.
  • the input of the processing chain illustrated in figure 5a is a list of n-grams 101a computed during the execution of the processing blocks of figure 3 and the output is a list of triples of candidate mentions and types with a confidence, i.e. second score of the form <ni, ti, si>.
  • the processing block 501 calls the processing blocks shown in figure 5b to compute a respective embedding vector for each n-gram.
  • the processing chain shown in figure 5a also loads and computes a vector embedding for each type in the pre-defined dictionary (see processing blocks 503 and 505 of figure 5a) .
  • the processing proceeds by computing the cosine similarity between each n-gram vector and each type-vector in processing block 507.
  • the cosine similarity also assigns a similarity (confidence) score between each pair.
  • the algorithm finally ranks the candidate triples and returns them to the processing blocks of figure 3 (see processing blocks 509 and 511 of figure 5a) .
  • the input of the processing chain illustrated in figure 5b is an n-gram 101b consisting of a list of tokens and the output is an embedding of the n-gram into a word vector space, i.e. an embedding vector.
  • the upper processing branch of figure 5b uses a weighted average formula to compute the embedding vector (see in particular processing block 535 of figure 5b) .
  • Such a weighted average can be computed using for each token of the n-gram its idf (inverse document frequency) computed in a large document corpus (see blocks 521, 523, and 529) and a pre-computed vector (see block 533) that can be found in vector dictionaries such as fasttext, word2vec, GloVe, Numberbatch, BPEmb.
  • a list of common tokens mined from the type dictionary may be used, the weight of which is further reduced (see processing blocks 525 and 531 of figure 5b) .
  • the vector-based matching scheme or engine 109 may implement a further processing branch (in figure 5b the lower processing branch) for determining embedding vectors for words in languages that require segmentation, such as Chinese, Japanese, Thai, Korean and the like.
  • This alternative processing branch includes concatenating the n-gram tokens without using spaces (see processing block 539 of figure 5b) , retrieving a vector for the concatenation (see processing block 541 of figure 5b) and finally taking the mean of the retrieved vector with the one computed using the weighted average approach (see processing block 549 of figure 5b) .
  • if no vector exists for the concatenation, this additional vector may be set to the zero vector (block 543) and the processing chain then proceeds to decide and return the weighted average vector (blocks 545 and 547) computed as described above.
  • figures 6a and 6b show schematic diagrams illustrating in more detail processing blocks implemented by the type matching apparatus 100 according to two different embodiments for determining the plurality of third scores using a sequence tagging model or engine 111.
  • the input of the processing chain illustrated in figure 6a is a single n-gram 101b and the probability distribution computed by the sequence tagger (see blocks 601 and 603) .
  • the output is a third score about the confidence that this n-gram represents a type (see processing block 611 of figure 6a) .
  • the score is computed as a weighted average of the probability that each individual token of that n-gram is believed to be a type or is believed to be of some other category.
  • the processing chain considers for each token the most likely category predicted by the sequence tagger (see processing block 607) and tokens annotated with type (T) are counted positive whereas tokens annotated with categories from a list of negative categories are counted negatively (see processing blocks 605 and 609 of figure 6a) .
  • the processing scheme illustrated in figure 6a is further refined by using a more complex logic.
  • the main difference with the processing scheme shown in figure 6a is that, for instance, the top-3 ranks of probabilities and category candidates are used in order to compute a score for the given n-gram based again on its individual tokens and the probability distribution computed by the sequence tagger (see blocks 621 and 623) .
  • the processing chain initially sets the third score to 0 (see block 625) and then iterates over the tokens of the n-gram (see blocks 627 and 631) counting positively tokens annotated as types in the first rank (blocks 633 and 635) , or annotated as types in the second rank (see blocks 637 and 639) , or annotated as types in the third rank but provided that neither in the first nor in the second rank they are annotated with one of the negative categories (see blocks 641 and 643) .
  • the scheme counts negatively tokens annotated with one of the negative categories in the first rank (see blocks 645 and 647) or annotated with one of the negative categories in the second rank (see blocks 649 and 651) .
  • the probability used to count positively or negatively in the computation of the third score may be scaled according to the rank that is used (see blocks 635, 639, 643, 647, and 651) . Categories of lower rank count less towards the overall third score, whereas categories of higher rank count more. When all tokens of the n-gram have been processed the third score is returned (see block 629) .
  • Figure 7 shows a variant of the processing scheme of figure 3 that may be implemented by the type matching apparatus 100 according to a further embodiment.
  • the main difference with respect to the processing scheme shown in figure 3 is that in addition to the term-based (block 709) and the vector-based schemes (block 715) a learning-based matching scheme is used (see processing block 711) to compute a fourth score.
  • this fourth score is considered by processing block 717 in order to compute a composite score.
  • Figure 8a shows a schematic diagram illustrating in more detail processing blocks implemented by the type matching apparatus 100 according to an embodiment for determining a plurality of fourth scores for the composite scores of figure 7 using a learning-based matching scheme.
  • the input of the processing chain is the list of n-grams, the list of types from the dictionary and the probability distribution as computed by the sequence tagger 111.
  • the processing chain illustrated in figure 8a retrieves the computed probability distribution (block 803) from the sequence tagger 111 (block 801) and uses it to compute a vector representation g for the input sentence (block 805) using a first neural network architecture.
  • This architecture can be any of those known in the art, such as, but not limited to, ConvNets, neural networks using weighted sums and non-linearities, sequence-to-sequence models, transformers and attention; particular examples of implementations are given next.
  • the processing chain illustrated in figure 8a computes embedding vectors for the n-grams (block 807) and embedding vectors for the plurality of types (blocks 811 and 809) using the procedure of figure 5b that has been described in detail further above. All vector representations (the sentence vector g computed in 805, the n-gram and the type) are concatenated and multiplied by a matrix (see block 813) that represents a second neural network.
  • the chain proceeds to compute the energy of the final vector as the L1 or L2 norm (block 815) , uses the energies as scores to rank the candidates (block 817) and finally returns the highest rank n-gram and candidate as the fourth score.
  • Figure 8b shows a schematic diagram illustrating a possible architecture of a first neural network for providing a vector representation from the probability distribution in processing block 805 of the learning-based matching scheme of figure 8a as implemented by the type matching apparatus 100 according to an embodiment.
  • the figure shows a particular implementation of a ConvNet network (see block 833) for computing a vector representation.
  • the probability distribution from the sequence tagger 111 is arranged as a matrix of dimensions ℓ×m, where ℓ is the length of the input sentence (block 835) and m is the number of categories.
  • This matrix is convolved with a plurality of filters of size k×m, also called a volume (block 837) and the output is flattened and further processed by a fully-connected layer (block 839) .
  • the final vector is returned to the further processing blocks of figure 8.
  • Figure 8c shows a schematic diagram illustrating the architecture of another possible implementation of a first neural network for providing a vector representation in processing block 805 of the learning-based matching scheme as implemented by the type matching apparatus 100 according to an embodiment.
  • This is a particular example of a weighted average and non-linearity network (block 843) .
  • the probabilities that each token belongs to each category are used to compute a weighted average which is then processed by a non-linear function such as, but not limited to, ReLU or leaky ReLU.
  • the final vector is returned to the further processing blocks of figure 8.
  • Figure 9 is a flow diagram illustrating the different steps of a method 900 for processing, in particular type matching, the text sample 101.
  • the method 900 comprises the steps of:
  • parsing 901 a plurality of tokens from the text sample 101, i.e. tokenizing the text sample 101;
  • generating a plurality of n-grams based on the plurality of tokens, wherein each n-gram comprises one or more of the plurality of tokens;
  • determining 905 for each of the plurality of n-grams using a term-based matching scheme, one or more first scores and one or more first candidate types from a pre-defined vocabulary of candidate types, wherein the one or more first scores are indicative of the semantic similarity between the n-gram and the one or more first candidate types;
  • determining 907 for each of the plurality of n-grams using a vector-based matching scheme, one or more second scores and one or more second candidate types from the pre-defined vocabulary of candidate types, wherein the one or more second scores are indicative of the semantic similarity between the n-gram and the one or more second candidate types;
  • the type matching method 900 may further comprise the step of returning as output one or more of the highest scoring n-grams and a list of the associated composite scores and candidate types.
  • the type matching method 900 may be performed by the type matching apparatus 100 described above. Thus, further features of the type matching method 900 result directly from the functionality of the type matching apparatus 100 and its different embodiments described above and below.
  • the disclosed system, apparatus, and method may be implemented in other manners.
  • the described embodiment of an apparatus is merely exemplary.
  • the unit division is merely logical function division and may be another division in an actual implementation.
  • a plurality of units or components may be combined or integrated into another system, or some features may be ignored or not performed.
  • the displayed or discussed mutual couplings or direct couplings or communication connections may be implemented by using some interfaces.
  • the indirect couplings or communication connections between the apparatuses or units may be implemented in electronic, mechanical, or other forms.
  • the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one position, or may be distributed on a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.
  • functional units in the embodiments of the invention may be integrated into one processing unit, or each of the units may exist alone physically, or two or more units may be integrated into one unit.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Probability & Statistics with Applications (AREA)
  • Machine Translation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

An apparatus (100) for processing a text sample (101) is disclosed. The apparatus (100) is configured to tokenize the text sample (101) and generate a plurality of n-grams based on the plurality of tokens. The apparatus (100) is further configured to determine for each n-gram, using a term-based matching scheme (107), one or more first scores and one or more first candidate types from a pre-defined vocabulary of candidate types and, using a vector-based matching scheme (109), one or more second scores and one or more second candidate types. Moreover, the apparatus (100) is configured to determine for each n-gram, using a sequence tagging model (111), a third score and to determine one or more composite scores based on the one or more first scores, the one or more second scores and the third score. The apparatus (100) is further configured to rank the plurality of n-grams based on the one or more composite scores and may output one or more of the highest scoring n-grams and a list of the associated composite scores and candidate types.

Description

Apparatus and method for type matching of a text sample
TECHNICAL FIELD
The present disclosure relates to information processing technology. More specifically, the present disclosure relates to an apparatus and method for type matching of a text sample.
BACKGROUND
Annotating text or text samples, such as documents, queries and the like with Knowledge Base resources like entities, relations, and types is an important step for many downstream tasks like semantic search and question answering. Although entity and relation linking have been extensively studied in the past, less attention has been spent on type linking and matching even though the problem is challenging in many domains.
SUMMARY
It is an objective of the present disclosure to provide an improved apparatus and method for type matching of a text sample.
The foregoing and other objectives are achieved by the subject matter of the independent claims. Further implementation forms are apparent from the dependent claims, the description and the figures.
According to a first aspect an apparatus for processing, in particular type matching a text sample is provided. The text sample may, for instance, be a user search query and the apparatus based on the type matching may return one or more query results in response to the search query.
The apparatus is configured to parse a plurality of tokens (also referred to as terms) from the text sample, i.e. tokenize the text sample, and to generate a plurality of n-grams based on the plurality of tokens, wherein each n-gram comprises one or more of the plurality of tokens.
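Purely as an illustration (not the patented implementation), generating all contiguous n-grams from a tokenized sample can be sketched in Python as follows; the helper name and the maximum n-gram length are assumptions:

```python
from typing import List, Tuple

def generate_ngrams(tokens: List[str], max_n: int = 3) -> List[Tuple[str, ...]]:
    """Return all contiguous n-grams of length 1..max_n over the token list."""
    ngrams = []
    for n in range(1, min(max_n, len(tokens)) + 1):
        for start in range(len(tokens) - n + 1):
            ngrams.append(tuple(tokens[start:start + n]))
    return ngrams

# Example with the query used later in the description: "London Bridge Spa"
tokens = "London Bridge Spa".split()   # a trivial whitespace split stands in for the tokenizer
print(generate_ngrams(tokens))
# [('London',), ('Bridge',), ('Spa',), ('London', 'Bridge'), ('Bridge', 'Spa'), ('London', 'Bridge', 'Spa')]
```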
The apparatus is further configured to determine for each of the plurality of n-grams, using a term-based matching scheme, one or more first scores and one or more first candidate types from a pre-defined vocabulary of candidate types, wherein the one or more first scores are indicative of a semantic similarity between the n-gram and the one or more first candidate types.
The apparatus is further configured to determine for each of the plurality of n-grams, using a vector-based matching scheme, one or more second scores and one or more second candidate types from the pre-defined vocabulary of candidate types, wherein (like the one or more first scores) the one or more second scores are indicative of a semantic similarity between the n-gram and the one or more second candidate types.
The apparatus is further configured to determine for each of the plurality of n-grams, using a sequence tagging model, a third score, wherein the third score is indicative of the degree to which each n-gram represents a type.
Moreover, the apparatus is configured to determine for each of the plurality of n-grams one or more composite scores based on the one or more first scores, the one or more second scores and the third score and to rank the plurality of n-grams based on the one or more composite scores. One or more of the highest scoring n-grams and a list of the associated composite scores and candidate types may be provided as the output of the apparatus.
In a further possible implementation form of the first aspect, the sequence tagging model defines for each of the plurality of tokens of the text sample a probability that the respective token is a type, i.e. belongs to the category TYPE.
In a further possible implementation form of the first aspect, the apparatus is configured to determine for each of the plurality of n-grams, using the sequence tagging model, the third score as a weighted average of the probabilities that each of the one or more tokens of the n-gram is a type, i.e. belongs to the category TYPE.
In a further possible implementation form of the first aspect, the apparatus is configured to determine the weighted average of the probabilities that each of the one or more tokens of the n-gram is a type, i.e. belongs to the category TYPE by using a negative weight for those of the one or more tokens of the n-gram that are a non-type, i.e. belong to a category other than TYPE.
In a further possible implementation form of the first aspect, the apparatus is configured to determine for each of the plurality of n-grams, using the sequence tagging model, the third  score based on those probabilities that each of the one or more tokens of the n-gram is a type, i.e. belongs to the category TYPE that exceed a threshold probability. For instance, the apparatus may be configured to determine the third score based on the probabilities of the three tokens having the largest probabilities.
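A minimal sketch of such a third-score computation, assuming the tagger output for each token is available as a dictionary mapping category labels to probabilities; the label names, the simple averaging and the set of negative categories are illustrative assumptions:

```python
NEGATIVE_CATEGORIES = {"ENTITY", "RELATION", "OTHER"}   # illustrative negative tag set

def third_score(ngram_token_probs, type_label="TYPE"):
    """Average of per-token TYPE probabilities; tokens tagged with negative categories count negatively.

    ngram_token_probs: list of dicts, one per token, mapping category -> probability.
    """
    total = 0.0
    for probs in ngram_token_probs:
        best_category = max(probs, key=probs.get)        # most likely category for this token
        if best_category == type_label:
            total += probs[type_label]                   # TYPE tokens count positively
        elif best_category in NEGATIVE_CATEGORIES:
            total -= probs[best_category]                # negative categories count negatively
    return total / len(ngram_token_probs)

# Example: a two-token n-gram where only the second token looks like a type mention
probs = [{"TYPE": 0.2, "ENTITY": 0.7, "OTHER": 0.1},
         {"TYPE": 0.9, "ENTITY": 0.05, "OTHER": 0.05}]
print(third_score(probs))   # (-0.7 + 0.9) / 2 = 0.1
```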
In a further possible implementation form of the first aspect, the one or more composite scores are one or more weighted sums of the one or more first scores, the one or more second scores and the third score of each of the plurality of n-grams.
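For illustration, a composite score of this kind and the subsequent ranking might be sketched as follows; the weights w1, w2 and w3 are placeholders, as the disclosure does not fix particular values:

```python
def composite_score(s1, s2, s3, w1=0.4, w2=0.4, w3=0.2):
    """Weighted sum of the term-based (s1), vector-based (s2) and tagging (s3) scores."""
    return w1 * s1 + w2 * s2 + w3 * s3

# candidates: (n-gram, candidate type, s1, s2, s3) tuples collected from the matchers
candidates = [
    (("Spa",), "FitnessClub", 0.8, 0.7, 0.9),
    (("London", "Bridge"), "Bridge", 0.6, 0.5, -0.2),
]
ranked = sorted(candidates, key=lambda c: composite_score(*c[2:]), reverse=True)
best_ngram, best_type, *scores = ranked[0]
print(best_ngram, best_type, composite_score(*scores))
```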
In a further possible implementation form of the first aspect, the apparatus is further configured to:
determine for each of the plurality of n-grams, using a learning-based matching scheme based on a feed-forward neural network, one or more fourth scores and one or more fourth candidate types from the pre-defined vocabulary of candidate types, wherein the one or more fourth scores are indicative of a semantic similarity between the n-gram and the one or more fourth candidate types; and
determine for each of the plurality of n-grams the one or more composite scores based on the one or more first scores, the one or more second scores, the third score and the one or more fourth scores, in particular as a weighted sum of these scores.
In a further possible implementation form of the first aspect, for determining the one or more fourth scores for each of the plurality of n-grams, using the learning-based matching scheme, the apparatus is configured to:
determine an embedding vector for the text sample from the probabilities of each of the one or more of its tokens using a first neural network;
determine an embedding vector for the n-gram and a plurality of further embedding vectors for the plurality of candidate types from the pre-defined vocabulary of candidate types; and
feed the embedding vector for the text sample, the embedding vector for the n-gram and the further plurality of embedding vectors into a second feed-forward neural network for determining the one or more fourth scores.
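The following NumPy sketch illustrates only the data flow of this learning-based scoring, using random, untrained weights; the pooling operation standing in for the first network, the dimensions and the use of the L2 norm as the energy are assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 50                                 # embedding dimensionality (assumed)

def sentence_vector(tag_probs: np.ndarray, W: np.ndarray) -> np.ndarray:
    """First network: map the (tokens x categories) probability matrix to a sentence vector."""
    pooled = tag_probs.mean(axis=0)    # simple pooling stands in for the ConvNet / weighted-average variants
    return np.maximum(W @ pooled, 0)   # ReLU non-linearity

def fourth_score(sent_vec, ngram_vec, type_vec, M: np.ndarray) -> float:
    """Second network: concatenate all representations, project, and take the norm as the energy/score."""
    combined = np.concatenate([sent_vec, ngram_vec, type_vec])
    return float(np.linalg.norm(M @ combined))

tag_probs = rng.random((3, 5))         # 3 tokens, 5 categories
W = rng.standard_normal((d, 5))
M = rng.standard_normal((d, 3 * d))
ngram_vec, type_vec = rng.standard_normal(d), rng.standard_normal(d)

print(fourth_score(sentence_vector(tag_probs, W), ngram_vec, type_vec, M))
```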
In a further possible implementation form of the first aspect, for determining the one or more first scores for each of the plurality of n-grams using the term-based matching scheme the apparatus is further configured to normalize the one or more first scores for each of the n-grams based on the number of tokens of the respective n-gram.
In a further possible implementation form of the first aspect, for determining the one or more second scores for each of the plurality of n-grams using the vector-based matching scheme the apparatus is configured to:
determine an embedding vector for the n-gram;
determine a plurality of similarity scores, e.g. the cosine similarity scores, between the embedding vector for the n-gram and a plurality of further embedding vectors, wherein the  plurality of further embedding vectors are based on the plurality of candidate types from the pre-defined dictionary of candidate types; and
determine one or more of the largest similarity scores as the one or more second scores for the respective n-gram.
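A small sketch of this vector-based matching; the toy three-dimensional vectors stand in for real n-gram and type embeddings:

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def vector_based_match(ngram_vec: np.ndarray, type_vectors: dict, top_k: int = 2):
    """Return the top_k candidate types by cosine similarity as (type, second score) pairs."""
    scored = [(t, cosine(ngram_vec, v)) for t, v in type_vectors.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_k]

ngram_vec = np.array([0.9, 0.1, 0.0])
type_vectors = {
    "FitnessClub": np.array([0.8, 0.2, 0.1]),
    "Restaurant":  np.array([0.1, 0.9, 0.0]),
    "Bridge":      np.array([0.0, 0.1, 0.9]),
}
print(vector_based_match(ngram_vec, type_vectors))
```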
In a further possible implementation form of the first aspect, for each n-gram the apparatus is configured to determine the embedding vector for the n-gram as a weighted sum of a plurality of token vectors, wherein each token vector is associated with one of the one or more tokens of the n-gram.
In a further possible implementation form of the first aspect, the text sample is a non-segmented text sample, i.e. in a non-segmented language, such as Chinese, Japanese, Thai and the like, wherein the apparatus is further configured to segment the text sample into the plurality of tokens.
In a further possible implementation form of the first aspect, the apparatus is further configured to:
concatenate the tokens of the n-gram for obtaining a composite token comprising a continuous string of characters;
retrieve a composite vector associated with the composite token from a dictionary of vectors; and
determine the embedding vector as the average vector of the composite vector and the weighted sum of the plurality of token vectors.
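Taken together, the n-gram embedding could be sketched as below; the IDF weights, the vector dictionaries and the token names are illustrative, and the optional averaging with a vector of the concatenated tokens follows the handling of segmented languages described above:

```python
import numpy as np

def ngram_embedding(tokens, token_vectors, idf, concat_vectors=None):
    """IDF-weighted sum of token vectors, optionally averaged with a vector for the concatenated token."""
    weights = np.array([idf.get(t, 1.0) for t in tokens])
    vecs = np.stack([token_vectors[t] for t in tokens])
    weighted = (weights[:, None] * vecs).sum(axis=0) / weights.sum()

    if concat_vectors is not None:
        composite = "".join(tokens)                      # continuous string, e.g. for Chinese/Japanese/Thai input
        if composite in concat_vectors:
            return (weighted + concat_vectors[composite]) / 2.0
    return weighted

token_vectors = {"health": np.array([1.0, 0.0]), "club": np.array([0.0, 1.0])}
idf = {"health": 2.0, "club": 1.0}
print(ngram_embedding(["health", "club"], token_vectors, idf))   # approximately [0.667, 0.333]
```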
In a further possible implementation form of the first aspect, the text sample is a user search query, wherein the apparatus is further configured to provide one or more query results based on one or more of the plurality of n-grams ranked based on the one or more composite scores.
According to a second aspect a method for processing, in particular type matching a text sample is provided. The method comprises the steps of:
parsing a plurality of tokens from the text sample, i.e. tokenizing the text sample;
generating a plurality of n-grams based on the plurality of tokens, wherein each n-gram comprises one or more of the plurality of tokens;
determining for each of the plurality of n-grams, using a term-based matching scheme, one or more first scores and one or more first candidate types from a pre-defined vocabulary of candidate types, wherein the one or more first scores are indicative of a semantic similarity between the n-gram and the one or more first candidate types;
determining for each of the plurality of n-grams, using a vector-based matching scheme, one or more second scores and one or more second candidate types from the pre-defined vocabulary of candidate types, wherein the one or more second scores are indicative of a semantic similarity between the n-gram and the one or more second candidate types;
determining for each of the plurality of n-grams, using a sequence tagging model, a third score, wherein the third score is indicative of the degree to which each n-gram represents a type;
determining for each of the plurality of n-grams one or more composite scores based on the one or more first scores, the one or more second scores and the third score; and
ranking the plurality of n-grams based on the one or more composite scores.
The method according to the second aspect may further comprise the step of returning as output one or more of the highest scoring n-grams and a list of the associated composite scores and candidate types.
The type matching method according to the second aspect can be performed by the type matching apparatus according to the first aspect. Thus, further features of the type matching method according to the second aspect result directly from the functionality of the type matching apparatus according to the first aspect as well as its different implementation forms and embodiments described above and below.
According to a third aspect a computer program product is provided, comprising a non-transitory computer-readable storage medium for storing program code which causes a computer or a processor to perform the type matching method according to the second aspect, when the program code is executed by the computer or the processor.
Details of one or more embodiments are set forth in the accompanying drawings and the description below. Other features, objects, and advantages will be apparent from the description, drawings, and claims.
BRIEF DESCRIPTION OF THE DRAWINGS
In the following, embodiments of the present disclosure are described in more detail with reference to the attached figures and drawings, in which:
Fig. 1 shows a schematic diagram of an apparatus for type matching of a text sample according to an embodiment;
Fig. 2 shows a schematic diagram illustrating processing blocks implemented by an embedding engine of a type matching apparatus according to an embodiment;
Fig. 3 shows a schematic diagram illustrating processing blocks implemented by a type matching apparatus according to an embodiment for determining one or more composite scores;
Fig. 4 shows a schematic diagram illustrating in more detail processing blocks implemented by a type matching apparatus according to an embodiment for determining a plurality of first scores for the composite scores of figure 3 using a term-based matching scheme;
Fig. 5a and 5b show schematic diagrams illustrating in more detail processing blocks implemented by a type matching apparatus according to an embodiment for determining a plurality of second scores for the composite scores of figure 3 using a vector-based matching scheme;
Fig. 6a and 6b show schematic diagrams illustrating in more detail processing blocks implemented by a type matching apparatus according to two different embodiments for determining a plurality of third scores for the composite scores of figure 3 using a sequence tagging model;
Fig. 7 shows a schematic diagram illustrating processing blocks implemented by a type matching apparatus according to a further embodiment for determining one or more composite scores;
Fig. 8a shows a schematic diagram illustrating in more detail processing blocks implemented by a type matching apparatus according to an embodiment for determining a plurality of fourth scores for the composite scores of figure 7 using a learning-based matching scheme;
Fig. 8b shows a schematic diagram illustrating the architecture of a neural network for providing a vector representation that is used by the learning-based matching scheme of figure 8a as implemented by a type matching apparatus according to an embodiment; and
Fig. 8c shows a schematic diagram illustrating the architecture of an alternative neural network for providing a vector representation that is used by the learning-based matching  scheme of figure 8a as implemented by a type matching apparatus according to an embodiment; and
Fig. 9 is a flow diagram illustrating a type matching method according to an embodiment.
In the following, identical reference signs refer to identical or at least functionally equivalent features.
DETAILED DESCRIPTION OF THE EMBODIMENTS
In the following description, reference is made to the accompanying figures, which form part of the disclosure, and which show, by way of illustration, specific aspects of embodiments of the present disclosure or specific aspects in which embodiments of the present disclosure may be used. It is understood that embodiments of the present disclosure may be used in other aspects and comprise structural or logical changes not depicted in the figures. The following detailed description, therefore, is not to be taken in a limiting sense, and the scope of the present disclosure is defined by the appended claims.
For instance, it is to be understood that a disclosure in connection with a described method may also hold true for a corresponding device or system configured to perform the method and vice versa. For example, if one or a plurality of specific method steps are described, a corresponding device may include one or a plurality of units, e.g. functional units, to perform the described one or plurality of method steps (e.g. one unit performing the one or plurality of steps, or a plurality of units each performing one or more of the plurality of steps) , even if such one or more units are not explicitly described or illustrated  in the figures. On the other hand, for example, if a specific apparatus is described based on one or a plurality of units, e.g. functional units, a corresponding method may include one step to perform the functionality of the one or plurality of units (e.g. one step performing the functionality of the one or plurality of units, or a plurality of steps each performing the functionality of one or more of the plurality of units) , even if such one or plurality of steps are not explicitly described or illustrated in the figures. Further, it is understood that the features of the various exemplary embodiments and/or aspects described herein may be combined with each other, unless specifically noted otherwise.
Figure 1 shows a schematic diagram of an apparatus 100 for type matching of a text sample according to an embodiment. In an embodiment, the apparatus 100 may be implemented as a cloud server 100 configured to receive a text sample 101 via the internet. In the embodiment shown in figure 1, the text sample 101 is a search query 101, namely the exemplary search query “London Bridge Spa” 101 indicating that a user is searching for a “Spa” near “London Bridge” .
As illustrated in figure 1, the type matching apparatus 100 may implement in software and/or hardware a tokenizer engine 103, an embedding interface 105, a term-based matching scheme or engine 107 (referred to as “termTypeMatcher” 107 in figure 1) , a vector-based matching scheme or engine 109 (referred to as “vecTypeMatcher” 109 in figure 1) , a sequence tagging model or engine 111 (referred to as “QueryTagger” in figure 1) , a learning-based matching scheme or engine 115 (referred to as “learningTypeMatcher” 115 in figure 1) and/or a ranking engine 113 (referred to as “Ranker” in figure 1) . For implementing one or more of these components, which will be described in more detail below, in software and/or hardware the apparatus 100 may  comprise, for instance, one or more processors for processing data, a communication interface for receiving and transmitting data, and a memory for storing data. The one or more processors of the apparatus 100 may comprise digital circuitry, or both analog and digital circuitry. Digital circuitry may comprise components such as application-specific integrated circuits (ASICs) , field-programmable arrays (FPGAs) , digital signal processors (DSPs) , or general-purpose processors. The memory may be configured to store executable program code which, when executed by the one or more processors, causes the type matching apparatus 100 to perform the functions and methods described herein.
Figure 2 shows a schematic diagram illustrating processing blocks implemented by the embedding engine 105 of the apparatus 100 of figure 1 according to an embodiment. Generally, the embedding engine 105 is configured to generate, based on an input such as an n-gram, an embedding vector of the n-gram. In the example shown in figure 2 the n-gram consists of two tokens. The upper processing branch in figure 2 may implement a conventional approach for determining sentence embeddings, which consists of retrieving weights, in particular IDFs, for each token (processing block 201), retrieving vectors for each token (processing block 207) and computing the weighted average (processing block 209). In the embodiment shown in figure 2, however, a processing block 203 is provided that is configured to further penalize some words that are frequent in the pre-defined dictionary. The lower processing branch in figure 2 allows handling of an n-gram from a text sample 101 in a language that requires segmentation, such as Chinese, Japanese, Thai, Korean and the like. In a processing block 205 the n-gram tokens are concatenated again and, if a vector for the concatenation exists, then in a processing block 211 the mean of this vector and the vector computed in the upper processing branch of figure 2 may be computed. This approach improves the accuracy in many languages that require segmentation. Moreover, it reduces the space of queries to be evaluated and results to be considered and hence also the execution time.
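By way of illustration only, the following Python sketch shows one possible realization of the two processing branches of figure 2; it is a minimal sketch under stated assumptions (the IDF table, the penalty factor for dictionary-frequent words, the vector lookup and the handling of out-of-vocabulary tokens are hypothetical placeholders and not taken from the present disclosure):

```python
import numpy as np

def embed_ngram(tokens, word_vectors, idf, frequent_in_dictionary,
                penalty=0.5, requires_segmentation=False):
    """Minimal sketch of the embedding engine of figure 2 (assumptions noted below).

    word_vectors: dict mapping a token (or concatenated string) to a NumPy vector.
    idf: dict mapping a token to its inverse document frequency.
    frequent_in_dictionary: set of words frequent in the pre-defined type dictionary.
    """
    # Upper branch (blocks 201, 207, 209): IDF-weighted average of token vectors,
    # with an extra penalty for words frequent in the type dictionary (block 203).
    vecs, weights = [], []
    for t in tokens:
        if t not in word_vectors:
            continue  # assumption: out-of-vocabulary tokens are simply skipped
        w = idf.get(t, 1.0)
        if t in frequent_in_dictionary:
            w *= penalty  # assumed penalization factor
        vecs.append(word_vectors[t])
        weights.append(w)
    if not vecs:
        return None
    weighted_avg = np.average(np.stack(vecs), axis=0, weights=weights)

    # Lower branch (blocks 205, 211): for languages requiring segmentation,
    # concatenate the tokens and, if a vector exists for the concatenation,
    # average it with the weighted-average vector.
    if requires_segmentation:
        concat = "".join(tokens)
        if concat in word_vectors:
            return (weighted_avg + word_vectors[concat]) / 2.0
    return weighted_avg
```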
As will be described in more detail below under further reference to the figures 3, 4, 5a, 5b and 6a, 6b, the apparatus 100, for instance by means of the tokenizer engine 103, is configured to parse a plurality of tokens (also referred to as terms) from the text sample 101, i.e. to tokenize the text sample 101 (see also processing step 305 in figure 3), and to generate a plurality of n-grams based on the plurality of tokens (see also processing step 307 in figure 3), wherein each n-gram comprises one or more of the plurality of tokens. In an embodiment, the tokenizer engine 103 may implement standard segmentation techniques (such as PyThaiNLP, Jieba, CoreNLP, JapaneseTokenizer) and/or the known “TokTok” tokenizer.
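A minimal sketch of this tokenization and n-gram generation (processing steps 305 and 307) is given below; a plain whitespace split stands in for the TokTok tokenizer and the language-specific segmentation engines, and the maximum n-gram length is an assumed parameter:

```python
def generate_ngrams(text, max_n=3):
    """Sketch of processing steps 305/307: tokenize and enumerate n-grams.

    A whitespace split is used here only for illustration; max_n is an assumed limit.
    """
    tokens = text.split()
    ngrams = []
    for n in range(1, min(max_n, len(tokens)) + 1):
        for i in range(len(tokens) - n + 1):
            ngrams.append(tokens[i:i + n])
    return tokens, ngrams

tokens, ngrams = generate_ngrams("London Bridge Spa")
# ngrams includes ["London"], ["Bridge"], ["Spa"], ["London", "Bridge"],
# ["Bridge", "Spa"] and ["London", "Bridge", "Spa"]
```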
The apparatus 100 is further configured to determine, using the term-based matching scheme or engine 107, for each of the plurality of n-grams provided by the tokenizer engine 103 a plurality of first scores and an associated plurality of first candidate types from a pre-defined vocabulary of candidate types, as illustrated by processing step 309 in figure 3, which will be described in more detail in the context of figure 4 below. Each first score of the plurality of first scores is indicative of the respective semantic similarity between the n-gram and the respective first candidate type of the plurality of first candidate types.
As will be described in more detail below in the context of figure 4, in an embodiment, the term-based matching scheme or engine 107 is configured to link each type in the given pre-defined vocabulary of candidate types to an article in a Knowledge Base, such as Wikipedia, that closely describes its meaning. For example, for the type “FitnessClub” the Wikipedia article https://en.wikipedia.org/wiki/Health_club may be used. In this way the Wikipedia article provides synonyms and other related words relevant to the type, like “aerobics”, “cycling (spinning)”, “boxing”, “step yoga”, “regular yoga and hot (Bikram) yoga”, “pilates”, that a user can use at query time to refer to that particular type. Then, the input query n-grams are treated by the term-based matching scheme or engine 107 as queries issued over the index in order to link them to actual types. In an embodiment, the term-based matching scheme or engine 107 may score these n-grams according to known ranking models like BM25 or MLM for providing the plurality of first scores. In order to alleviate a bias to give higher scores to longer n-grams, in an embodiment, the term-based matching scheme or engine 107 may be further configured to normalize the plurality of first scores. In an embodiment, the term-based matching scheme or engine 107 may implement a normalization of the plurality of first scores on the basis of the following equations:
[The normalization equations appear only as embedded images (PCTCN2021074859-appb-000001 to -000003) in the original filing], where Q_n is an n-gram, the third image denotes a candidate type, len() is a function returning the length of an n-gram, and ngr() is a function returning the set of all n-grams of a query.
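Since the normalization equations are not reproduced in text form, the following sketch shows only one plausible normalization consistent with the surrounding description (division by the maximum score over all n-grams of the query, combined with a length factor); the exact formula used in the embodiment may differ:

```python
def normalize_first_scores(raw_scores, query_ngrams):
    """Hypothetical normalization of full-text (e.g. BM25/MLM) scores.

    raw_scores: dict mapping (ngram, candidate_type) -> raw score, where each
    ngram is a tuple of tokens.
    query_ngrams: the set ngr(Q) of all n-grams of the query (tuples of tokens).
    The particular combination of max-score and length normalization below is an
    assumption; the patent only states that both quantities are used.
    """
    if not raw_scores:
        return {}
    query_len = max(len(ng) for ng in query_ngrams)
    max_score = max(raw_scores.values())
    normalized = {}
    for (ngram, ctype), s in raw_scores.items():
        length_factor = len(ngram) / query_len  # compensate the length bias
        normalized[(ngram, ctype)] = (s / max_score) * length_factor
    return normalized
```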
The apparatus 100 is further configured to determine, using the vector-based matching scheme or engine 109, for each of the plurality of n-grams provided by the tokenizer engine 103 a plurality of second scores and an associated plurality of second candidate types from the pre-defined vocabulary of candidate types, as illustrated by processing step 311 in figure 3, which will be described in more detail in the context of figures 5a and 5b below. Like the plurality of first scores, each score of the plurality of second scores is indicative of the respective semantic similarity between the n-gram and the respective second candidate type of the plurality of second candidate types.
The apparatus 100 is further configured to determine, using the sequence tagging model or engine 111, for each of the plurality of tokens of the input text sample 101 provided by the tokenizer engine 103 a probability distribution over a given set of categories. In an embodiment, the apparatus 100 may implement a conventional sequence tagging model or engine 111, such as Conditional Random Fields (CRFs), RNNs (LSTMs or GRUs) or Transformer-based sequence tagging, for providing the probability distribution for each token. In an embodiment, the sequence tagging model or engine 111 is configured to determine, for a given set of categories and the text sample Q 101, the probability distribution on the basis of the following (soft-max) equation:
[The soft-max equation appears only as an embedded image (PCTCN2021074859-appb-000004) in the original filing], where t_i is a token in the tokenized text sample, Q is the text sample, c_k is a particular category, h_i is a vector computed by any of the conventional sequence tagging models mentioned above, and exp is the natural exponential function.
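As the soft-max equation itself is not reproduced in text form, the sketch below merely shows the generic soft-max over per-category scores that the surrounding text suggests, assuming h_i is a vector of category scores produced by the tagger for token t_i:

```python
import numpy as np

def category_distribution(h_i):
    """Assumed soft-max over category scores h_i for one token t_i.

    h_i: 1-D array of length m (number of categories) produced by a CRF, RNN or
    Transformer-based tagger; the exact parameterization inside the tagger is not
    specified here and is an assumption of this sketch.
    """
    z = np.exp(h_i - np.max(h_i))  # subtract the maximum for numerical stability
    return z / z.sum()
```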
Moreover, the apparatus 100 is configured to determine for each of the plurality of n-grams a respective third score based on the probability distributions of all the tokens provided by the sequence tagging model or engine 111, wherein the third score is indicative of the degree to which each n-gram represents a type, and then a plurality of composite scores based on the plurality of first scores, the plurality of second scores and the third score, as illustrated in processing step 315 of figure 3. In the embodiment illustrated in figure 3, the apparatus 100 is configured to determine the plurality of composite scores as a respective weighted sum of a respective first score, a respective second score and a respective third score of each of the plurality of n-grams. In an embodiment, the plurality of composite scores, i.e. the final scores, may be computed by the ranker engine 113.
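A minimal sketch of such a weighted-sum combination is given below; the weight values are purely illustrative assumptions, as the embodiment only specifies that a weighted sum is used:

```python
def composite_score(s1, s2, s3, w1=0.4, w2=0.4, w3=0.2):
    """Weighted sum of the first, second and third scores (processing step 315).

    The weights w1, w2 and w3 are illustrative assumptions only.
    """
    return w1 * s1 + w2 * s2 + w3 * s3
```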
Furthermore, the apparatus 100 is configured, using the ranker engine 113, to rank the plurality of n-grams based on the plurality of composite scores (or final scores) , as illustrated in processing block 317 of figure 3. Thus, as described above, the ranker engine 113 is configured to perform a ranking based on a combination of the probability scores, i.e. third scores provided by the sequence tagging model or engine 111 and the first and second scores provided by the term-based matching scheme or engine 107 and the vector-based matching scheme or engine 109, respectively.
In an embodiment that will be described in more detail below in the context of figure 6a, the ranker engine 113 may be configured to use a mixture model such that, given an n-gram [n1 n2 … nm], the ranker engine 113 sums the probabilities of the tokens that are annotated as [TYPE], i.e. belong to the category [TYPE], and subtracts the probabilities of tokens that are annotated using a negative set of tags. In a further embodiment that will be described in more detail below in the context of figure 6b, the ranker engine 113 may be configured to check the probabilities of the annotations for each token in the 2nd and 3rd ranks of the probability distributions provided by the sequence tagging model or engine 111.
Once the plurality of n-grams have been ranked by the ranker engine 113 , one or more of the highest scoring n-grams and a list of the associated composite scores, i.e. final scores and candidate types may be provided as the output of the apparatus 100, as illustrated in processing block 319 of figure 3.
As already described above, figure 3 shows a schematic diagram illustrating processing blocks implemented by the type matching apparatus 100 according to an embodiment. As already mentioned, the input text sample 101 may be a user question or query and the output may be a ranked list of triples of candidate mentions and types with a confidence score of the form <ni, ti, si>. Figure 3 depicts a hybrid pipeline for type detection and linking, i.e. type matching, which is configured to detect which part (n-gram) of the input text sample 101 corresponds to a type and also to match that part to the most related dictionary type. As already described above, the processing chain illustrated in figure 3 first checks what the language of the input text sample is (processing block 301). If the language is one from the list of languages requiring segmentation, it calls an external segmentation engine (processing block 303) suitable for the particular language, such as PyThaiNLP (Thai), CoreNLP/Jieba (Chinese), or JapaneseTokenizer (Japanese). The processing chain then proceeds to further tokenize the sentence (processing block 305) using tokenizers known in the art such as TokTok and computes the n-grams of the tokenized text sample (processing block 307). As already described above, the processing chain then makes use of the term-based matching scheme or engine 107 (an implementation of which will be described in more detail in the context of figure 4) and the vector-based matching scheme or engine 109 (an implementation of which will be described in more detail in the context of figures 5a and 5b) for detection and matching, which assign first and second scores s1 and s2 (respectively) to each candidate mention and type. Moreover, as already described above, the sequence tagging model or engine 111 (processing block 313) is used to compute a probability distribution for each of the tokens. Subsequently, the processing chain illustrated in figure 3 computes, using the probability distribution, the third score, i.e. the tagging score s3 for each n-gram, and a composite score from the first, second, and third score (processing block 315).
As already described above, figure 4 shows a schematic diagram illustrating in more detail processing blocks implemented by the type matching apparatus 100 according to an embodiment for determining the plurality of first scores using the term-based matching scheme or engine 107. The input in figure 4 is a list of n-grams 101a computed during the execution of the processing blocks of figure 3 and the output is a list of triples of candidate mentions and types with a confidence, i.e. first score of the form <ni, ti, si>. In the embodiment shown in figure 4, the term-based matching scheme or engine 107 is configured to use an inverted index (processing block 405) which associates a type ti to at least one external document (e.g., a Wikipedia document) that best describes the meaning of the type. The association may be performed using a semi-automated mapping or alignment algorithm (processing block 407) between the external document corpus (block 409) and the type dictionary. Each n-gram ni is used as a free text query in order to retrieve candidate types ti for ni (processing block 402) . Each retrieved candidate is associated with a ranking score computed by the full-text engine (processing block 403) (common ranking models include, but are not limited to BM25, MLM, and more) . The first scores returned by the full text engine may be normalized according to the maximum  score returned and the length of each n-gram (see  processing blocks  411 and 413 of figure 4) . Finally, the candidate triples are ranked and returned to the further processing blocks of figure 3 (see  processing blocks  415 and 417 of figure 4) .
As already described above, figures 5a and 5b show schematic diagrams illustrating in more detail processing blocks implemented by the type matching apparatus 100 according to an embodiment for determining the plurality of second scores using the vector-based matching scheme or engine 109. The input of the processing chain illustrated in figure 5a is a list of n-grams 101a computed during the execution of the processing blocks of figure 3 and the output is a list of triples of candidate mentions and types with a confidence, i.e. second score of the form <ni, ti, si>. For each input n-gram the processing block 501 calls the processing blocks shown in figure 5b to compute a respective embedding vector for each n-gram. Furthermore, in a pre-processing step the processing chain shown in figure 5a also loads and computes a vector embedding for each type in the pre-defined dictionary (see  processing blocks  503 and 505 of figure 5a) . The processing proceeds by computing the cosine similarity between each n-gram vector and each type-vector in processing block 507. The cosine similarity also assigns a similarity (confidence) score between each pair. The algorithm finally ranks the candidate triples and returns them to the processing blocks of figure 3 (see  processing blocks  509 and 511 of figure 5a) .
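The following sketch illustrates the cosine-similarity matching of figure 5a in simplified form; the top-k cut-off and the dictionary structures are assumptions made for the example only:

```python
import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def vector_based_match(ngram_vectors, type_vectors, top_k=5):
    """Sketch of figure 5a: score every (n-gram, type) pair by cosine similarity.

    ngram_vectors: dict mapping an n-gram (tuple of tokens) -> embedding vector.
    type_vectors:  dict mapping a dictionary type -> embedding vector.
    Returns the top_k triples <n-gram, type, second score>; top_k is an assumption.
    """
    triples = []
    for ng, nv in ngram_vectors.items():
        for t, tv in type_vectors.items():
            triples.append((ng, t, cosine(nv, tv)))
    triples.sort(key=lambda x: x[2], reverse=True)
    return triples[:top_k]
```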
The input of the processing chain illustrated in figure 5b is an n-gram 101b consisting of a list of tokens and the output is an embedding of the n-gram into a word vector space, i.e. an embedding vector. As already described above, in the case of a text sample 101 in a segmented language, the upper processing branch of figure 5b uses a weighted average formula to compute the embedding vector (see in particular processing block 535 of figure 5b). Such a weighted average can be computed using, for each token of the n-gram, its idf (inverse document frequency) computed on a large document corpus (see blocks 521, 523, and 529) and a pre-computed vector (see block 533) that can be found in vector dictionaries such as fasttext, word2vec, GloVe, Numberbatch or BPEmb. However, a list of common tokens mined from the type dictionary (see blocks 525 and 527) may be used, for which the weight is further reduced (see processing blocks 525 and 531 of figure 5b). As already described above, the vector-based matching scheme or engine 109 may implement a further processing branch (in figure 5b the lower processing branch) for determining embedding vectors for words in languages that require segmentation, such as Chinese, Japanese, Thai, Korean and the like. This alternative processing branch includes concatenating the n-gram tokens without using spaces (see processing block 539 of figure 5b), retrieving a vector for the concatenation (see processing block 541 of figure 5b) and finally taking the mean of the retrieved vector and the one computed using the weighted average approach (see processing block 549 of figure 5b). In languages not requiring segmentation this additional vector may be set to the zero vector (block 543), so that the processing chain then proceeds to decide and return the weighted average vector (blocks 545 and 547) computed as described above.
As already described above, figures 6a and 6b show schematic diagrams illustrating in more detail processing blocks implemented by the type matching apparatus 100 according to two different embodiments for determining the plurality of third scores using a sequence tagging model or engine 111. The input of the processing chain illustrated in figure 6a is a single n-gram 101b and the probability distribution computed by the sequence tagger (see blocks 601 and 603). The output is a third score indicating the confidence that this n-gram represents a type (see processing block 611 of figure 6a). The score is computed as a weighted average of the probabilities that each individual token of that n-gram is believed to be a type or is believed to be of some other category. The processing chain considers for each token the most likely category predicted by the sequence tagger (see processing block 607); tokens annotated with type (T) are counted positively, whereas tokens annotated with categories from a list of negative categories are counted negatively (see processing blocks 605 and 609 of figure 6a).
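A simplified sketch of this computation is shown below; the tag names "TYPE", "ENTITY" and "OTHER", as well as the plain averaging at the end, are illustrative assumptions:

```python
def third_score_simple(token_distributions, categories,
                       type_tag="TYPE", negative_tags=("ENTITY", "OTHER")):
    """Sketch of figure 6a: score an n-gram from its tokens' top-1 categories.

    token_distributions: one probability vector per token of the n-gram.
    categories: list of category names aligned with the probability vectors.
    """
    score = 0.0
    for dist in token_distributions:
        best = max(range(len(categories)), key=lambda k: dist[k])
        if categories[best] == type_tag:
            score += dist[best]   # tokens tagged as a type count positively
        elif categories[best] in negative_tags:
            score -= dist[best]   # negatively tagged tokens are subtracted
    return score / max(len(token_distributions), 1)
```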
In one embodiment, depicted in figure 6b, the processing scheme illustrated in figure 6a is further refined by using more complex logic. The main difference with respect to the processing scheme shown in figure 6a is that, for instance, the top-3 ranks of probabilities and category candidates are used in order to compute a score for the given n-gram, based again on its individual tokens and the probability distribution computed by the sequence tagger (see blocks 621 and 623). The processing chain initially sets the third score to 0 (see block 625) and then iterates over the tokens of the n-gram (see blocks 627 and 631), counting positively tokens annotated as types in the first rank (blocks 633 and 635), or annotated as types in the second rank (see blocks 637 and 639), or annotated as types in the third rank provided that neither in the first nor in the second rank they are annotated with one of the negative categories (see blocks 641 and 643). In addition, the scheme counts negatively tokens annotated with one of the negative categories in the first rank (see blocks 645 and 647) or annotated with one of the negative categories in the second rank (see blocks 649 and 651). In any of the preceding steps the probability used to count positively or negatively in the computation of the third score may be scaled according to the rank that is used (see blocks 635, 639, 643, 647, and 651). Categories of lower rank count less in the overall third score, whereas categories of higher rank count more. When all tokens of the n-gram have been processed the third score is returned (see block 629).
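The refined top-3 logic may be sketched as follows; the rank scaling factors and the tag names are illustrative assumptions, the embodiment only requiring that lower ranks contribute less:

```python
def third_score_top3(token_distributions, categories,
                     type_tag="TYPE", negative_tags=("ENTITY", "OTHER"),
                     rank_weights=(1.0, 0.5, 0.25)):
    """Sketch of figure 6b: use the top-3 ranked categories of each token."""
    score = 0.0
    for dist in token_distributions:
        ranked = sorted(range(len(categories)), key=lambda k: dist[k], reverse=True)
        top3 = ranked[:3]
        cats = [categories[k] for k in top3]
        if cats[0] == type_tag:                      # type in the first rank
            score += rank_weights[0] * dist[top3[0]]
        elif len(cats) > 1 and cats[1] == type_tag:  # type in the second rank
            score += rank_weights[1] * dist[top3[1]]
        elif (len(cats) > 2 and cats[2] == type_tag  # type in the third rank, but only
              and cats[0] not in negative_tags       # if ranks 1 and 2 are not negative
              and cats[1] not in negative_tags):
            score += rank_weights[2] * dist[top3[2]]
        elif cats[0] in negative_tags:               # negative category in the first rank
            score -= rank_weights[0] * dist[top3[0]]
        elif len(cats) > 1 and cats[1] in negative_tags:  # negative in the second rank
            score -= rank_weights[1] * dist[top3[1]]
    return score
```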
Figure 7 shows a variant of the processing scheme of figure 3 that may be implemented by the type matching apparatus 100 according to a further embodiment. The main difference with respect to the processing scheme shown in figure 3 is that, in addition to the term-based (block 709) and the vector-based (block 715) schemes, a learning-based matching scheme is used (see processing block 711) to compute a fourth score. In contrast to figure 3, this fourth score is also considered by processing block 717 in order to compute a composite score. Otherwise, the rest of the processing blocks and the processing chain are the same as in figure 3, and are briefly summarized as checking if the input language requires segmentation (block 701) and if necessary segmenting it (block 703), tokenizing the input text sample (block 705), computing the n-grams (block 707), computing the first score (block 709), the second score (block 715) and the probability distribution (block 713), computing a composite score from all scores (block 717), ranking the composite scores of the n-grams (block 719) and returning the top ranked n-grams (block 721).
Figure 8a shows a schematic diagram illustrating in more detail processing blocks implemented by the type matching apparatus 100 according to an embodiment for determining a plurality of fourth scores for the composite scores of figure 7 using a learning-based matching scheme. The input of the processing chain is the list of n-grams, the list of types from the dictionary and the probability distribution as computed by the sequence tagger 111. The processing chain illustrated in figure 8a retrieves the computed probability distribution (block 803) from the sequence tagger 111 (block 801) and uses it to compute a vector representation g for the input sentence (block 805) using a first neural network architecture. This architecture can be any of those known in the art, such as, but not limited to, ConvNets, neural networks using weighted sums and non-linearities, sequence-to-sequence models, transformers or attention mechanisms; particular example implementations are given next. In addition, the processing chain illustrated in figure 8a computes embedding vectors for the n-grams (block 807) and embedding vectors for the plurality of types (blocks 811 and 809) using the procedure of figure 5b that has been described in detail further above. All vector representations (the sentence vector g computed in 805, the n-gram vector and the type vector) are concatenated and multiplied by a matrix (see block 813) that represents a second neural network. The chain proceeds to compute the energy of the final vector as its L1 or L2 norm (block 815), uses the energies as scores to rank the candidates (block 817) and finally returns the highest ranked n-gram and candidate together with the fourth score.
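A minimal sketch of the concatenation, second-network multiplication and energy computation (blocks 813 to 815) is given below; the matrix W standing in for the second neural network and its dimensions are assumptions, and the sentence vector g is assumed to come from one of the first-network sketches discussed with figures 8b and 8c below:

```python
import numpy as np

def learning_based_score(sentence_repr, ngram_vec, type_vec, W, norm="l2"):
    """Sketch of blocks 813-815 of figure 8a.

    sentence_repr: vector g computed from the tagger's probability distribution
    by a first network.
    W: matrix standing in for the second (feed-forward) neural network; its shape
    and training procedure are not specified here and are assumptions.
    The energy (L1 or L2 norm) of the resulting vector is used as the fourth score.
    """
    x = np.concatenate([sentence_repr, ngram_vec, type_vec])
    y = W @ x
    return float(np.sum(np.abs(y)) if norm == "l1" else np.linalg.norm(y))
```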
Figure 8b shows a schematic diagram illustrating a possible architecture of a first neural network for providing a vector representation from the probability distribution in processing block 805 of the learning-based matching scheme of figure 8a as implemented by the type matching apparatus 100 according to an embodiment. The figure shows a particular implementation of a ConvNet network (see block 833) for computing a vector representation. The probability distribution from the sequence tagger 111 is arranged as a matrix of dimensions λ×m, where λ is the length of the input sentence (block 835) and m is the number of categories. This matrix is convolved with a plurality of filters of size k×m, also called a volume (block 837), and the output is flattened and further processed by a fully-connected layer (block 839). The final vector is returned to the further processing blocks of figure 8a.
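A plain-NumPy sketch of this convolution over the λ×m probability matrix is given below; the number of filters, the filter height k, the fully-connected layer size and the ReLU non-linearity are illustrative assumptions (the sketch assumes λ ≥ k):

```python
import numpy as np

def convnet_sentence_vector(prob_matrix, filters, W_fc, b_fc):
    """Sketch of figure 8b: convolve the probability matrix and flatten.

    prob_matrix: array of shape (lam, m), one probability row per token.
    filters: array of shape (num_filters, k, m), the convolutional volume.
    W_fc, b_fc: parameters of the fully-connected layer applied after flattening.
    """
    lam, m = prob_matrix.shape
    num_filters, k, _ = filters.shape
    conv_out = np.zeros((num_filters, lam - k + 1))
    for f in range(num_filters):
        for i in range(lam - k + 1):
            conv_out[f, i] = np.sum(prob_matrix[i:i + k, :] * filters[f])
    flat = conv_out.flatten()
    return np.maximum(W_fc @ flat + b_fc, 0.0)  # ReLU non-linearity (assumed)
```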
Figure 8c shows a schematic diagram illustrating the architecture of another possible implementation of a first neural network for providing a vector representation in processing block 805 of the learning-based matching scheme as implemented by the type matching apparatus 100 according to an embodiment. This is a particular example of a weighted average and non-linearity network (block 843). The probabilities that each token belongs to each category are used to compute a weighted average, which is then processed by a non-linear function such as, but not limited to, ReLU or leaky ReLU. The final vector is returned to the further processing blocks of figure 8a.
Figure 9 is a flow diagram illustrating different steps of a method 900 for processing, in particular type matching, the text sample 101. The method 900 comprises the steps of:
parsing 901 a plurality of tokens from the text sample 101, i.e. tokenizing the text sample 101;
generating 903 a plurality of n-grams based on the plurality of tokens, wherein each n-gram comprises one or more of the plurality of tokens;
determining 905 for each of the plurality of n-grams, using a term-based matching scheme, one or more first scores and one or more first candidate types from a pre-defined vocabulary of candidate types, wherein the one or more first scores are indicative of the semantic similarity between the n-gram and the one or more first candidate types;
determining 907 for each of the plurality of n-grams, using a vector-based matching scheme, one or more second scores and one or more second candidate types from the pre-defined vocabulary of candidate types, wherein the one or more second scores are indicative of the semantic similarity between the n-gram and the one or more second candidate types;
determining 909 for each of the plurality of n-grams, using a sequence tagging model, a third score, wherein the third score is indicative of the degree in which each n-gram represents a type;
determining 911 for each of the plurality of n-grams one or more composite scores based on the one or more first scores, the one or more second scores and the third score; and
ranking 913 the plurality of n-grams 401 based on the one or more composite scores.
The type matching method 900 may further comprise the step of returning as output one or more of the highest scoring n-grams and a list of the associated composite scores and candidate types.
The type matching method 900 may be performed by the type matching apparatus 100 described above. Thus, further features of the type matching method 900 result directly from the functionality of the type matching apparatus 100 and its different embodiments described above and below.
The person skilled in the art will understand that the "blocks" ( "units" ) of the various figures (method and apparatus) represent or describe functionalities of embodiments of the present disclosure (rather than necessarily individual "units" in hardware or software) and thus describe equally functions or features of apparatus embodiments as well as method embodiments (unit = step) .
In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus, and method may be implemented in other manners. For example, the described embodiment of an apparatus is merely exemplary. For example, the unit division is merely logical function division and may be another division in an actual implementation. For example, a plurality of units or components may be combined or integrated into another system, or some features may be ignored or not performed. In addition, the displayed or discussed mutual couplings or direct couplings or communication connections may be implemented by using some interfaces. The indirect couplings or communication connections between the apparatuses or units may be implemented in electronic, mechanical, or other forms.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one position, or may be distributed on a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.
In addition, functional units in the embodiments of the invention may be integrated into one processing unit, or each of the units may exist alone physically, or two or more units are integrated into one unit.

Claims (18)

  1. An apparatus (100) for processing a text sample (101) , wherein the apparatus (100) is configured to:
    parse a plurality of tokens from the text sample (101) ;
    generate a plurality of n-grams (401) based on the plurality of tokens, wherein each n-gram comprises one or more of the plurality of tokens;
    determine for each of the plurality of n-grams (401) , using a term-based matching scheme (107) , one or more first scores and one or more first candidate types from a pre-defined vocabulary of candidate types;
    determine for each of the plurality of n-grams (401) , using a vector-based matching scheme (109) , one or more second scores and one or more second candidate types from the pre-defined vocabulary of candidate types;
    determine for each of the plurality of n-grams (401) , using a sequence tagging model (111) , a third score;
    determine for each of the plurality of n-grams (401) one or more composite scores based on the one or more first scores, the one or more second scores and the third score; and
    rank the plurality of n-grams (401) based on the one or more composite scores.
  2. The apparatus (100) of claim 1, wherein the sequence tagging model (111) defines for each of the plurality of tokens of the text sample (101) a probability that the token is a type.
  3. The apparatus (100) of claim 2, wherein the apparatus (100) is configured to determine for each of the plurality of n-grams (401) , using the sequence tagging model (111) , the third score as a weighted average of the probabilities that each of the one or more tokens of the n-gram is a type.
  4. The apparatus (100) of claim 3, wherein the apparatus (100) is configured to determine the weighted average of the probabilities that each of the one or more tokens of the n-gram is a type by using a negative weight for those of the one or more tokens of the n-gram that are a non-type.
  5. The apparatus (100) of claim 2, wherein the apparatus (100) is configured to determine for each of the plurality of n-grams (401) , using the sequence tagging model (111) , the third score based on those probabilities that each of the one or more tokens of the n-gram is a type that exceed a threshold probability.
  6. The apparatus (100) of any one of the preceding claims, wherein the one or more composite scores are one or more weighted sums of the one or more first scores, the one or more second scores and the third score of each of the plurality of n-grams.
  7. The apparatus (100) of any one of the preceding claims, wherein the apparatus (100) is further configured to:
    determine for each of the plurality of n-grams (401) , using a learning-based matching scheme (711) , one or more fourth scores and one or more fourth candidate types from the pre-defined vocabulary of candidate types; and
    determine for each of the plurality of n-grams (401) the one or more composite scores based on the one or more first scores, the one or more second scores, the third score and the one or more fourth scores.
  8. The apparatus (100) of claim 7, wherein for determining the one or more fourth scores for each of the plurality of n-grams (401) using the learning-based matching scheme the apparatus (100) is configured to:
    determine an embedding vector for the text sample from the probabilities of each of the one or more of its tokens using a first neural network (833) ;
    determine an embedding vector for the n-gram and a plurality of further embedding vectors for the plurality of candidate types from the pre-defined vocabulary of candidate types; and
    feed the embedding vector for the text sample, the embedding vector for the n-gram and the further plurality of embedding vectors into a second feed-forward neural network (813) for determining the one or more fourth scores.
  9. The apparatus (100) of any one of the preceding claims, wherein for determining the one or more first scores for each of the plurality of n-grams using the term-based matching scheme (107) the apparatus (100) is configured to:
    search based on the one or more tokens of the n-gram for the one or more first candidate types using an inverted index mapping tokens to types;
    determine a search score for each of the one or more first candidate types; and
    determine one or more of the largest search scores of the one or more first candidate types as the one or more first scores for the n-gram.
  10. The apparatus (100) of claim 9, wherein the apparatus (100) is configured to generate the inverted index by:
    associating each of the plurality of first candidate types with one or more documents from a knowledge base describing the respective first candidate type by means of one or more tokens; and
    generating the inverted index by mapping each of the one or more tokens of the one or more documents from the knowledge base to the respective first candidate type.
  11. The apparatus (100) of claim 9 or 10, wherein for determining the one or more first scores for each of the plurality of n-grams (401) using the term-based matching scheme  (107) the apparatus (100) is further configured to normalize the one or more first scores for each of the n-grams (401) based on the number of tokens of the respective n-gram.
  12. The apparatus (100) of any one of the preceding claims, wherein for determining the one or more second scores for each of the plurality of n-grams (401) using the vector-based matching scheme (109) the apparatus (100) is configured to:
    determine an embedding vector for the n-gram;
    determine a plurality of similarity scores between the embedding vector for the n-gram and a plurality of further embedding vectors, wherein the plurality of further embedding vectors are based on the plurality of second candidate types from the pre-defined dictionary of candidate types; and
    determine one or more of the largest similarity scores as the one or more second scores for the n-gram.
  13. The apparatus (100) of claim 12, wherein for each n-gram the apparatus (100) is configured to determine the embedding vector for the n-gram as a weighted sum of a plurality of token vectors, wherein each token vector is associated with one of the one or more tokens of the n-gram.
  14. The apparatus (100) of claim 13, wherein the text sample is a non-segmented text sample and wherein the apparatus is further configured to segment the text sample into one or more tokens.
  15. The apparatus (100) of claim 14, wherein the apparatus (100) is further configured to:
    concatenate the tokens of the n-gram for obtaining a composite token comprising a continuous string of characters;
    retrieve a composite vector associated with the composite token from a dictionary of vectors;
    determine the embedding vector as the average vector of the composite vector and the weighted sum of the plurality of token vectors.
  16. The apparatus (100) of any one of the preceding claims, wherein the text sample (101) is a query (101) and wherein the apparatus (100) is configured to provide one or more query results based on one or more of the plurality of n-grams ranked based on the one or more composite scores.
  17. A method (900) for processing a text sample (101) , wherein the method (900) comprises:
    parsing (901) a plurality of tokens from the text sample (101) ;
    generating (903) a plurality of n-grams (401) based on the plurality of tokens, wherein each n-gram comprises one or more of the plurality of tokens;
    determining (905) for each of the plurality of n-grams (401) , using a term-based matching scheme, one or more first scores and one or more first candidate types from a pre-defined vocabulary of candidate types;
    determining (907) for each of the plurality of n-grams (401) , using a vector-based matching scheme, one or more second scores and one or more second candidate types from the pre-defined vocabulary of candidate types;
    determining (909) for each of the plurality of n-grams (401) , using a sequence tagging model, a third score;
    determining (911) for each of the plurality of n-grams (401) one or more composite scores based on the one or more first scores, the one or more second scores and the third score; and
    ranking (913) the plurality of n-grams (401) based on the one or more composite scores.
  18. A computer program product comprising a non-transitory computer-readable storage medium for storing program code which causes a computer or a processor to perform the method (900) of claim 17, when the program code is executed by the computer or the processor.


