WO2022165634A1 - Apparatus and method for type matching of a text sample - Google Patents

Apparatus and method for type matching of a text sample

Info

Publication number
WO2022165634A1
Authority
WO
WIPO (PCT)
Prior art keywords
scores
grams
gram
tokens
vector
Prior art date
Application number
PCT/CN2021/074859
Other languages
French (fr)
Inventor
Georgios Stoilos
Nikos PAPASARANTOPOULOS
Pavlos VOUGIOUKLIS
Patrik BANSKY
Yantao Jia
Original Assignee
Huawei Technologies Co., Ltd.
Priority date
Filing date
Publication date
Application filed by Huawei Technologies Co., Ltd. filed Critical Huawei Technologies Co., Ltd.
Priority to PCT/CN2021/074859
Publication of WO2022165634A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/216 Parsing using statistical methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/237 Lexical tools
    • G06F40/242 Dictionaries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis

Definitions

  • the present disclosure relates to information processing technology. More specifically, the present disclosure relates to an apparatus and method for type matching of a text sample.
  • Annotating text or text samples, such as documents, queries and the like with Knowledge Base resources like entities, relations, and types is an important step for many downstream tasks like semantic search and question answering.
  • Although entity and relation linking have been extensively studied in the past, less attention has been spent on type linking and matching even though the problem is challenging in many domains.
  • an apparatus for processing, in particular type matching a text sample is provided.
  • the text sample may, for instance, be a user search query and the apparatus based on the type matching may return one or more query results in response to the search query.
  • the apparatus is configured to parse a plurality of tokens (also referred to as terms) from the text sample, i.e. tokenize the text sample, and to generate a plurality of n-grams based on the plurality of tokens, wherein each n-gram comprises one or more of the plurality of tokens.
  • the apparatus is further configured to determine for each of the plurality of n-grams, using a term-based matching scheme, one or more first scores and one or more first candidate types from a pre-defined vocabulary of candidate types, wherein the one or more first scores are indicative of a semantic similarity between the n-gram and the one or more first candidate types.
  • the apparatus is further configured to determine for each of the plurality of n-grams, using a vector-based matching scheme, one or more second scores and one or more second candidate types from the pre-defined vocabulary of candidate types, wherein (like the one or more first scores) the one or more second scores are indicative of a semantic similarity between the n-gram and the one or more second candidate types.
  • the apparatus is further configured to determine for each of the plurality of n-grams, using a sequence tagging model, a third score, wherein the third score is indicative of the degree to which each n-gram represents a type.
  • the apparatus is configured to determine for each of the plurality of n-grams one or more composite scores based on the one or more first scores, the one or more second scores and the third score and to rank the plurality of n-grams based on the one or more composite scores.
  • One or more of the highest scoring n-grams and a list of the associated composite scores and candidate types may be provided as the output of the apparatus.
  • the sequence tagging model defines for each of the plurality of tokens of the text sample a probability that the respective token is a type, i.e. belongs to the category TYPE.
  • the apparatus is configured to determine for each of the plurality of n-grams, using the sequence tagging model, the third score as a weighted average of the probabilities that each of the one or more tokens of the n-gram is a type, i.e. belongs to the category TYPE.
  • the apparatus is configured to determine the weighted average of the probabilities that each of the one or more tokens of the n-gram is a type, i.e. belongs to the category TYPE by using a negative weight for those of the one or more tokens of the n-gram that are a non-type, i.e. belong to a category other than TYPE.
  • the apparatus is configured to determine for each of the plurality of n-grams, using the sequence tagging model, the third score based on those probabilities that each of the one or more tokens of the n-gram is a type, i.e. belongs to the category TYPE that exceed a threshold probability. For instance, the apparatus may be configured to determine the third score based on the probabilities of the three tokens having the largest probabilities.
  • the one or more composite scores are one or more weighted sums of the one or more first scores, the one or more second scores and the third score of each of the plurality of n-grams.
  • the apparatus is further configured to determine for each of the plurality of n-grams, using a learning-based matching scheme, one or more fourth scores and one or more fourth candidate types from the pre-defined vocabulary of candidate types, and to determine the one or more composite scores based on the one or more first scores, the one or more second scores, the third score and the one or more fourth scores, in particular as a weighted sum of these scores.
  • the apparatus, for determining the one or more fourth scores for each of the plurality of n-grams using the learning-based matching scheme, is configured to: determine an embedding vector for the text sample from the probabilities of each of its tokens using a first neural network; determine an embedding vector for the n-gram and a plurality of further embedding vectors for the plurality of candidate types from the pre-defined vocabulary of candidate types; and feed these embedding vectors into a second feed-forward neural network for determining the one or more fourth scores.
  • for determining the one or more first scores for each of the plurality of n-grams using the term-based matching scheme, the apparatus is further configured to normalize the one or more first scores for each of the n-grams based on the number of tokens of the respective n-gram.
  • for determining the one or more second scores for each of the plurality of n-grams using the vector-based matching scheme, the apparatus is configured to: determine an embedding vector for the n-gram; determine a plurality of similarity scores, e.g. cosine similarity scores, between the embedding vector for the n-gram and a plurality of further embedding vectors, wherein the plurality of further embedding vectors are based on the plurality of candidate types from the pre-defined dictionary of candidate types; and determine one or more of the largest similarity scores as the one or more second scores for the respective n-gram.
  • for each n-gram, the apparatus is configured to determine the embedding vector for the n-gram as a weighted sum of a plurality of token vectors, wherein each token vector is associated with one of the one or more tokens of the n-gram.
  • the text sample is a non-segmented text sample, i.e. in a non-segmented language, such as Chinese, Japanese, Thai and the like, wherein the apparatus is further configured to segment the text sample into the plurality of tokens.
  • the apparatus is further configured to: concatenate the tokens of the n-gram for obtaining a composite token comprising a continuous string of characters; retrieve a composite vector associated with the composite token from a dictionary of vectors; and determine the embedding vector as the average vector of the composite vector and the weighted sum of the plurality of token vectors.
  • where the text sample is a user search query, the apparatus is further configured to provide one or more query results based on one or more of the plurality of n-grams ranked based on the one or more composite scores.
  • a method for processing, in particular type matching, a text sample comprises the steps of: parsing a plurality of tokens from the text sample; generating a plurality of n-grams based on the plurality of tokens, wherein each n-gram comprises one or more of the plurality of tokens; determining the one or more first, second and third scores for each of the plurality of n-grams; determining the one or more composite scores; and ranking the plurality of n-grams based on the one or more composite scores.
  • the method according to the second aspect may further comprise the step of returning as output one or more of the highest scoring n-grams and a list of the associated composite scores and candidate types.
  • the type matching method according to the second aspect can be performed by the type matching apparatus according to the first aspect.
  • further features of the type matching method according to the second aspect result directly from the functionality of the type matching apparatus according to the first aspect as well as its different implementation forms and embodiments described above and below.
  • a computer program product comprising a non-transitory computer-readable storage medium for storing program code which causes a computer or a processor to perform the type matching method according to the second aspect, when the program code is executed by the computer or the processor.
  • Fig. 1 shows a schematic diagram of an apparatus for type matching of a text sample according to an embodiment
  • Fig. 2 shows a schematic diagram illustrating processing blocks implemented by an embedding engine of a type matching apparatus according to an embodiment
  • Fig. 3 shows a schematic diagram illustrating processing blocks implemented by a type matching apparatus according to an embodiment for determining one or more composite scores
  • Fig. 4 shows a schematic diagram illustrating in more detail processing blocks implemented by a type matching apparatus according to an embodiment for determining a plurality of first scores for the composite scores of figure 3 using a term-based matching scheme;
  • Fig. 5a and 5b show schematic diagrams illustrating in more detail processing blocks implemented by a type matching apparatus according to an embodiment for determining a plurality of second scores for the composite scores of figure 3 using a vector-based matching scheme;
  • Fig. 6a and 6b show schematic diagrams illustrating in more detail processing blocks implemented by a type matching apparatus according to two different embodiments for determining a plurality of third scores for the composite scores of figure 3 using a sequence tagging model;
  • Fig. 7 shows a schematic diagram illustrating processing blocks implemented by a type matching apparatus according to a further embodiment for determining one or more composite scores
  • Fig. 8a shows a schematic diagram illustrating in more detail processing blocks implemented by a type matching apparatus according to an embodiment for determining a plurality of fourth scores for the composite scores of figure 7 using a learning-based matching scheme;
  • Fig. 8b shows a schematic diagram illustrating the architecture of a neural network for providing a vector representation that is used by the learning-based matching scheme of figure 8a as implemented by a type matching apparatus according to an embodiment
  • Fig. 8c shows a schematic diagram illustrating the architecture of an alternative neural network for providing a vector representation that is used by the learning-based matching scheme of figure 8a as implemented by a type matching apparatus according to an embodiment
  • Fig. 9 is a flow diagram illustrating a type matching method according to an embodiment.
  • a disclosure in connection with a described method may also hold true for a corresponding device or system configured to perform the method and vice versa.
  • a corresponding device may include one or a plurality of units, e.g. functional units, to perform the described one or plurality of method steps (e.g. one unit performing the one or plurality of steps, or a plurality of units each performing one or more of the plurality of steps) , even if such one or more units are not explicitly described or illustrated in the figures.
  • if a specific apparatus is described based on one or a plurality of units, e.g. functional units, a corresponding method may include one step to perform the functionality of the one or plurality of units (e.g. one step performing the functionality of the one or plurality of units, or a plurality of steps each performing the functionality of one or more of the plurality of units) , even if such one or plurality of steps are not explicitly described or illustrated in the figures. Further, it is understood that the features of the various exemplary embodiments and/or aspects described herein may be combined with each other, unless specifically noted otherwise.
  • Figure 1 shows a schematic diagram of an apparatus 100 for type matching of a text sample according to an embodiment.
  • the apparatus 100 may be implemented as a cloud server 100 configured to receive a text sample 101 via the internet.
  • the text sample 101 is a search query 101, namely the exemplary search query “London Bridge Spa” 101 indicating that a user is searching for a “Spa” near “London Bridge” .
  • the type matching apparatus 100 may implement in software and/or hardware a tokenizer engine 103, an embedding interface 105, a term-based matching scheme or engine 107 (referred to as “termTypeMatcher” 107 in figure 1) , a vector-based matching scheme or engine 109 (referred to as “vecTypeMatcher” 109 in figure 1) , a sequence tagging model or engine 111 (referred to as “QueryTagger” in figure 1) , a learning-based matching scheme or engine 115 (referred to as “learningTypeMatcher” 115 in figure 1) and/or a ranking engine 113 (referred to as “Ranker” in figure 1) .
  • the apparatus 100 may comprise, for instance, one or more processors for processing data, a communication interface for receiving and transmitting data, and a memory for storing data.
  • the one or more processors of the apparatus 100 may comprise digital circuitry, or both analog and digital circuitry.
  • Digital circuitry may comprise components such as application-specific integrated circuits (ASICs) , field-programmable gate arrays (FPGAs) , digital signal processors (DSPs) , or general-purpose processors.
  • the memory may be configured to store executable program code which, when executed by the one or more processors, causes the type matching apparatus 100 to perform the functions and methods described herein.
  • Figure 2 shows a schematic diagram illustrating processing blocks implemented by the embedding engine 105 of the apparatus 100 of figure 1 according to an embodiment.
  • the embedding engine 105 is configured to generate, based on an input such as an n-gram, an embedding vector of the n-gram.
  • the n-gram consists of two tokens.
  • the upper processing branch in figure 2 may implement a conventional approach for determining sentence embeddings which consists of retrieving weights, in particular IDFs for each token (processing block 201) , vectors for each token (processing block 207) and computing the weighted average (processing block 209) .
  • a processing block 203 is provided that is configured to further penalize some words that are frequent in the pre-defined dictionary.
  • the lower processing branch in figure 2 allows handling of an n-gram from a text sample 101 from a language that requires segmentation, such as Chinese, Japanese, Thai, Korean and the like.
  • In a processing block 205 the n-grams are concatenated again and, if a vector for the concatenation exists, then in a processing block 211 the mean of this vector with the vector computed in the upper processing branch of figure 2 may be computed.
  • This approach improves the accuracy in many languages that require segmentation. Moreover, it reduces the space of queries to be evaluated and results to be considered and hence also the execution time.
  • the apparatus 100 for instance, by means of the tokenizer engine 103 is configured to parse a plurality of tokens (also referred to as terms) from the text sample 101, i.e. to tokenize the text sample 101 (see also processing step 305 in figure 3) , and to generate a plurality of n-grams based on the plurality of tokens (see also processing step 307 in figure 3) , wherein each n-gram comprises one or more of the plurality of tokens.
  • the tokenizer engine 103 may implement standard segmentation techniques (such as PyThaiNLP, Jieba, CoreNLP, JapaneseTokenizer) and/or the known “TokTok” tokenizer.
  • the apparatus 100 is further configured to determine, using the term-based matching scheme or engine 107, for each of the plurality of n-grams provided by the tokenizer engine 103 a plurality of first scores and an associated plurality of first candidate types from a pre-defined vocabulary of candidate types, as illustrated by processing step 309 in figure 3, which will be described in more detail in the context of figure 4 below.
  • Each first score of the plurality of first scores is indicative of the respective semantic similarity between the n-gram and the respective first candidate type of the plurality of first candidate types.
  • the term-based matching scheme or engine 107 is configured to link each type in the given pre-defined vocabulary of candidate types to an article in a Knowledge Base, such as Wikipedia that closely describes its meaning.
  • For example, for the type “FitnessClub” the Wikipedia article https://en.wikipedia.org/wiki/Health_club may be used.
  • the Wikipedia article brings synonyms and other related words relevant to the type like “aerobics” , “cycling (spinning) ” , “boxing” , “step yoga” , “regular yoga and hot (Bikram) yoga” , “pilates” that a user can use at query time to refer to that particular type.
  • the term-based matching scheme or engine 107 may score these n-grams according to known ranking models like BM25 or MLM for providing the plurality of first scores.
  • the term-based matching scheme or engine 107 may be further configured to normalize the plurality of first scores.
  • the term-based matching scheme or engine 107 may implement a normalization of the plurality of first scores on the basis of the following equations:
  • Q_n is an n-gram and t is a candidate type;
  • len () is a function returning the length of an n-gram; and
  • ngr () is a function returning the set of all n-grams of a query.
  • the apparatus 100 is further configured to determine, using the vector-based matching scheme or engine 109, for each of the plurality of n-grams provided by the tokenizer engine 103 a plurality of second scores and an associated plurality of second candidate types from the pre-defined vocabulary of candidate types, as illustrated by processing step 311 in figure 3, which will be described in more detail in the context of figures 5a and 5b below.
  • each score of the plurality of second scores is indicative of the respective semantic similarity between the n-gram and the respective second candidate type of the plurality of second candidate types.
  • the apparatus 100 is further configured to determine, using the sequence tagging model or engine 111, a probability distribution over the given categories for each of the plurality of tokens of the input text sample 101 provided by the tokenizer engine 103.
  • the apparatus 100 may implement a conventional sequence tagging model or engine 111, such as Conditional Random Fields (CRFs) , RNNs (LSTMs or GRUs) or Transformer-based sequence tagging, for providing the probability distribution for each token.
  • the sequence tagging model or engine 111 is configured to determine for a given set of categories and the text sample Q 101 the probability distribution on the basis of the following (soft-max) equation:
  • t_i is a token in the tokenized text sample;
  • Q is the text sample;
  • c_k is a particular category;
  • h_i is a vector computed by any of the conventional sequence tagging models mentioned above; and
  • exp is the natural exponential function.
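  • As a sketch only, and under the assumption that the vector h_i is projected onto the m categories by a weight matrix W and bias b (assumed names for the tagger's output-layer parameters), such a soft-max can be written as P (c_k | t_i, Q) = exp (W_k · h_i + b_k) / Σ_{j=1..m} exp (W_j · h_i + b_j) .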
  • the apparatus 100 is configured to determine for each of the plurality of n-grams a respective third score based on the probability distribution for all the tokens provided by the sequence tagging model or engine 111, wherein the third score is indicative of the degree to which each n-gram represents a type, and then a plurality of composite scores based on the plurality of first scores, the plurality of second scores and the third score, as illustrated in processing step 315 of figure 3.
  • the apparatus 100 is configured to determine the plurality of composite scores as a respective weighted sum of a respective first score, a respective second score and a respective third score of each of the plurality of n-grams.
  • the plurality of composite scores, i.e. final scores, may be computed by the ranker engine 113.
  • the apparatus 100 is configured, using the ranker engine 113, to rank the plurality of n-grams based on the plurality of composite scores (or final scores) , as illustrated in processing block 317 of figure 3.
  • the ranker engine 113 is configured to perform a ranking based on a combination of the probability scores, i.e. third scores provided by the sequence tagging model or engine 111 and the first and second scores provided by the term-based matching scheme or engine 107 and the vector-based matching scheme or engine 109, respectively.
  • the ranker engine 113 may be configured to use a mixture model such that given an n-gram [n1 n2 ...nm] the ranker engine 113 sums the probabilities of the tokens that are annotated as [TYPE] , i.e. belong to the category [TYPE] and subtracts the probabilities of tokens that are annotated using a negative set of tags.
  • the ranker engine 113 may be configured to check the probabilities of the annotations for each token in the 2nd and 3rd ranks of the probability distributions provided by the sequence tagging model or engine 111.
  • one or more of the highest scoring n-grams and a list of the associated composite scores, i.e. final scores and candidate types may be provided as the output of the apparatus 100, as illustrated in processing block 319 of figure 3.
  • figure 3 shows a schematic diagram illustrating processing blocks implemented by the type matching apparatus 100 according to an embodiment.
  • the input text sample 101 may be a user question or query and the output may be a ranked list of triples of candidate mentions and types with a confidence score of the form <ni, ti, si>.
  • Figure 3 depicts a hybrid pipeline for type detection and linking, i.e. type matching which is configured to detect which part (n-gram) of the input text sample 101 corresponds to a type and also match that part to the most related dictionary type.
  • the processing chain illustrated in figure 3 first determines the language of the input text sample (processing block 301) .
  • If the language is one from the list of languages requiring segmentation, the processing chain calls an external segmentation engine (processing block 303) suitable for the particular language such as PyThaiNLP (Thai) , CoreNLP/Jieba (Chinese) , or JapaneseTokenizer (Japanese) .
  • the processing chain then proceeds to further tokenize the sentence (processing block 305) using tokenizers known in the art such as TokTok and computes the n-grams of the tokenized text sample (processing block 307) .
  • the processing chain then makes use of the term-based matching scheme or engine 107 (an implementation of which will be described in more detail in the context of figure 4) and the vector-based matching scheme or engine 109 (an implementation of which will be described in more detail in the context of figures 5a and 5b) for detection and matching which assign first and second scores s1 and s2 (respectively) to each candidate mention and type.
  • the sequence tagging model or engine 111 (processing block 313) is used to compute a probability distribution for each of the tokens.
  • the processing chain illustrated in figure 3 computes using the probability distribution the third score, i.e. the tagging score s3 for each n-gram, and a composite score from the first, second, and third score (processing block 315) .
  • figure 4 shows a schematic diagram illustrating in more detail processing blocks implemented by the type matching apparatus 100 according to an embodiment for determining the plurality of first scores using the term-based matching scheme or engine 107.
  • the input in figure 4 is a list of n-grams 101a computed during the execution of the processing blocks of figure 3 and the output is a list of triples of candidate mentions and types with a confidence, i.e. first score of the form <ni, ti, si>.
  • the term-based matching scheme or engine 107 is configured to use an inverted index (processing block 405) which associates a type ti to at least one external document (e.g., a Wikipedia document) that best describes the meaning of the type.
  • the association may be performed using a semi-automated mapping or alignment algorithm (processing block 407) between the external document corpus (block 409) and the type dictionary.
  • Each n-gram ni is used as a free text query in order to retrieve candidate types ti for ni (processing block 402) .
  • Each retrieved candidate is associated with a ranking score computed by the full-text engine (processing block 403) (common ranking models include, but are not limited to BM25, MLM, and more) .
  • the first scores returned by the full text engine may be normalized according to the maximum score returned and the length of each n-gram (see processing blocks 411 and 413 of figure 4) .
  • the candidate triples are ranked and returned to the further processing blocks of figure 3 (see processing blocks 415 and 417 of figure 4) .
  • figures 5a and 5b show schematic diagrams illustrating in more detail processing blocks implemented by the type matching apparatus 100 according to an embodiment for determining the plurality of second scores using the vector-based matching scheme or engine 109.
  • the input of the processing chain illustrated in figure 5a is a list of n-grams 101a computed during the execution of the processing blocks of figure 3 and the output is a list of triples of candidate mentions and types with a confidence, i.e. second score of the form <ni, ti, si>.
  • the processing block 501 calls the processing blocks shown in figure 5b to compute a respective embedding vector for each n-gram.
  • the processing chain shown in figure 5a also loads and computes a vector embedding for each type in the pre-defined dictionary (see processing blocks 503 and 505 of figure 5a) .
  • the processing proceeds by computing the cosine similarity between each n-gram vector and each type-vector in processing block 507.
  • the cosine similarity also assigns a similarity (confidence) score between each pair.
  • the algorithm finally ranks the candidate triples and returns them to the processing blocks of figure 3 (see processing blocks 509 and 511 of figure 5a) .
  • the input of the processing chain illustrated in figure 5b is an n-gram 101b consisting of a list of tokens and the output is an embedding of the n-gram into a word vector space, i.e. an embedding vector.
  • the upper processing branch of figure 5b uses a weighted average formula to compute the embedding vector (see in particular processing block 535 of figure 5b) .
  • Such a weighted average can be computed using for each token of the n-gram its idf (inverse document frequency) computed in a large document corpus (see blocks 521, 523, and 529) and a pre-computed vector (see block 533) that can be found in vector dictionaries such as fasttext, word2vec, GloVe, Numberbatch, BPEmb.
  • a list of common tokens mined from the type dictionary may be used, the weight of which is further reduced (see processing blocks 525 and 531 of figure 5b) .
  • the vector-based matching scheme or engine 109 may implement a further processing branch (in figure 5b the lower processing branch) for determining embedding vectors for words in languages that require segmentation, such as Chinese, Japanese, Thai, Korean and the like.
  • This alternative processing branch includes concatenating the n-gram tokens without using spaces (see processing block 539 of figure 5b) , retrieving a vector for the concatenation (see processing block 541 of figure 5b) and finally taking the mean of the retrieved vector with the one computed using the weighted average approach (see processing block 549 of figure 5b) .
  • if no vector exists for the concatenation, this additional vector may be set to the zero vector (block 543) and the processing chain then proceeds to decide and return the weighted average vector (blocks 545 and 547) computed as described above.
  • figures 6a and 6b show schematic diagrams illustrating in more detail processing blocks implemented by the type matching apparatus 100 according to two different embodiments for determining the plurality of third scores using a sequence tagging model or engine 111.
  • the input of the processing chain illustrated in figure 6a is a single n-gram 101b and the probability distribution computed by the sequence tagger (see blocks 601 and 603) .
  • the output is a third score about the confidence that this n-gram represents a type (see processing block 611 of figure 6a) .
  • the score is computed as a weighted average of the probability that each individual token of that n-gram is believed to be a type or is believed to be of some other category.
  • the processing chain considers for each token the most likely category predicted by the sequence tagger (see processing block 607) and tokens annotated with type (T) are counted positive whereas tokens annotated with categories from a list of negative categories are counted negatively (see processing blocks 605 and 609 of figure 6a) .
  • the processing scheme illustrated in figure 6a is further refined by using a more complex logic.
  • the main difference with the processing scheme shown in figure 6a is that, for instance, the top-3 ranks of probabilities and category candidates are used in order to compute a score for the given n-gram based again on its individual tokens and the probability distribution computed by the sequence tagger (see blocks 621 and 623) .
  • the processing chain initially sets the third score to 0 (see block 625) and then iterates over the tokens of the n-gram (see blocks 627 and 631) counting positively tokens annotated as types in the first rank (blocks 633 and 635) , or annotated as types in the second rank (see blocks 637 and 639) , or annotated as types in the third rank but provided that neither in the first nor in the second rank they are annotated with one of the negative categories (see blocks 641 and 643) .
  • the scheme counts negatively tokens annotated with one of the negative categories in the first rank (see blocks 645 and 647) or annotated with one of the negative categories in the second rank (see blocks 649 and 651) .
  • the probability used to count positively or negatively in the computation of the third score may be scaled according to the rank that is used (see blocks 635, 639, 643, 647, and 651) . Categories of lower rank count less towards the overall third score, whereas categories of higher rank count more. When all tokens of the n-gram have been processed the third score is returned (see block 629) .
  • Figure 7 shows a variant of the processing scheme of figure 3 that may be implemented by the type matching apparatus 100 according to a further embodiment.
  • the main difference with respect to the processing scheme shown in figure 3 is that in addition to the term-based (block 709) and the vector-based schemes (block 715) a learning-based matching scheme is used (see processing block 711) to compute a fourth score.
  • this fourth score is considered by processing block 717 in order to compute a composite score.
  • Figure 8a shows a schematic diagram illustrating in more detail processing blocks implemented by the type matching apparatus 100 according to an embodiment for determining a plurality of fourth scores for the composite scores of figure 7 using a learning-based matching scheme.
  • the input of the processing chain is the list of n-grams, the list of types from the dictionary and the probability distribution as computed by the sequence tagger 111.
  • the processing chain illustrated in figure 8a retrieves the computed probability distribution (block 803) from the sequence tagger 111 (block 801) and uses it to compute a vector representation g for the input sentence (block 805) using a first neural network architecture.
  • This architecture can be any of those known in the art, such as, but not limited to, ConvNets, neural networks using weighted sums and non-linearities, sequence-to-sequence models, transformers and attention; particular examples of implementations are given next.
  • the processing chain illustrated in figure 8a computes embedding vectors for the n-grams (block 807) and embedding vectors for the plurality of types (blocks 811 and 809) using the procedure of figure 5b that has been described in detail further above. All vector representations (the sentence vector g computed in 805, the n-gram and the type) are concatenated and multiplied by a matrix (see block 813) that represents a second neural network.
  • the chain proceeds to compute the energy of the final vector as the L1 or L2 norm (block 815) , uses the energies as scores to rank the candidates (block 817) and finally returns the highest rank n-gram and candidate as the fourth score.
  • Figure 8b shows a schematic diagram illustrating a possible architecture of a first neural network for providing a vector representation from the probability distribution in processing block 805 of the learning-based matching scheme of figure 8a as implemented by the type matching apparatus 100 according to an embodiment.
  • the figure shows a particular implementation of a ConvNet network (see block 833) for computing a vector representation.
  • the probability distribution from the sequence tagger 111 is arranged as a matrix of dimensions ℓ×m, where ℓ is the length of the input sentence (block 835) and m is the number of categories.
  • This matrix is convolved with a plurality of filters of size k×m, also called a volume (block 837) and the output is flattened and further processed by a fully-connected layer (block 839) .
  • the final vector is returned to the further processing blocks of figure 8.
  • Figure 8c shows a schematic diagram illustrating the architecture of another possible implementation of a first neural network for providing a vector representation in processing block 805 of the learning-based matching scheme as implemented by the type matching apparatus 100 according to an embodiment.
  • This is a particular example of a weighted average and non-linearity network (block 843) .
  • the probabilities that each token belongs to each category are used to compute a weighted average which is then processed by a non-linear function such as, but not limited to, ReLU or leaky ReLU.
  • the final vector is returned to the further processing blocks of figure 8.
  • Figure 9 is a flow diagram illustrating the different steps of a method 900 for processing, in particular type matching, the text sample 101.
  • the method 900 comprises the steps of:
  • parsing 901 a plurality of tokens from the text sample 101, i.e. tokenizing the text sample 101;
  • generating a plurality of n-grams based on the plurality of tokens, wherein each n-gram comprises one or more of the plurality of tokens;
  • determining 905 for each of the plurality of n-grams using a term-based matching scheme, one or more first scores and one or more first candidate types from a pre-defined vocabulary of candidate types, wherein the one or more first scores are indicative of the semantic similarity between the n-gram and the one or more first candidate types;
  • determining 907 for each of the plurality of n-grams using a vector-based matching scheme, one or more second scores and one or more second candidate types from the pre-defined vocabulary of candidate types, wherein the one or more second scores are indicative of the semantic similarity between the n-gram and the one or more second candidate types;
  • the type matching method 900 may further comprise the step of returning as output one or more of the highest scoring n-grams and a list of the associated composite scores and candidate types.
  • the type matching method 900 may be performed by the type matching apparatus 100 described above. Thus, further features of the type matching method 900 result directly from the functionality of the type matching apparatus 100 and its different embodiments described above and below.
  • the disclosed system, apparatus, and method may be implemented in other manners.
  • the described embodiment of an apparatus is merely exemplary.
  • the unit division is merely logical function division and may be another division in an actual implementation.
  • a plurality of units or components may be combined or integrated into another system, or some features may be ignored or not performed.
  • the displayed or discussed mutual couplings or direct couplings or communication connections may be implemented by using some interfaces.
  • the indirect couplings or communication connections between the apparatuses or units may be implemented in electronic, mechanical, or other forms.
  • the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one position, or may be distributed on a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.
  • functional units in the embodiments of the invention may be integrated into one processing unit, or each of the units may exist alone physically, or two or more units may be integrated into one unit.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Probability & Statistics with Applications (AREA)
  • Machine Translation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

An apparatus (100) for processing a text sample (101) is disclosed. The apparatus (100) is configured to tokenize the text sample (101) and generate a plurality of n-grams based on the plurality of tokens. The apparatus (100) is further configured to determine for each n-gram, using a term-based matching scheme (107), one or more first scores and one or more first candidate types from a pre-defined vocabulary of candidate types and, using a vector-based matching scheme (109), one or more second scores and one or more second candidate types. Moreover, the apparatus (100) is configured to determine for each n-gram, using a sequence tagging model (111), a third score and to determine one or more composite scores based on the one or more first scores, the one or more second scores and the third score. The apparatus (100) is further configured to rank the plurality of n-grams based on the one or more composite scores and may output one or more of the highest scoring n-grams and a list of the associated composite scores and candidate types.

Description

Apparatus and method for type matching of a text sample
TECHNICAL FIELD
The present disclosure relates to information processing technology. More specifically, the present disclosure relates to an apparatus and method for type matching of a text sample.
BACKGROUND
Annotating text or text samples, such as documents, queries and the like with Knowledge Base resources like entities, relations, and types is an important step for many downstream tasks like semantic search and question answering. Although entity and relation linking have been extensively studied in the past, less attention has been spent on type linking and matching even though the problem is challenging in many domains.
SUMMARY
It is an objective of the present disclosure to provide an improved apparatus and method for type matching of a text sample.
The foregoing and other objectives are achieved by the subject matter of the independent claims. Further implementation forms are apparent from the dependent claims, the description and the figures.
According to a first aspect an apparatus for processing, in particular type matching a text sample is provided. The text sample may, for instance, be a user search query and the apparatus based on the type matching may return one or more query results in response to the search query.
The apparatus is configured to parse a plurality of tokens (also referred to as terms) from the text sample, i.e. tokenize the text sample, and to generate a plurality of n-grams based on the plurality of tokens, wherein each n-gram comprises one or more of the plurality of tokens.
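Purely as an illustration (not the patented implementation), generating all contiguous n-grams from a tokenized sample can be sketched in Python as follows; the helper name and the maximum n-gram length are assumptions:

```python
from typing import List, Tuple

def generate_ngrams(tokens: List[str], max_n: int = 3) -> List[Tuple[str, ...]]:
    """Return all contiguous n-grams of length 1..max_n over the token list."""
    ngrams = []
    for n in range(1, min(max_n, len(tokens)) + 1):
        for start in range(len(tokens) - n + 1):
            ngrams.append(tuple(tokens[start:start + n]))
    return ngrams

# Example with the query used later in the description: "London Bridge Spa"
tokens = "London Bridge Spa".split()   # a trivial whitespace split stands in for the tokenizer
print(generate_ngrams(tokens))
# [('London',), ('Bridge',), ('Spa',), ('London', 'Bridge'), ('Bridge', 'Spa'), ('London', 'Bridge', 'Spa')]
```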
The apparatus is further configured to determine for each of the plurality of n-grams, using a term-based matching scheme, one or more first scores and one or more first candidate types from a pre-defined vocabulary of candidate types, wherein the one or more first scores are indicative of a semantic similarity between the n-gram and the one or more first candidate types.
The apparatus is further configured to determine for each of the plurality of n-grams, using a vector-based matching scheme, one or more second scores and one or more second candidate types from the pre-defined vocabulary of candidate types, wherein (like the one or more first scores) the one or more second scores are indicative of a semantic similarity between the n-gram and the one or more second candidate types.
The apparatus is further configured to determine for each of the plurality of n-grams, using a sequence tagging model, a third score, wherein the third score is indicative of the degree to which each n-gram represents a type.
Moreover, the apparatus is configured to determine for each of the plurality of n-grams one or more composite scores based on the one or more first scores, the one or more second scores and the third score and to rank the plurality of n-grams based on the one or more composite scores. One or more of the highest scoring n-grams and a list of the associated composite scores and candidate types may be provided as the output of the apparatus.
In a further possible implementation form of the first aspect, the sequence tagging model defines for each of the plurality of tokens of the text sample a probability that the respective token is a type, i.e. belongs to the category TYPE.
In a further possible implementation form of the first aspect, the apparatus is configured to determine for each of the plurality of n-grams, using the sequence tagging model, the third score as a weighted average of the probabilities that each of the one or more tokens of the n-gram is a type, i.e. belongs to the category TYPE.
In a further possible implementation form of the first aspect, the apparatus is configured to determine the weighted average of the probabilities that each of the one or more tokens of the n-gram is a type, i.e. belongs to the category TYPE by using a negative weight for those of the one or more tokens of the n-gram that are a non-type, i.e. belong to a category other than TYPE.
In a further possible implementation form of the first aspect, the apparatus is configured to determine for each of the plurality of n-grams, using the sequence tagging model, the third  score based on those probabilities that each of the one or more tokens of the n-gram is a type, i.e. belongs to the category TYPE that exceed a threshold probability. For instance, the apparatus may be configured to determine the third score based on the probabilities of the three tokens having the largest probabilities.
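A minimal sketch of such a third-score computation, assuming the tagger output for each token is available as a dictionary mapping category labels to probabilities; the label names, the simple averaging and the set of negative categories are illustrative assumptions:

```python
NEGATIVE_CATEGORIES = {"ENTITY", "RELATION", "OTHER"}   # illustrative negative tag set

def third_score(ngram_token_probs, type_label="TYPE"):
    """Average of per-token TYPE probabilities; tokens tagged with negative categories count negatively.

    ngram_token_probs: list of dicts, one per token, mapping category -> probability.
    """
    total = 0.0
    for probs in ngram_token_probs:
        best_category = max(probs, key=probs.get)        # most likely category for this token
        if best_category == type_label:
            total += probs[type_label]                   # TYPE tokens count positively
        elif best_category in NEGATIVE_CATEGORIES:
            total -= probs[best_category]                # negative categories count negatively
    return total / len(ngram_token_probs)

# Example: a two-token n-gram where only the second token looks like a type mention
probs = [{"TYPE": 0.2, "ENTITY": 0.7, "OTHER": 0.1},
         {"TYPE": 0.9, "ENTITY": 0.05, "OTHER": 0.05}]
print(third_score(probs))   # (-0.7 + 0.9) / 2 = 0.1
```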
In a further possible implementation form of the first aspect, the one or more composite scores are one or more weighted sums of the one or more first scores, the one or more second scores and the third score of each of the plurality of n-grams.
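For illustration, a composite score of this kind and the subsequent ranking might be sketched as follows; the weights w1, w2 and w3 are placeholders, as the disclosure does not fix particular values:

```python
def composite_score(s1, s2, s3, w1=0.4, w2=0.4, w3=0.2):
    """Weighted sum of the term-based (s1), vector-based (s2) and tagging (s3) scores."""
    return w1 * s1 + w2 * s2 + w3 * s3

# candidates: (n-gram, candidate type, s1, s2, s3) tuples collected from the matchers
candidates = [
    (("Spa",), "FitnessClub", 0.8, 0.7, 0.9),
    (("London", "Bridge"), "Bridge", 0.6, 0.5, -0.2),
]
ranked = sorted(candidates, key=lambda c: composite_score(*c[2:]), reverse=True)
best_ngram, best_type, *scores = ranked[0]
print(best_ngram, best_type, composite_score(*scores))
```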
In a further possible implementation form of the first aspect, the apparatus is further configured to:
determine for each of the plurality of n-grams, using a learning-based matching scheme based on a feed-forward neural network, one or more fourth scores and one or more fourth candidate types from the pre-defined vocabulary of candidate types, wherein the one or more fourth scores are indicative of a semantic similarity between the n-gram and the one or more fourth candidate types; and
determine for each of the plurality of n-grams the one or more composite scores based on the one or more first scores, the one or more second scores, the third score and the one or more fourth scores, in particular as a weighted sum of these scores.
In a further possible implementation form of the first aspect, for determining the one or more fourth scores for each of the plurality of n-grams, using the learning-based matching scheme, the apparatus is configured to:
determine an embedding vector for the text sample from the probabilities of each of the one or more of its tokens using a first neural network;
determine an embedding vector for the n-gram and a plurality of further embedding vectors for the plurality of candidate types from the pre-defined vocabulary of candidate types; and
feed the embedding vector for the text sample, the embedding vector for the n-gram and the further plurality of embedding vectors into a second feed-forward neural network for determining the one or more fourth scores.
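The following NumPy sketch illustrates only the data flow of this learning-based scoring, using random, untrained weights; the pooling operation standing in for the first network, the dimensions and the use of the L2 norm as the energy are assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 50                                 # embedding dimensionality (assumed)

def sentence_vector(tag_probs: np.ndarray, W: np.ndarray) -> np.ndarray:
    """First network: map the (tokens x categories) probability matrix to a sentence vector."""
    pooled = tag_probs.mean(axis=0)    # simple pooling stands in for the ConvNet / weighted-average variants
    return np.maximum(W @ pooled, 0)   # ReLU non-linearity

def fourth_score(sent_vec, ngram_vec, type_vec, M: np.ndarray) -> float:
    """Second network: concatenate all representations, project, and take the norm as the energy/score."""
    combined = np.concatenate([sent_vec, ngram_vec, type_vec])
    return float(np.linalg.norm(M @ combined))

tag_probs = rng.random((3, 5))         # 3 tokens, 5 categories
W = rng.standard_normal((d, 5))
M = rng.standard_normal((d, 3 * d))
ngram_vec, type_vec = rng.standard_normal(d), rng.standard_normal(d)

print(fourth_score(sentence_vector(tag_probs, W), ngram_vec, type_vec, M))
```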
In a further possible implementation form of the first aspect, for determining the one or more first scores for each of the plurality of n-grams using the term-based matching scheme the apparatus is further configured to normalize the one or more first scores for each of the n-grams based on the number of tokens of the respective n-gram.
In a further possible implementation form of the first aspect, for determining the one or more second scores for each of the plurality of n-grams using the vector-based matching scheme the apparatus is configured to:
determine an embedding vector for the n-gram;
determine a plurality of similarity scores, e.g. the cosine similarity scores, between the embedding vector for the n-gram and a plurality of further embedding vectors, wherein the  plurality of further embedding vectors are based on the plurality of candidate types from the pre-defined dictionary of candidate types; and
determine one or more of the largest similarity scores as the one or more second scores for the respective n-gram.
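A small sketch of this vector-based matching; the toy three-dimensional vectors stand in for real n-gram and type embeddings:

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def vector_based_match(ngram_vec: np.ndarray, type_vectors: dict, top_k: int = 2):
    """Return the top_k candidate types by cosine similarity as (type, second score) pairs."""
    scored = [(t, cosine(ngram_vec, v)) for t, v in type_vectors.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_k]

ngram_vec = np.array([0.9, 0.1, 0.0])
type_vectors = {
    "FitnessClub": np.array([0.8, 0.2, 0.1]),
    "Restaurant":  np.array([0.1, 0.9, 0.0]),
    "Bridge":      np.array([0.0, 0.1, 0.9]),
}
print(vector_based_match(ngram_vec, type_vectors))
```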
In a further possible implementation form of the first aspect, for each n-gram the apparatus is configured to determine the embedding vector for the n-gram as a weighted sum of a plurality of token vectors, wherein each token vector is associated with one of the one or more tokens of the n-gram.
In a further possible implementation form of the first aspect, the text sample is a non-segmented text sample, i.e. in a non-segmented language, such as Chinese, Japanese, Thai and the like, wherein the apparatus is further configured to segment the text sample into the plurality of tokens.
In a further possible implementation form of the first aspect, the apparatus is further configured to:
concatenate the tokens of the n-gram for obtaining a composite token comprising a continuous string of characters;
retrieve a composite vector associated with the composite token from a dictionary of vectors; and
determine the embedding vector as the average vector of the composite vector and the weighted sum of the plurality of token vectors.
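Taken together, the n-gram embedding could be sketched as below; the IDF weights, the vector dictionaries and the token names are illustrative, and the optional averaging with a vector of the concatenated tokens follows the handling of segmented languages described above:

```python
import numpy as np

def ngram_embedding(tokens, token_vectors, idf, concat_vectors=None):
    """IDF-weighted sum of token vectors, optionally averaged with a vector for the concatenated token."""
    weights = np.array([idf.get(t, 1.0) for t in tokens])
    vecs = np.stack([token_vectors[t] for t in tokens])
    weighted = (weights[:, None] * vecs).sum(axis=0) / weights.sum()

    if concat_vectors is not None:
        composite = "".join(tokens)                      # continuous string, e.g. for Chinese/Japanese/Thai input
        if composite in concat_vectors:
            return (weighted + concat_vectors[composite]) / 2.0
    return weighted

token_vectors = {"health": np.array([1.0, 0.0]), "club": np.array([0.0, 1.0])}
idf = {"health": 2.0, "club": 1.0}
print(ngram_embedding(["health", "club"], token_vectors, idf))   # approximately [0.667, 0.333]
```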
In a further possible implementation form of the first aspect, the text sample is a user search query, wherein the apparatus is further configured to provide one or more query results based on one or more of the plurality of n-grams ranked based on the one or more composite scores.
According to a second aspect a method for processing, in particular type matching a text sample is provided. The method comprises the steps of:
parsing a plurality of tokens from the text sample, i.e. tokenizing the text sample;
generating a plurality of n-grams based on the plurality of tokens, wherein each n-gram comprises one or more of the plurality of tokens;
determining for each of the plurality of n-grams, using a term-based matching scheme, one or more first scores and one or more first candidate types from a pre-defined vocabulary of candidate types, wherein the one or more first scores are indicative of a semantic similarity between the n-gram and the one or more first candidate types;
determining for each of the plurality of n-grams, using a vector-based matching scheme, one or more second scores and one or more second candidate types from the pre-defined vocabulary of candidate types, wherein the one or more second scores are indicative of a semantic similarity between the n-gram and the one or more second candidate types;
determining for each of the plurality of n-grams, using a sequence tagging model, a third score, wherein the third score is indicative of the degree to which each n-gram represents a type;
determining for each of the plurality of n-grams one or more composite scores based on the one or more first scores, the one or more second scores and the third score; and
ranking the plurality of n-grams based on the one or more composite scores.
The method according to the second aspect may further comprise the step of returning as output one or more of the highest scoring n-grams and a list of the associated composite scores and candidate types.
The type matching method according to the second aspect can be performed by the type matching apparatus according to the first aspect. Thus, further features of the type matching method according to the second aspect result directly from the functionality of the type matching apparatus according to the first aspect as well as its different implementation forms and embodiments described above and below.
According to a third aspect a computer program product is provided, comprising a non-transitory computer-readable storage medium for storing program code which causes a computer or a processor to perform the type matching method according to the second aspect, when the program code is executed by the computer or the processor.
Details of one or more embodiments are set forth in the accompanying drawings and the description below. Other features, objects, and advantages will be apparent from the description, drawings, and claims.
BRIEF DESCRIPTION OF THE DRAWINGS
In the following, embodiments of the present disclosure are described in more detail with reference to the attached figures and drawings, in which:
Fig. 1 shows a schematic diagram of an apparatus for type matching of a text sample according to an embodiment;
Fig. 2 shows a schematic diagram illustrating processing blocks implemented by an embedding engine of a type matching apparatus according to an embodiment;
Fig. 3 shows a schematic diagram illustrating processing blocks implemented by a type matching apparatus according to an embodiment for determining one or more composite scores;
Fig. 4 shows a schematic diagram illustrating in more detail processing blocks implemented by a type matching apparatus according to an embodiment for determining a plurality of first scores for the composite scores of figure 3 using a term-based matching scheme;
Fig. 5a and 5b show schematic diagrams illustrating in more detail processing blocks implemented by a type matching apparatus according to an embodiment for determining a plurality of second scores for the composite scores of figure 3 using a vector-based matching scheme;
Fig. 6a and 6b show schematic diagrams illustrating in more detail processing blocks implemented by a type matching apparatus according to two different embodiments for determining a plurality of third scores for the composite scores of figure 3 using a sequence tagging model;
Fig. 7 shows a schematic diagram illustrating processing blocks implemented by a type matching apparatus according to a further embodiment for determining one or more composite scores;
Fig. 8a shows a schematic diagram illustrating in more detail processing blocks implemented by a type matching apparatus according to an embodiment for determining a plurality of fourth scores for the composite scores of figure 7 using a learning-based matching scheme;
Fig. 8b shows a schematic diagram illustrating the architecture of a neural network for providing a vector representation that is used by the learning-based matching scheme of figure 8a as implemented by a type matching apparatus according to an embodiment; and
Fig. 8c shows a schematic diagram illustrating the architecture of an alternative neural network for providing a vector representation that is used by the learning-based matching  scheme of figure 8a as implemented by a type matching apparatus according to an embodiment; and
Fig. 9 is a flow diagram illustrating a type matching method according to an embodiment.
In the following, identical reference signs refer to identical or at least functionally equivalent features.
DETAILED DESCRIPTION OF THE EMBODIMENTS
In the following description, reference is made to the accompanying figures, which form part of the disclosure, and which show, by way of illustration, specific aspects of embodiments of the present disclosure or specific aspects in which embodiments of the present disclosure may be used. It is understood that embodiments of the present disclosure may be used in other aspects and comprise structural or logical changes not depicted in the figures. The following detailed description, therefore, is not to be taken in a limiting sense, and the scope of the present disclosure is defined by the appended claims.
For instance, it is to be understood that a disclosure in connection with a described method may also hold true for a corresponding device or system configured to perform the method and vice versa. For example, if one or a plurality of specific method steps are described, a corresponding device may include one or a plurality of units, e.g. functional units, to perform the described one or plurality of method steps (e.g. one unit performing the one or plurality of steps, or a plurality of units each performing one or more of the plurality of steps) , even if such one or more units are not explicitly described or illustrated  in the figures. On the other hand, for example, if a specific apparatus is described based on one or a plurality of units, e.g. functional units, a corresponding method may include one step to perform the functionality of the one or plurality of units (e.g. one step performing the functionality of the one or plurality of units, or a plurality of steps each performing the functionality of one or more of the plurality of units) , even if such one or plurality of steps are not explicitly described or illustrated in the figures. Further, it is understood that the features of the various exemplary embodiments and/or aspects described herein may be combined with each other, unless specifically noted otherwise.
Figure 1 shows a schematic diagram of an apparatus 100 for type matching of a text sample according to an embodiment. In an embodiment, the apparatus 100 may be implemented as a cloud server 100 configured to receive a text sample 101 via the internet. In the embodiment shown in figure 1, the text sample 101 is a search query 101, namely the exemplary search query “London Bridge Spa” 101 indicating that a user is searching for a “Spa” near “London Bridge” .
As illustrated in figure 1, the type matching apparatus 100 may implement in software and/or hardware a tokenizer engine 103, an embedding interface 105, a term-based matching scheme or engine 107 (referred to as “termTypeMatcher” 107 in figure 1) , a vector-based matching scheme or engine 109 (referred to as “vecTypeMatcher” 109 in figure 1) , a sequence tagging model or engine 111 (referred to as “QueryTagger” in figure 1) , a learning-based matching scheme or engine 115 (referred to as “learningTypeMatcher” 115 in figure 1) and/or a ranking engine 113 (referred to as “Ranker” in figure 1) . For implementing one or more of these components, which will be described in more detail below, in software and/or hardware the apparatus 100 may  comprise, for instance, one or more processors for processing data, a communication interface for receiving and transmitting data, and a memory for storing data. The one or more processors of the apparatus 100 may comprise digital circuitry, or both analog and digital circuitry. Digital circuitry may comprise components such as application-specific integrated circuits (ASICs) , field-programmable arrays (FPGAs) , digital signal processors (DSPs) , or general-purpose processors. The memory may be configured to store executable program code which, when executed by the one or more processors, causes the type matching apparatus 100 to perform the functions and methods described herein.
Figure 2 shows a schematic diagram illustrating processing blocks implemented by the embedding engine 105 of the apparatus 100 of figure 1 according to an embodiment. Generally, the embedding engine 105 is configured to generate, based on an input such as an n-gram, an embedding vector of the n-gram. In the example shown in figure 2 the n-gram consists of two tokens. The upper processing branch in figure 2 may implement a conventional approach for determining sentence embeddings, which consists of retrieving weights, in particular IDFs, for each token (processing block 201), retrieving vectors for each token (processing block 207) and computing the weighted average (processing block 209). In the embodiment shown in figure 2, however, a processing block 203 is provided that is configured to further penalize some words that are frequent in the pre-defined dictionary. The lower processing branch in figure 2 allows handling of an n-gram from a text sample 101 in a language that requires segmentation, such as Chinese, Japanese, Thai, Korean and the like. In a processing block 205 the n-gram tokens are concatenated again and, if a vector for the concatenation exists, then in a processing block 211 the mean of this vector and the vector computed in the upper processing branch of figure 2 may be computed. This approach improves the accuracy in many languages that require segmentation. Moreover, it reduces the space of queries to be evaluated and results to be considered and hence also the execution time.
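By way of illustration only, the following Python sketch shows one possible realization of the two processing branches of figure 2; it is a minimal sketch under stated assumptions (the IDF table, the penalty factor for dictionary-frequent words, the vector lookup and the handling of out-of-vocabulary tokens are hypothetical placeholders and not taken from the present disclosure):

```python
import numpy as np

def embed_ngram(tokens, word_vectors, idf, frequent_in_dictionary,
                penalty=0.5, requires_segmentation=False):
    """Minimal sketch of the embedding engine of figure 2 (assumptions noted below).

    word_vectors: dict mapping a token (or concatenated string) to a NumPy vector.
    idf: dict mapping a token to its inverse document frequency.
    frequent_in_dictionary: set of words frequent in the pre-defined type dictionary.
    """
    # Upper branch (blocks 201, 207, 209): IDF-weighted average of token vectors,
    # with an extra penalty for words frequent in the type dictionary (block 203).
    vecs, weights = [], []
    for t in tokens:
        if t not in word_vectors:
            continue  # assumption: out-of-vocabulary tokens are simply skipped
        w = idf.get(t, 1.0)
        if t in frequent_in_dictionary:
            w *= penalty  # assumed penalization factor
        vecs.append(word_vectors[t])
        weights.append(w)
    if not vecs:
        return None
    weighted_avg = np.average(np.stack(vecs), axis=0, weights=weights)

    # Lower branch (blocks 205, 211): for languages requiring segmentation,
    # concatenate the tokens and, if a vector exists for the concatenation,
    # average it with the weighted-average vector.
    if requires_segmentation:
        concat = "".join(tokens)
        if concat in word_vectors:
            return (weighted_avg + word_vectors[concat]) / 2.0
    return weighted_avg
```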
As will be described in more detail below under further reference to the figures 3, 4, 5a, 5b and 6a, 6b, the apparatus 100, for instance by means of the tokenizer engine 103, is configured to parse a plurality of tokens (also referred to as terms) from the text sample 101, i.e. to tokenize the text sample 101 (see also processing step 305 in figure 3), and to generate a plurality of n-grams based on the plurality of tokens (see also processing step 307 in figure 3), wherein each n-gram comprises one or more of the plurality of tokens. In an embodiment, the tokenizer engine 103 may implement standard segmentation techniques (such as PyThaiNLP, Jieba, CoreNLP, JapaneseTokenizer) and/or the known “TokTok” tokenizer.
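A minimal sketch of this tokenization and n-gram generation (processing steps 305 and 307) is given below; a plain whitespace split stands in for the TokTok tokenizer and the language-specific segmentation engines, and the maximum n-gram length is an assumed parameter:

```python
def generate_ngrams(text, max_n=3):
    """Sketch of processing steps 305/307: tokenize and enumerate n-grams.

    A whitespace split is used here only for illustration; max_n is an assumed limit.
    """
    tokens = text.split()
    ngrams = []
    for n in range(1, min(max_n, len(tokens)) + 1):
        for i in range(len(tokens) - n + 1):
            ngrams.append(tokens[i:i + n])
    return tokens, ngrams

tokens, ngrams = generate_ngrams("London Bridge Spa")
# ngrams includes ["London"], ["Bridge"], ["Spa"], ["London", "Bridge"],
# ["Bridge", "Spa"] and ["London", "Bridge", "Spa"]
```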
The apparatus 100 is further configured to determine, using the term-based matching scheme or engine 107, for each of the plurality of n-grams provided by the tokenizer engine 103 a plurality of first scores and an associated plurality of first candidate types from a pre-defined vocabulary of candidate types, as illustrated by processing step 309 in figure 3, which will be described in more detail in the context of figure 4 below. Each first score of the plurality of first scores is indicative of the respective semantic similarity between the n-gram and the respective first candidate type of the plurality of first candidate types.
As will be described in more detail below in the context of figure 4, in an embodiment, the term-based matching scheme or engine 107 is configured to link each type in the given pre-defined vocabulary of candidate types to an article in a Knowledge Base, such as Wikipedia, that closely describes its meaning. For example, for the type “FitnessClub” the Wikipedia article https://en.wikipedia.org/wiki/Health_club may be used. In this way the Wikipedia article provides synonyms and other related words relevant to the type, like “aerobics”, “cycling (spinning)”, “boxing”, “step yoga”, “regular yoga and hot (Bikram) yoga”, “pilates”, that a user can use at query time to refer to that particular type. Then, the input query n-grams are treated by the term-based matching scheme or engine 107 as queries issued over the index in order to link them to actual types. In an embodiment, the term-based matching scheme or engine 107 may score these n-grams according to known ranking models like BM25 or MLM for providing the plurality of first scores. In order to alleviate a bias to give higher scores to longer n-grams, in an embodiment, the term-based matching scheme or engine 107 may be further configured to normalize the plurality of first scores. In an embodiment, the term-based matching scheme or engine 107 may implement a normalization of the plurality of first scores on the basis of the following equations:
[The normalization equations appear only as embedded images (PCTCN2021074859-appb-000001 to -000003) in the original filing], where Q_n is an n-gram, the third image denotes a candidate type, len() is a function returning the length of an n-gram, and ngr() is a function returning the set of all n-grams of a query.
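Since the normalization equations are not reproduced in text form, the following sketch shows only one plausible normalization consistent with the surrounding description (division by the maximum score over all n-grams of the query, combined with a length factor); the exact formula used in the embodiment may differ:

```python
def normalize_first_scores(raw_scores, query_ngrams):
    """Hypothetical normalization of full-text (e.g. BM25/MLM) scores.

    raw_scores: dict mapping (ngram, candidate_type) -> raw score, where each
    ngram is a tuple of tokens.
    query_ngrams: the set ngr(Q) of all n-grams of the query (tuples of tokens).
    The particular combination of max-score and length normalization below is an
    assumption; the patent only states that both quantities are used.
    """
    if not raw_scores:
        return {}
    query_len = max(len(ng) for ng in query_ngrams)
    max_score = max(raw_scores.values())
    normalized = {}
    for (ngram, ctype), s in raw_scores.items():
        length_factor = len(ngram) / query_len  # compensate the length bias
        normalized[(ngram, ctype)] = (s / max_score) * length_factor
    return normalized
```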
The apparatus 100 is further configured to determine, using the vector-based matching scheme or engine 109, for each of the plurality of n-grams provided by the tokenizer engine 103 a plurality of second scores and an associated plurality of second candidate types from the pre-defined vocabulary of candidate types, as illustrated by processing step 311 in figure 3, which will be described in more detail in the context of figures 5a and 5b below. Like the plurality of first scores, each score of the plurality of second scores is indicative of the respective semantic similarity between the n-gram and the respective second candidate type of the plurality of second candidate types.
The apparatus 100 is further configured to determine, using the sequence tagging model or engine 111, for each of the plurality of tokens of the input text sample 101 provided by the tokenizer engine 103 a probability distribution over a given set of categories. In an embodiment, the apparatus 100 may implement a conventional sequence tagging model or engine 111, such as Conditional Random Fields (CRFs), RNNs (LSTMs or GRUs) or Transformer-based sequence tagging, for providing the probability distribution for each token. In an embodiment, the sequence tagging model or engine 111 is configured to determine, for a given set of categories and the text sample Q 101, the probability distribution on the basis of the following (soft-max) equation:
[The soft-max equation appears only as an embedded image (PCTCN2021074859-appb-000004) in the original filing], where t_i is a token in the tokenized text sample, Q is the text sample, c_k is a particular category, h_i is a vector computed by any of the conventional sequence tagging models mentioned above, and exp is the natural exponential function.
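As the soft-max equation itself is not reproduced in text form, the sketch below merely shows the generic soft-max over per-category scores that the surrounding text suggests, assuming h_i is a vector of category scores produced by the tagger for token t_i:

```python
import numpy as np

def category_distribution(h_i):
    """Assumed soft-max over category scores h_i for one token t_i.

    h_i: 1-D array of length m (number of categories) produced by a CRF, RNN or
    Transformer-based tagger; the exact parameterization inside the tagger is not
    specified here and is an assumption of this sketch.
    """
    z = np.exp(h_i - np.max(h_i))  # subtract the maximum for numerical stability
    return z / z.sum()
```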
Moreover, the apparatus 100 is configured to determine for each of the plurality of n-grams a respective third score based on the probability distributions of all the tokens provided by the sequence tagging model or engine 111, wherein the third score is indicative of the degree to which each n-gram represents a type, and then a plurality of composite scores based on the plurality of first scores, the plurality of second scores and the third score, as illustrated in processing step 315 of figure 3. In the embodiment illustrated in figure 3, the apparatus 100 is configured to determine the plurality of composite scores as a respective weighted sum of a respective first score, a respective second score and a respective third score of each of the plurality of n-grams. In an embodiment, the plurality of composite scores, i.e. the final scores, may be computed by the ranker engine 113.
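A minimal sketch of such a weighted-sum combination is given below; the weight values are purely illustrative assumptions, as the embodiment only specifies that a weighted sum is used:

```python
def composite_score(s1, s2, s3, w1=0.4, w2=0.4, w3=0.2):
    """Weighted sum of the first, second and third scores (processing step 315).

    The weights w1, w2 and w3 are illustrative assumptions only.
    """
    return w1 * s1 + w2 * s2 + w3 * s3
```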
Furthermore, the apparatus 100 is configured, using the ranker engine 113, to rank the plurality of n-grams based on the plurality of composite scores (or final scores) , as illustrated in processing block 317 of figure 3. Thus, as described above, the ranker engine 113 is configured to perform a ranking based on a combination of the probability scores, i.e. third scores provided by the sequence tagging model or engine 111 and the first and second scores provided by the term-based matching scheme or engine 107 and the vector-based matching scheme or engine 109, respectively.
In an embodiment that will be described in more detail below in the context of figure 6a, the ranker engine 113 may be configured to use a mixture model such that, given an n-gram [n1 n2 … nm], the ranker engine 113 sums the probabilities of the tokens that are annotated as [TYPE], i.e. belong to the category [TYPE], and subtracts the probabilities of tokens that are annotated using a negative set of tags. In a further embodiment that will be described in more detail below in the context of figure 6b, the ranker engine 113 may be configured to check the probabilities of the annotations for each token in the 2nd and 3rd ranks of the probability distributions provided by the sequence tagging model or engine 111.
Once the plurality of n-grams have been ranked by the ranker engine 113 , one or more of the highest scoring n-grams and a list of the associated composite scores, i.e. final scores and candidate types may be provided as the output of the apparatus 100, as illustrated in processing block 319 of figure 3.
As already described above, figure 3 shows a schematic diagram illustrating processing blocks implemented by the type matching apparatus 100 according to an embodiment. As already mentioned, the input text sample 101 may be a user question or query and the output may be a ranked list of triples of candidate mentions and types with a confidence score of the form <ni, ti, si>. Figure 3 depicts a hybrid pipeline for type detection and linking, i.e. type matching, which is configured to detect which part (n-gram) of the input text sample 101 corresponds to a type and also to match that part to the most related dictionary type. As already described above, the processing chain illustrated in figure 3 first checks what the language of the input text sample is (processing block 301). If the language is one from the list of languages requiring segmentation, it calls an external segmentation engine (processing block 303) suitable for the particular language, such as PyThaiNLP (Thai), CoreNLP/Jieba (Chinese), or JapaneseTokenizer (Japanese). The processing chain then proceeds to further tokenize the sentence (processing block 305) using tokenizers known in the art such as TokTok and computes the n-grams of the tokenized text sample (processing block 307). As already described above, the processing chain then makes use of the term-based matching scheme or engine 107 (an implementation of which will be described in more detail in the context of figure 4) and the vector-based matching scheme or engine 109 (an implementation of which will be described in more detail in the context of figures 5a and 5b) for detection and matching, which assign first and second scores s1 and s2 (respectively) to each candidate mention and type. Moreover, as already described above, the sequence tagging model or engine 111 (processing block 313) is used to compute a probability distribution for each of the tokens. Subsequently, the processing chain illustrated in figure 3 computes, using the probability distribution, the third score, i.e. the tagging score s3 for each n-gram, and a composite score from the first, second, and third score (processing block 315).
As already described above, figure 4 shows a schematic diagram illustrating in more detail processing blocks implemented by the type matching apparatus 100 according to an embodiment for determining the plurality of first scores using the term-based matching scheme or engine 107. The input in figure 4 is a list of n-grams 101a computed during the execution of the processing blocks of figure 3 and the output is a list of triples of candidate mentions and types with a confidence, i.e. first score of the form <ni, ti, si>. In the embodiment shown in figure 4, the term-based matching scheme or engine 107 is configured to use an inverted index (processing block 405) which associates a type ti to at least one external document (e.g., a Wikipedia document) that best describes the meaning of the type. The association may be performed using a semi-automated mapping or alignment algorithm (processing block 407) between the external document corpus (block 409) and the type dictionary. Each n-gram ni is used as a free text query in order to retrieve candidate types ti for ni (processing block 402) . Each retrieved candidate is associated with a ranking score computed by the full-text engine (processing block 403) (common ranking models include, but are not limited to BM25, MLM, and more) . The first scores returned by the full text engine may be normalized according to the maximum  score returned and the length of each n-gram (see  processing blocks  411 and 413 of figure 4) . Finally, the candidate triples are ranked and returned to the further processing blocks of figure 3 (see  processing blocks  415 and 417 of figure 4) .
As already described above, figures 5a and 5b show schematic diagrams illustrating in more detail processing blocks implemented by the type matching apparatus 100 according to an embodiment for determining the plurality of second scores using the vector-based matching scheme or engine 109. The input of the processing chain illustrated in figure 5a is a list of n-grams 101a computed during the execution of the processing blocks of figure 3 and the output is a list of triples of candidate mentions and types with a confidence, i.e. second score of the form <ni, ti, si>. For each input n-gram the processing block 501 calls the processing blocks shown in figure 5b to compute a respective embedding vector for each n-gram. Furthermore, in a pre-processing step the processing chain shown in figure 5a also loads and computes a vector embedding for each type in the pre-defined dictionary (see  processing blocks  503 and 505 of figure 5a) . The processing proceeds by computing the cosine similarity between each n-gram vector and each type-vector in processing block 507. The cosine similarity also assigns a similarity (confidence) score between each pair. The algorithm finally ranks the candidate triples and returns them to the processing blocks of figure 3 (see  processing blocks  509 and 511 of figure 5a) .
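The following sketch illustrates the cosine-similarity matching of figure 5a in simplified form; the top-k cut-off and the dictionary structures are assumptions made for the example only:

```python
import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def vector_based_match(ngram_vectors, type_vectors, top_k=5):
    """Sketch of figure 5a: score every (n-gram, type) pair by cosine similarity.

    ngram_vectors: dict mapping an n-gram (tuple of tokens) -> embedding vector.
    type_vectors:  dict mapping a dictionary type -> embedding vector.
    Returns the top_k triples <n-gram, type, second score>; top_k is an assumption.
    """
    triples = []
    for ng, nv in ngram_vectors.items():
        for t, tv in type_vectors.items():
            triples.append((ng, t, cosine(nv, tv)))
    triples.sort(key=lambda x: x[2], reverse=True)
    return triples[:top_k]
```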
The input of the processing chain illustrated in figure 5b is an n-gram 101b consisting of a list of tokens and the output is an embedding of the n-gram into a word vector space, i.e. an embedding vector. As already described above, in the case of a text sample 101 in a segmented language, the upper processing branch of figure 5b uses a weighted average formula to compute the embedding vector (see in particular processing block 535 of figure 5b). Such a weighted average can be computed using, for each token of the n-gram, its idf (inverse document frequency) computed on a large document corpus (see blocks 521, 523, and 529) and a pre-computed vector (see block 533) that can be found in vector dictionaries such as fasttext, word2vec, GloVe, Numberbatch or BPEmb. However, a list of common tokens mined from the type dictionary (see blocks 525 and 527) may be used, for which the weight is further reduced (see processing blocks 525 and 531 of figure 5b). As already described above, the vector-based matching scheme or engine 109 may implement a further processing branch (in figure 5b the lower processing branch) for determining embedding vectors for words in languages that require segmentation, such as Chinese, Japanese, Thai, Korean and the like. This alternative processing branch includes concatenating the n-gram tokens without using spaces (see processing block 539 of figure 5b), retrieving a vector for the concatenation (see processing block 541 of figure 5b) and finally taking the mean of the retrieved vector and the one computed using the weighted average approach (see processing block 549 of figure 5b). In languages not requiring segmentation this additional vector may be set to the zero vector (block 543), so that the processing chain then proceeds to decide and return the weighted average vector (blocks 545 and 547) computed as described above.
As already described above, figures 6a and 6b show schematic diagrams illustrating in more detail processing blocks implemented by the type matching apparatus 100 according to two different embodiments for determining the plurality of third scores using a sequence tagging model or engine 111. The input of the processing chain illustrated in figure 6a is a single n-gram 101b and the probability distribution computed by the sequence tagger (see blocks 601 and 603). The output is a third score indicating the confidence that this n-gram represents a type (see processing block 611 of figure 6a). The score is computed as a weighted average of the probabilities that each individual token of that n-gram is believed to be a type or is believed to be of some other category. The processing chain considers for each token the most likely category predicted by the sequence tagger (see processing block 607); tokens annotated with type (T) are counted positively, whereas tokens annotated with categories from a list of negative categories are counted negatively (see processing blocks 605 and 609 of figure 6a).
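A simplified sketch of this computation is shown below; the tag names "TYPE", "ENTITY" and "OTHER", as well as the plain averaging at the end, are illustrative assumptions:

```python
def third_score_simple(token_distributions, categories,
                       type_tag="TYPE", negative_tags=("ENTITY", "OTHER")):
    """Sketch of figure 6a: score an n-gram from its tokens' top-1 categories.

    token_distributions: one probability vector per token of the n-gram.
    categories: list of category names aligned with the probability vectors.
    """
    score = 0.0
    for dist in token_distributions:
        best = max(range(len(categories)), key=lambda k: dist[k])
        if categories[best] == type_tag:
            score += dist[best]   # tokens tagged as a type count positively
        elif categories[best] in negative_tags:
            score -= dist[best]   # negatively tagged tokens are subtracted
    return score / max(len(token_distributions), 1)
```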
In one embodiment, depicted in figure 6b, the processing scheme illustrated in figure 6a is further refined by using more complex logic. The main difference with respect to the processing scheme shown in figure 6a is that, for instance, the top-3 ranks of probabilities and category candidates are used in order to compute a score for the given n-gram, based again on its individual tokens and the probability distribution computed by the sequence tagger (see blocks 621 and 623). The processing chain initially sets the third score to 0 (see block 625) and then iterates over the tokens of the n-gram (see blocks 627 and 631), counting positively tokens annotated as types in the first rank (blocks 633 and 635), or annotated as types in the second rank (see blocks 637 and 639), or annotated as types in the third rank provided that neither in the first nor in the second rank they are annotated with one of the negative categories (see blocks 641 and 643). In addition, the scheme counts negatively tokens annotated with one of the negative categories in the first rank (see blocks 645 and 647) or annotated with one of the negative categories in the second rank (see blocks 649 and 651). In any of the preceding steps the probability used to count positively or negatively in the computation of the third score may be scaled according to the rank that is used (see blocks 635, 639, 643, 647, and 651). Categories of lower rank count less in the overall third score, whereas categories of higher rank count more. When all tokens of the n-gram have been processed the third score is returned (see block 629).
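The refined top-3 logic may be sketched as follows; the rank scaling factors and the tag names are illustrative assumptions, the embodiment only requiring that lower ranks contribute less:

```python
def third_score_top3(token_distributions, categories,
                     type_tag="TYPE", negative_tags=("ENTITY", "OTHER"),
                     rank_weights=(1.0, 0.5, 0.25)):
    """Sketch of figure 6b: use the top-3 ranked categories of each token."""
    score = 0.0
    for dist in token_distributions:
        ranked = sorted(range(len(categories)), key=lambda k: dist[k], reverse=True)
        top3 = ranked[:3]
        cats = [categories[k] for k in top3]
        if cats[0] == type_tag:                      # type in the first rank
            score += rank_weights[0] * dist[top3[0]]
        elif len(cats) > 1 and cats[1] == type_tag:  # type in the second rank
            score += rank_weights[1] * dist[top3[1]]
        elif (len(cats) > 2 and cats[2] == type_tag  # type in the third rank, but only
              and cats[0] not in negative_tags       # if ranks 1 and 2 are not negative
              and cats[1] not in negative_tags):
            score += rank_weights[2] * dist[top3[2]]
        elif cats[0] in negative_tags:               # negative category in the first rank
            score -= rank_weights[0] * dist[top3[0]]
        elif len(cats) > 1 and cats[1] in negative_tags:  # negative in the second rank
            score -= rank_weights[1] * dist[top3[1]]
    return score
```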
Figure 7 shows a variant of the processing scheme of figure 3 that may be implemented by the type matching apparatus 100 according to a further embodiment. The main difference with respect to the processing scheme shown in figure 3 is that, in addition to the term-based (block 709) and the vector-based (block 715) schemes, a learning-based matching scheme is used (see processing block 711) to compute a fourth score. In contrast to figure 3, this fourth score is also considered by processing block 717 in order to compute a composite score. Otherwise, the rest of the processing blocks and the processing chain are the same as in figure 3, and are briefly summarized as checking if the input language requires segmentation (block 701) and if necessary segmenting it (block 703), tokenizing the input text sample (block 705), computing the n-grams (block 707), computing the first score (block 709), the second score (block 715) and the probability distribution (block 713), computing a composite score from all scores (block 717), ranking the composite scores of the n-grams (block 719) and returning the top ranked n-grams (block 721).
Figure 8a shows a schematic diagram illustrating in more detail processing blocks implemented by the type matching apparatus 100 according to an embodiment for determining a plurality of fourth scores for the composite scores of figure 7 using a learning-based matching scheme. The input of the processing chain is the list of n-grams, the list of types from the dictionary and the probability distribution as computed by the sequence tagger 111. The processing chain illustrated in figure 8a retrieves the computed probability distribution (block 803) from the sequence tagger 111 (block 801) and uses it to compute a vector representation g for the input sentence (block 805) using a first neural network architecture. This architecture can be any of those known in the art, such as, but not limited to, ConvNets, neural networks using weighted sums and non-linearities, sequence-to-sequence models, transformers or attention mechanisms; particular example implementations are given next. In addition, the processing chain illustrated in figure 8a computes embedding vectors for the n-grams (block 807) and embedding vectors for the plurality of types (blocks 811 and 809) using the procedure of figure 5b that has been described in detail further above. All vector representations (the sentence vector g computed in 805, the n-gram vector and the type vector) are concatenated and multiplied by a matrix (see block 813) that represents a second neural network. The chain proceeds to compute the energy of the final vector as its L1 or L2 norm (block 815), uses the energies as scores to rank the candidates (block 817) and finally returns the highest ranked n-gram and candidate together with the fourth score.
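A minimal sketch of the concatenation, second-network multiplication and energy computation (blocks 813 to 815) is given below; the matrix W standing in for the second neural network and its dimensions are assumptions, and the sentence vector g is assumed to come from one of the first-network sketches discussed with figures 8b and 8c below:

```python
import numpy as np

def learning_based_score(sentence_repr, ngram_vec, type_vec, W, norm="l2"):
    """Sketch of blocks 813-815 of figure 8a.

    sentence_repr: vector g computed from the tagger's probability distribution
    by a first network.
    W: matrix standing in for the second (feed-forward) neural network; its shape
    and training procedure are not specified here and are assumptions.
    The energy (L1 or L2 norm) of the resulting vector is used as the fourth score.
    """
    x = np.concatenate([sentence_repr, ngram_vec, type_vec])
    y = W @ x
    return float(np.sum(np.abs(y)) if norm == "l1" else np.linalg.norm(y))
```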
Figure 8b shows a schematic diagram illustrating a possible architecture of a first neural network for providing a vector representation from the probability distribution in processing block 805 of the learning-based matching scheme of figure 8a as implemented by the type matching apparatus 100 according to an embodiment. The figure shows a particular implementation of a ConvNet network (see block 833) for computing a vector representation. The probability distribution from the sequence tagger 111 is arranged as a matrix of dimensions λ×m, where λ is the length of the input sentence (block 835) and m is the number of categories. This matrix is convolved with a plurality of filters of size k×m, also called a volume (block 837), and the output is flattened and further processed by a fully-connected layer (block 839). The final vector is returned to the further processing blocks of figure 8a.
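A plain-NumPy sketch of this convolution over the λ×m probability matrix is given below; the number of filters, the filter height k, the fully-connected layer size and the ReLU non-linearity are illustrative assumptions (the sketch assumes λ ≥ k):

```python
import numpy as np

def convnet_sentence_vector(prob_matrix, filters, W_fc, b_fc):
    """Sketch of figure 8b: convolve the probability matrix and flatten.

    prob_matrix: array of shape (lam, m), one probability row per token.
    filters: array of shape (num_filters, k, m), the convolutional volume.
    W_fc, b_fc: parameters of the fully-connected layer applied after flattening.
    """
    lam, m = prob_matrix.shape
    num_filters, k, _ = filters.shape
    conv_out = np.zeros((num_filters, lam - k + 1))
    for f in range(num_filters):
        for i in range(lam - k + 1):
            conv_out[f, i] = np.sum(prob_matrix[i:i + k, :] * filters[f])
    flat = conv_out.flatten()
    return np.maximum(W_fc @ flat + b_fc, 0.0)  # ReLU non-linearity (assumed)
```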
Figure 8c shows a schematic diagram illustrating the architecture of another possible implementation of a first neural network for providing a vector representation in processing block 805 of the learning-based matching scheme as implemented by the type matching apparatus 100 according to an embodiment. This is a particular example of a weighted average and non-linearity network (block 843). The probabilities that each token belongs to each category are used to compute a weighted average, which is then processed by a non-linear function such as, but not limited to, ReLU or leaky ReLU. The final vector is returned to the further processing blocks of figure 8a.
Figure 9 is a flow diagram illustrating different steps of a method 900 for processing, in particular type matching, the text sample 101. The method 900 comprises the steps of:
parsing 901 a plurality of tokens from the text sample 101, i.e. tokenizing the text sample 101;
generating 903 a plurality of n-grams based on the plurality of tokens, wherein each n-gram comprises one or more of the plurality of tokens;
determining 905 for each of the plurality of n-grams, using a term-based matching scheme, one or more first scores and one or more first candidate types from a pre-defined vocabulary of candidate types, wherein the one or more first scores are indicative of the semantic similarity between the n-gram and the one or more first candidate types;
determining 907 for each of the plurality of n-grams, using a vector-based matching scheme, one or more second scores and one or more second candidate types from the pre-defined vocabulary of candidate types, wherein the one or more second scores are indicative of the semantic similarity between the n-gram and the one or more second candidate types;
determining 909 for each of the plurality of n-grams, using a sequence tagging model, a third score, wherein the third score is indicative of the degree in which each n-gram represents a type;
determining 911 for each of the plurality of n-grams one or more composite scores based on the one or more first scores, the one or more second scores and the third score; and
ranking 913 the plurality of n-grams 401 based on the one or more composite scores.
The type matching method 900 may further comprise the step of returning as output one or more of the highest scoring n-grams and a list of the associated composite scores and candidate types.
The type matching method 900 may be performed by the type matching apparatus 100 described above. Thus, further features of the type matching method 900 result directly from the functionality of the type matching apparatus 100 and its different embodiments described above and below.
The person skilled in the art will understand that the "blocks" ( "units" ) of the various figures (method and apparatus) represent or describe functionalities of embodiments of the present disclosure (rather than necessarily individual "units" in hardware or software) and thus describe equally functions or features of apparatus embodiments as well as method embodiments (unit = step) .
In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus, and method may be implemented in other manners. For example, the described embodiment of an apparatus is merely exemplary. For example, the unit division is merely logical function division and may be another division in an actual implementation. For example, a plurality of units or components may be combined or integrated into another system, or some features may be ignored or not performed. In addition, the displayed or discussed mutual couplings or direct couplings or communication connections may be implemented by using some interfaces. The indirect couplings or communication connections between the apparatuses or units may be implemented in electronic, mechanical, or other forms.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one position, or may be distributed on a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.
In addition, functional units in the embodiments of the invention may be integrated into one processing unit, or each of the units may exist alone physically, or two or more units are integrated into one unit.

Claims (18)

  1. An apparatus (100) for processing a text sample (101) , wherein the apparatus (100) is configured to:
    parse a plurality of tokens from the text sample (101) ;
    generate a plurality of n-grams (401) based on the plurality of tokens, wherein each n-gram comprises one or more of the plurality of tokens;
    determine for each of the plurality of n-grams (401) , using a term-based matching scheme (107) , one or more first scores and one or more first candidate types from a pre-defined vocabulary of candidate types;
    determine for each of the plurality of n-grams (401) , using a vector-based matching scheme (109) , one or more second scores and one or more second candidate types from the pre-defined vocabulary of candidate types;
    determine for each of the plurality of n-grams (401) , using a sequence tagging model (111) , a third score;
    determine for each of the plurality of n-grams (401) one or more composite scores based on the one or more first scores, the one or more second scores and the third score; and
    rank the plurality of n-grams (401) based on the one or more composite scores.
  2. The apparatus (100) of claim 1, wherein the sequence tagging model (111) defines for each of the plurality of tokens of the text sample (101) a probability that the token is a type.
  3. The apparatus (100) of claim 2, wherein the apparatus (100) is configured to determine for each of the plurality of n-grams (401) , using the sequence tagging model (111) , the third score as a weighted average of the probabilities that each of the one or more tokens of the n-gram is a type.
  4. The apparatus (100) of claim 3, wherein the apparatus (100) is configured to determine the weighted average of the probabilities that each of the one or more tokens of the n-gram is a type by using a negative weight for those of the one or more tokens of the n-gram that are a non-type.
  5. The apparatus (100) of claim 2, wherein the apparatus (100) is configured to determine for each of the plurality of n-grams (401) , using the sequence tagging model (111) , the third score based on those probabilities that each of the one or more tokens of the n-gram is a type that exceed a threshold probability.
  6. The apparatus (100) of any one of the preceding claims, wherein the one or more composite scores are one or more weighted sums of the one or more first scores, the one or more second scores and the third score of each of the plurality of n-grams.
  7. The apparatus (100) of any one of the preceding claims, wherein the apparatus (100) is further configured to:
    determine for each of the plurality of n-grams (401) , using a learning-based matching scheme (711) , one or more fourth scores and one or more fourth candidate types from the pre-defined vocabulary of candidate types; and
    determine for each of the plurality of n-grams (401) the one or more composite scores based on the one or more first scores, the one or more second scores, the third score and the one or more fourth scores.
  8. The apparatus (100) of claim 7, wherein for determining the one or more fourth scores for each of the plurality of n-grams (401) using the learning-based matching scheme the apparatus (100) is configured to:
    determine an embedding vector for the text sample from the probabilities of each of the one or more of its tokens using a first neural network (833) ;
    determine an embedding vector for the n-gram and a plurality of further embedding vectors for the plurality of candidate types from the pre-defined vocabulary of candidate types; and
    feed the embedding vector for the text sample, the embedding vector for the n-gram and the further plurality of embedding vectors into a second feed-forward neural network (813) for determining the one or more fourth scores.
  9. The apparatus (100) of any one of the preceding claims, wherein for determining the one or more first scores for each of the plurality of n-grams using the term-based matching scheme (107) the apparatus (100) is configured to:
    search based on the one or more tokens of the n-gram for the one or more first candidate types using an inverted index mapping tokens to types;
    determine a search score for each of the one or more first candidate types; and
    determine one or more of the largest search scores of the one or more first candidate types as the one or more first scores for the n-gram.
  10. The apparatus (100) of claim 9, wherein the apparatus (100) is configured to generate the inverted index by:
    associating each of the plurality of first candidate types with one or more documents from a knowledge base describing the respective first candidate type by means of one or more tokens; and
    generating the inverted index by mapping each of the one or more tokens of the one or more documents from the knowledge base to the respective first candidate type.
  11. The apparatus (100) of claim 9 or 10, wherein for determining the one or more first scores for each of the plurality of n-grams (401) using the term-based matching scheme  (107) the apparatus (100) is further configured to normalize the one or more first scores for each of the n-grams (401) based on the number of tokens of the respective n-gram.
  12. The apparatus (100) of any one of the preceding claims, wherein for determining the one or more second scores for each of the plurality of n-grams (401) using the vector-based matching scheme (109) the apparatus (100) is configured to:
    determine an embedding vector for the n-gram;
    determine a plurality of similarity scores between the embedding vector for the n-gram and a plurality of further embedding vectors, wherein the plurality of further embedding vectors are based on the plurality of second candidate types from the pre-defined dictionary of candidate types; and
    determine one or more of the largest similarity scores as the one or more second scores for the n-gram.
  13. The apparatus (100) of claim 12, wherein for each n-gram the apparatus (100) is configured to determine the embedding vector for the n-gram as a weighted sum of a plurality of token vectors, wherein each token vector is associated with one of the one or more tokens of the n-gram.
  14. The apparatus (100) of claim 13, wherein the text sample is a non-segmented text sample and wherein the apparatus is further configured to segment the text sample into one or more tokens.
  15. The apparatus (100) of claim 14, wherein the apparatus (100) is further configured to:
    concatenate the tokens of the n-gram for obtaining a composite token comprising a continuous string of characters;
    retrieve a composite vector associated with the composite token from a dictionary of vectors;
    determine the embedding vector as the average vector of the composite vector and the weighted sum of the plurality of token vectors.
  16. The apparatus (100) of any one of the preceding claims, wherein the text sample (101) is a query (101) and wherein the apparatus (100) is configured to provide one or more query results based on one or more of the plurality of n-grams ranked based on the one or more composite scores.
  17. A method (900) for processing a text sample (101) , wherein the method (900) comprises:
    parsing (901) a plurality of tokens from the text sample (101) ;
    generating (903) a plurality of n-grams (401) based on the plurality of tokens, wherein each n-gram comprises one or more of the plurality of tokens;
    determining (905) for each of the plurality of n-grams (401) , using a term-based matching scheme, one or more first scores and one or more first candidate types from a pre-defined vocabulary of candidate types;
    determining (907) for each of the plurality of n-grams (401) , using a vector-based matching scheme, one or more second scores and one or more second candidate types from the pre-defined vocabulary of candidate types;
    determining (909) for each of the plurality of n-grams (401) , using a sequence tagging model, a third score;
    determining (911) for each of the plurality of n-grams (401) one or more composite scores based on the one or more first scores, the one or more second scores and the third score; and
    ranking (913) the plurality of n-grams (401) based on the one or more composite scores.
  18. A computer program product comprising a non-transitory computer-readable storage medium for storing program code which causes a computer or a processor to perform the method (900) of claim 17, when the program code is executed by the computer or the processor.


