US20190370332A1 - Semantic textual similarity system - Google Patents

Semantic textual similarity system

Info

Publication number
US20190370332A1
Authority
US
United States
Prior art keywords
text
lstm
branch
level
similarity
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
US15/993,893
Other versions
US10606956B2
Inventor
Bernt ANDRASSY
Pankaj Gupta
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Siemens AG
Original Assignee
Siemens AG
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Siemens AG filed Critical Siemens AG
Priority to US15/993,893
Publication of US20190370332A1
Assigned to SIEMENS AKTIENGESELLSCHAFT. ASSIGNMENT OF ASSIGNORS INTEREST (SEE DOCUMENT FOR DETAILS). Assignors: ANDRASSY, BERNT; GUPTA, PANKAJ
Application granted
Publication of US10606956B2
Legal status: Active
Adjusted expiration

Classifications

    • G06F17/2785
    • G06F40/30 Semantic analysis (Handling natural language data)
    • G06F16/313 Selection or weighting of terms for indexing
    • G06F16/33 Querying (Information retrieval of unstructured textual data)
    • G06F17/30616
    • G06F17/30634
    • G06F40/216 Parsing using statistical methods
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06N3/04 Architecture, e.g. interconnection topology (Neural networks)
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G06N3/045 Combinations of networks
    • G06N3/048 Activation functions
    • G06N3/08 Learning methods

Abstract

A Semantic Textual Similarity System comprising a first Long Short Term Memory, LSTM, branch adapted to be operative, to determine text similarity, on a first text corpus, the first text corpus comprising a plurality of first text elements; wherein each first text element has a first number of distinct subdivisions. The system also comprises a second LSTM branch adapted to be operative, to determine text similarity, on a second text corpus, the second text corpus comprising a plurality of second text elements, wherein each second text element has a second number of distinct subdivisions.

Description

    FIELD OF TECHNOLOGY
  • The present invention concerns a Semantic Textual Similarity System.
  • BACKGROUND
  • Digital handling of texts, like Natural Language Processing, e.g. Information Retrieval or text understanding, is often based on semantic analysis of text, in particular on semantic similarity. Machine learning and/or deep learning may be used for such tasks. Systems for semantic text analysis may be referred to as Semantic Textual Similarity Systems. Long Short Term Memory (LSTM) arrangements, a specific form of Recurrent Neural Networks (RNNs), have been found to be useful in such systems, which can also be considered deep learning or machine learning systems. In one approach of text analysis, it may be attempted to identify text elements in a text corpus that are similar to an input text element based on similarity learning performed on a first text corpus of texts comparable, e.g. in structure and/or content, to the input text element, and a second text corpus with associated text elements.
  • SUMMARY
  • It is an advantage of the embodiments of the invention to provide improved approaches for a semantic textual similarity system, in particular in terms of reliably learning similarity utilising LSTM.
  • Accordingly, there is disclosed a Semantic Textual Similarity (STS) System. The system comprises a first Long Short Term Memory, LSTM, branch adapted to be operative, to determine text similarity, on a first text corpus comprising a plurality of first text elements, wherein each first text element has a first number of distinct subdivisions. The system also comprises a second LSTM branch adapted to be operative, to determine text similarity, on a second text corpus, the second text corpus comprising a plurality of second text elements, wherein each second text element has a second number of distinct subdivisions. The first LSTM branch comprises for each of the first number of distinct subdivisions a first branch LSTM level. Each first branch LSTM level is adapted to be operative, for each of the first text elements, on an associated subdivision of the first text element utilising first weights to determine a hidden state vector associated to the first branch LSTM level. Each first weight is associated to a subelement of a subdivision of a first text element. The second LSTM branch comprises for each of the second number of distinct subdivisions a second branch LSTM level. Each second branch LSTM level is adapted to be operative, for each of the second text elements, on an associated subdivision utilising a plurality of second weights to determine a hidden state vector associated to the second branch LSTM level. Each second weight is associated to a subelement of a subdivision of a second text element.
  • The first weights and second weights are shared between the first LSTM branch and the second LSTM branch for iteratively determining similarity between first text elements and second text elements based on hidden state vectors. By sharing the weights between the levels, it is possible to improve the similarity determination in particular for text elements having distinct subdivisions with highly different structures.
  • In particular, a first LSTM branch level may be adapted to determine a hidden state vector based on second weights, e.g. due to the weights being shared. Alternatively, or additionally, a second LSTM branch level may be adapted to determine a hidden state vector based on first weights. This allows the system to provide improved context and similarity determination between the distinct subdivisions. All branch levels may be adapted accordingly.
  • In particular, it may be considered that a first LSTM branch level is adapted to determine a hidden state vector based on second weights from more than one second LSTM branch level, in particular from all second LSTM branch levels. Alternatively, or additionally, it may be considered that a second LSTM branch level is adapted to determine a hidden state vector based on first weights from more than one first LSTM branch level, in particular all first LSTM branch levels. Such cross-level sharing allows improved determination of similarity even between very differently worded subdivisions.
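  • As an illustration only, a minimal sketch of such a dual-branch arrangement with tied weights, assuming PyTorch and pre-embedded word sequences as inputs; the class and variable names are illustrative and not taken from the patent:

    import torch
    import torch.nn as nn

    class DualBranchEncoder(nn.Module):
        """Encodes the subdivisions of both corpora with a single shared LSTM,
        so the first weights and second weights are tied across all levels."""
        def __init__(self, embedding_dim=300, hidden_dim=50):
            super().__init__()
            # One LSTM instance reused by every level of both branches
            # realises the weight sharing described above.
            self.shared_lstm = nn.LSTM(embedding_dim, hidden_dim, batch_first=True)

        def encode(self, word_vectors):
            # word_vectors: (batch, seq_len, embedding_dim) for one subdivision
            _, (h_n, _) = self.shared_lstm(word_vectors)
            return h_n[-1]  # hidden-state vector h for this subdivision

        def forward(self, first_subdivisions, second_subdivisions):
            # first_subdivisions: e.g. [SUB1, DESC1]; second_subdivisions: [SUB2, DESC2, SOL2]
            h_first = [self.encode(x) for x in first_subdivisions]
            h_second = [self.encode(x) for x in second_subdivisions]
            return h_first, h_second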
  • It may be considered that the first number of subdivisions is smaller than the second number of subdivisions, such that the branches may have different numbers of levels. Accordingly, differently structured text elements may be treated. In some cases, the first number may be 2, and the second number may be 3.
  • A subdivision of a first text element and/or a second text element may consist of one sentence or phrase, e.g. a title or short description. Further subdivisions may be longer. Thus, asymmetric text elements and/or subdivisions may be handled.
  • It may be considered that each level of the first LSTM branch is connected to each level of the second LSTM branch for sharing weights. Connection between levels may be via suitable interfaces allowing communication, in particular sharing of weights.
  • In general, the first and the second LSTM branches may be connected to a topic model, which may, for example, be used for learning latent representations of text elements, and/or for retrieval, and/or for evaluating similarity. Connection may be via suitable interfaces, e.g. for each level to a topic model. Example topic models are Latent Dirichlet Allocation (LDA) models, Replicated Softmax (RSM), or Document Neural Autoregressive Distribution Estimator (DOCNADE), or a model based on DOCNADE.
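  • A possible sketch of deriving such topical features with an LDA topic model, assuming the gensim library; the toy corpus and the number of topics are illustrative choices, not values from the patent:

    from gensim import corpora
    from gensim.models import LdaModel

    # Tokenised subdivisions (e.g. SUB, DESC or SOL texts) as lists of words.
    texts = [["server", "crash", "reboot"], ["printer", "driver", "update"]]

    dictionary = corpora.Dictionary(texts)
    bow_corpus = [dictionary.doc2bow(t) for t in texts]

    # Topic model that the LSTM branches may be connected to; 20 topics is arbitrary.
    lda = LdaModel(bow_corpus, num_topics=20, id2word=dictionary)

    # Topic vector T for one subdivision, usable as one similarity channel.
    T = lda.get_document_topics(bow_corpus[0], minimum_probability=0.0)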
  • The system may in general be adapted to determine similarity between an input text element and a second text element based on learning performed on the first text corpus and the second text corpus. The learning may be performed using the LSTM branches. In general, the learning may provide a structured space for representing pairs of first and second text elements based on a similarity measure or score, which may be based on multi-level and/or cross-level and/or asymmetric text similarities. In general, multi-level may pertain to subdivisions of first and second elements of the same level, and cross-level of different levels. Asymmetric may pertain to differences in text length of subdivisions, e.g. number of words and/or sentences. Subdivisions may be considered asymmetric if their average lengths differ by a factor of at least 2, or at least 3, or at least 5. It may be considered that the system is adapted for retrieving for an input text element, e.g. a query, a set of one or more second text elements having the largest similarity.
  • The system may be adapted to evaluate similarity between first text elements and second text elements based on a plurality of channels. A channel may provide a similarity measure or metric, e.g. based on a topic model, a sum-average approach and/or hidden vectors of LSTM branches and/or levels. A generalised similarity metric based on the plurality of channels and/or associated metrics or measures may be utilised.
  • It may be considered that the system is adapted to evaluate similarity between first text elements and second text elements based on a Manhattan metric, which may be a generalised similarity metric. This facilitates reliable similarity measurement.
  • In some variants, the first text elements may be queries for an industrial ticket system, and the second text elements may represent a set of solutions for queried problems. It may be considered that the second text elements represent historical tickets with solutions. The approaches allow in particular reliable retrieval of known solutions to historical queries for new queries. An input text may be a query without solution.
  • The system may be implemented in hardware and/or software and/or firmware, e.g. formulated as a computer program product (non-transitory computer readable storage medium having instructions, which when executed by a processor, perform actions) with instructions causing processing circuitry to carry out and/or control the functionality of the system. The above described system may comprise a computer system. A method comprising the functional actions the system is adapted for may be considered, as well as a computer program comprising instructions causing a computer and/or processing circuitry to carry out and/or control a corresponding method. Also, a storage medium storing such a program is proposed.
  • It may be considered that the system comprises individual modules or subsystems for representing individual functionality, e.g. a LSTM module for each LSTM level of each branch, and/or associated topic model module/s and/or metric module/s and/or sum-average module/s. A module generally may be implemented in software. Alternatively, or additionally, it may be considered that the system comprises and/or utilises integrated circuitry, in particular processing circuitry, for providing the functionalities. Integrated circuitry or processing circuitry may comprise one or more processors, e.g. microprocessor/s, and/or FPGAs (Field Programmable Gate Arrays) and/or ASICs (Application-Specific Integrated Circuits) and/or microcontrollers. The circuitry may comprise, and/or be connected or connectable to, memory, e.g. transient and/or volatile and/or non-volatile memory. Examples of memory comprise RAM (Random Access Memory), ROM (Read-Only Memory), cache memory, buffer memory, optical memory or magnetic memory. The system may be centralised or distributed, e.g. with different functionalities associated to different modules or units, e.g. communicating via suitable interfaces like communication interfaces or software interfaces.
  • In general, a topic model and/or topical features determined thereon may be considered when computing similarity in asymmetric texts, along with semantic representations obtained from LSTM structures. In particular, similarity may be determined based on the topic model and/or a topic representation determined based on the topic model. LDA and DocNADE variants may be employed to compute topical features, which may extract abstract or summarized text or ticket representations. Jointly learning pairwise similarity in tickets or texts exploiting latent/hidden text representation and topical features together is proposed.
  • BRIEF DESCRIPTION
  • The above-described properties, features and advantages of the present invention, as well as the way they are achieved, will be made clearer and better understandable in light of the following discussion, making reference to exemplary embodiments shown in the accompanying figures, in which
  • FIG. 1 shows an exemplary LSTM level structure;
  • FIG. 2 shows an exemplary STS system.
  • DETAILED DESCRIPTION
  • In the following, examples are explained in the context of an Industrial Ticketing System. In such systems, queries q are filed identifying technical issues to be fixed. A query may comprise distinct subdivisions, in particular a subject (SUB) and a description (DESC). In a knowledge base, there are stored historical tickets t, which comprise as distinct subdivisions a subject (SUB) and a description (DESC) similar to a query, as well as a description of a solution (SOL). The historical tickets t may be considered resolved queries. There may be u queries q1 . . . qu in a set of queries, which may be in a query base, and v historical tickets t1 . . . tv, wherein u and v may be different. In general, subdivisions may be distinguished by a name, label or reference, and/or text structure, and/or style, and/or format. The queries may be considered examples of first text elements, and the set of queries a first text corpus. The historical tickets may be considered examples of second text elements, and the set of historical tickets may be considered an example of a second text corpus. References to the first text corpus may be labelled “1”, and references to the second text corpus may be labelled “2”. Thus, a first text element may comprise the subdivisions (SUB1, DESC1), a second text element (SUB2, DESC2, SOL2). The sizes of the subdivisions of text elements may be different. In particular, SUB may be a short text, e.g. a sentence or phrase, indicating a problem and/or topic. A SOL or DESC may be significantly longer than the SUB. Often, the terminology in SUB and DESC may be closely related, whereas a SOL, while being topically related to SUB and/or DESC, may be written using deviating terminology, e.g. due to being written by a technician solving the problem identified by another person in the SUB and DESC of a ticket. Other text elements may be used, which may have different structures and/or styles, and which analogously may comprise differently sized subdivisions and/or different numbers of subdivisions for text elements of different corpora. A subelement of a text element and/or subdivision may be a sentence, or phrase, or in particular a word. Elements of the same corpus may be considered to have the same structure of subdivisions, such that e.g. each first text element may have a SUB1 and DESC1, and each second text element may have a SUB2, DESC2 and SOL2, or be treated as such. Text elements of different corpora may have different structures, in particular different numbers of subdivisions. To a subdivision, a level may be associated. Subdivisions of text elements of different corpora having similar structural meaning or function (e.g., subject indication, or description) may be considered of a similar level.
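  • As a small, purely illustrative sketch of how the two corpora might be represented in code (the field names are hypothetical and merely mirror the subdivisions above):

    from dataclasses import dataclass

    @dataclass
    class Query:        # first text element: two subdivisions
        sub: str        # SUB1, short subject line
        desc: str       # DESC1, longer problem description

    @dataclass
    class Ticket:       # second text element: three subdivisions
        sub: str        # SUB2
        desc: str       # DESC2
        sol: str        # SOL2, solution text, often worded differently

    first_corpus = [Query("VPN drops", "Connection to the plant VPN fails every hour.")]
    second_corpus = [Ticket("VPN unstable", "Tunnel resets hourly.", "Updated gateway firmware.")]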
  • FIG. 1 shows an exemplary LSTM structure 100 with two branches, which may be referred to as a dual branch arrangement. One branch is associated to the first text corpus and labelled 1, the other to the second text corpus, labelled 2. The first branch comprises, for each subdivision of the associated first text elements, a LSTM module, in this example LSTMSUB1 and LSTMDESC1. The second branch comprises, for each subdivision of the associated second text elements, a LSTM module, in this example LSTMSUB2, LSTMDESC2 and LSTMSOL2. As can be seen, the text elements associated to the two branches are structurally and stylistically distinct. Each subdivision or associated module represents one level of the associated branch. To a level of one branch there may be associated, e.g. topically or structurally, a level of the other branch, such that LSTMSUB1 is associated to LSTMSUB2 and LSTMDESC1 is associated to LSTMDESC2. There may be cross-level association, e.g. LSTMSUB1 to LSTMDESC2 and/or LSTMSOL2, and/or LSTMDESC1 to LSTMSUB2 and/or LSTMSOL2.
  • The LSTM branch arrangement may be implemented as a Siamese LSTM, which may have tied weights and an objective function, e.g. g( ) as described below as equation (1):

  • g(h, E, T, W_h, W_E, W_T, V) = \exp\big(-\sum_{p \in \{\mathrm{SUB1}, \mathrm{DESC1}\}} \sum_{q \in \{\mathrm{SUB2}, \mathrm{DESC2}, \mathrm{SOL2}\}} V_{\{p,q\}} \big(W_h \|h_p - h_q\|_1 + W_E \|E_p - E_q\|_1 + W_T \|T_p - T_q\|_1\big)\big)  (1)
  • Approaches described herein comprise using LSTM to learn a highly structured space representation of each pair of text elements formed from a first text element and a second text element, which may include multi-level and cross-level textual similarities, in particular asymmetric similarities.
  • In general, LSTM may be considered as a form of Recurrent Neural Network in which memory cells, respectively their associated hidden vectors or hidden-state representations, are sequentially or iteratively updated. A memory state c_t and three gates controlling the flow of information over time or iteration steps may be utilised. In particular, an input gate i_t may control how much of an input x_t is to be stored in memory, an output gate o_t may control how much of c_t should be exposed to the next node of the LSTM level, and a forget gate f_t may determine what should be forgotten. Example dynamics for an LSTM level may be described as equations (2):

  • i_t = \mathrm{sigmoid}(W_i x_t + U_i h_{t-1})
  • f_t = \mathrm{sigmoid}(W_f x_t + U_f h_{t-1})
  • o_t = \mathrm{sigmoid}(W_o x_t + U_o h_{t-1})
  • \tilde{c}_t = \tanh(W_c x_t + U_c h_{t-1})
  • c_t = i_t \odot \tilde{c}_t + f_t \odot c_{t-1}
  • h_t = o_t \odot \tanh(c_t)  (2)
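  • A direct numpy transcription of equations (2) for a single time step may look as follows (a sketch only; dimensions and initialisation are illustrative, and bias terms are omitted as in the equations above):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def lstm_step(x_t, h_prev, c_prev, W, U):
        """One update of memory state c_t and hidden state h_t per equations (2).
        W and U are dicts holding the input and recurrent weight matrices per gate."""
        i_t = sigmoid(W["i"] @ x_t + U["i"] @ h_prev)      # input gate
        f_t = sigmoid(W["f"] @ x_t + U["f"] @ h_prev)      # forget gate
        o_t = sigmoid(W["o"] @ x_t + U["o"] @ h_prev)      # output gate
        c_tilde = np.tanh(W["c"] @ x_t + U["c"] @ h_prev)  # candidate memory
        c_t = i_t * c_tilde + f_t * c_prev                 # new memory state
        h_t = o_t * np.tanh(c_t)                           # new hidden state
        return h_t, c_t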
  • Each LSTM level learns a mapping from a space of variable-length sequences of length T to a hidden-state vector h, wherein each sequence may be extracted from the corresponding subdivision of a text element, and may comprise one or more subelements like words or phrases. Each text element of the corpora may undergo LSTM processing. A sequence may in particular represent a sentence or half-sentence. Each sequence or sentence with elements or words (w_1, . . . , w_t) of a subdivision may be passed to the associated LSTM level, which updates the hidden-state vector h according to equations (2), resulting in a final encoded extracted hidden-state vector h. The W matrices represent weights relating to the input variables; the U matrices represent weights relating to the hidden-state vector to be updated. Either can be shared and/or can take into account corresponding weights of one or more other levels. In particular, a weight, like e.g. W, may be determined based on one or more weights shared from one or more other levels, which may pertain to the same input value x_i, corresponding to one word w_i. Each subdivision may comprise sentences or sequences S, which may be indexed 1 . . . n, depending on the number of sequences in the subdivision. The arrangement may be referred to as replicated, due to the sharing of the weights.
  • E in equation (1) may represent a sum-average over word embeddings metric, SumEMB, e.g. based on representing sentences or sequences as a bag of words. For each branch level or subdivision, such a metric may be determined and considered for a generalised metric. Moreover, a topic model metric (T) may be provided for each subdivision or LSTM branch level.
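  • A minimal sketch of such a sum-average word-embedding representation E, assuming pre-trained word vectors stored in a dictionary; the averaging choice and dimensionality are assumptions, not values specified by the patent:

    import numpy as np

    def sum_emb(words, embeddings, dim=300):
        """Bag-of-words style sum-average of word embeddings for one subdivision."""
        vectors = [embeddings[w] for w in words if w in embeddings]
        if not vectors:
            return np.zeros(dim)
        return np.mean(vectors, axis=0)  # average of the summed word vectors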
  • The different metrics h, T and E may be weighted with W_h, W_T and W_E, respectively, to enter into the generalised metric g( ), which may use an l_1 norm. Weights V may be associated to the different levels or subdivisions. g( ) may be considered a Multi-Channel Manhattan metric.
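  • A sketch of this generalised Multi-Channel Manhattan metric per equation (1), assuming the per-subdivision vectors h, E and T have already been computed; the channel weights W_h, W_E, W_T and the level weights V are illustrative scalars:

    import numpy as np

    def g_similarity(reps1, reps2, W_h, W_E, W_T, V):
        """reps1/reps2: dicts mapping subdivision name -> {'h': ..., 'E': ..., 'T': ...}.
        V maps a level pair (p, q) to its weight. Implements equation (1)."""
        total = 0.0
        for p, r1 in reps1.items():        # p in {SUB1, DESC1}
            for q, r2 in reps2.items():    # q in {SUB2, DESC2, SOL2}
                d = (W_h * np.abs(r1["h"] - r2["h"]).sum()    # l1 distance per channel
                     + W_E * np.abs(r1["E"] - r2["E"]).sum()
                     + W_T * np.abs(r1["T"] - r2["T"]).sum())
                total += V[(p, q)] * d
        return np.exp(-total)              # similarity in (0, 1]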
  • FIG. 2 shows an exemplary STS system 200 with a first LSTM branch comprising two levels LSTMSUB1 and LSTMDESC1, into which sentences with elements w_1 . . . w_A and w_1 . . . w_B, respectively, are input. This branch may be associated to a first text corpus, in particular to a set of queries q, which may have the subdivisions SUB1 and DESC1. A second LSTM branch comprises three levels LSTMSUB2, LSTMDESC2 and LSTMSOL2, exemplarily representing historical queries t, with input sentences w_1 . . . w_C, w_1 . . . w_D, w_1 . . . w_E. In this case it may be assumed that each SUB subdivision has one sentence only, and that for the other subdivisions the number of sentences runs from S_1 to S_N, S_1 to S_M, or S_1 to S_P, respectively. A to E and N, M and P may vary between respective text elements. For each subdivision, a hidden-state vector h is provided, as well as a metric E and a topic model metric T. Accordingly, to each LSTM branch level, there may be associated a sum-average metric module and/or a topic model. However, in some cases, there may be a module adapted to determine metrics of multiple LSTM levels serially or in parallel, e.g. one topic model module and/or one sum-average metric module. The metrics associated to different LSTM branch levels are passed to a generalised metric representing similarity learning, which may represent the similarity of queries with historical queries and their associated solutions. This facilitates quick and reliable information retrieval for input queries, such that the most similar historical queries y may be retrieved, improving the chances of finding, for an input query, a correct solution that was already implemented.
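  • Given such a similarity score, retrieval of the most similar historical tickets for an input query might look as follows; this is a sketch only, where score_fn stands for a metric like g_similarity above and the per-ticket representations are assumed to be precomputed:

    def retrieve(query_reps, ticket_reps_list, score_fn, top_k=5):
        """Rank historical tickets by the learned similarity to the input query."""
        scored = sorted(((score_fn(query_reps, t_reps), idx)
                         for idx, t_reps in enumerate(ticket_reps_list)),
                        reverse=True)
        return scored[:top_k]  # (score, ticket index) pairs, most similar first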
  • Even though the present invention has been illustrated and explained in detail above with reference to the preferred embodiments, the invention is not to be construed as limited to the given examples. Variants or alternative combinations of features given in different embodiments may be derived by a subject matter expert without exceeding the scope of the present invention.

Claims (11)

1. A Semantic Textual Similarity System comprising:
a first Long Short Term Memory, LSTM, branch adapted to be operative, to determine text similarity, on a first text corpus, the first text corpus comprising a plurality of first text elements; wherein each first text element has a first number of distinct subdivisions;
a second LSTM branch adapted to be operative, to determine text similarity, on a second text corpus, the second text corpus comprising a plurality of second text elements, wherein each second text element (t1 . . . tv) has a second number of distinct subdivisions;
wherein the first LSTM branch comprises for each of the first number of distinct subdivisions a first branch LSTM level, each first branch LSTM level being adapted to be operative, for each of the first text elements, on an associated subdivision of the first text element utilising first weights to determine a hidden state vector associated to the first branch LSTM level, each first weight being associated to a subelement of a subdivision of a first text element;
wherein the second LSTM branch comprises for each of the second number of distinct subdivisions a second branch LSTM level, each second branch LSTM level being adapted to be operative, for each of the second text elements, on an associated subdivision utilising a plurality of second weights to determine a hidden state vector associated to the second branch LSTM level; each second weight being associated to a subelement of a subdivision of a second text element;
wherein the first weights and second weights are shared between the first LSTM branch and the second LSTM branch for iteratively determining similarity between first text elements and second text elements based on hidden state vectors.
2. A system according to claim 1, wherein a first LSTM branch level is adapted to determine a hidden state vector based on second weights.
3. The system according to claim 1, wherein a first LSTM branch level is adapted to determine a hidden state vector based on second weights from more than one second LSTM branch level.
4. The system according to claim 1, wherein the first number of subdivisions is smaller than the second number of subdivisions.
5. The system according to claim 1, wherein a subdivision of a first text element and/or a second text element consists of one sentence.
6. The system according to claim 1, wherein each level of the first LSTM branch is connected to each level of the second LSTM branch for sharing weights.
7. The system according to claim 1, wherein the first and the second LSTM branches are connected to a topic model.
8. The system according to claim 1, wherein the system is adapted to determine similarity between an input text element and a second text element y based on learning performed on the first text corpus and the second text corpus.
9. The system according to claim 1, wherein the system is adapted to evaluate similarity between first text elements and second text elements based on a plurality of channels.
10. The system according to claim 1, wherein the system is adapted to evaluate similarity between first text elements and second text elements based on a Manhattan metric.
11. The system according to claim 1, wherein the first text elements are queries for an industrial ticket system, and the second text elements represent a set of solutions for queried problems.
US15/993,893 2018-05-31 2018-05-31 Semantic textual similarity system Active 2038-09-05 US10606956B2 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US15/993,893 US10606956B2 (en) 2018-05-31 2018-05-31 Semantic textual similarity system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
US15/993,893 US10606956B2 (en) 2018-05-31 2018-05-31 Semantic textual similarity system

Publications (2)

Publication Number Publication Date
US20190370332A1 (en) 2019-12-05
US10606956B2 (en) 2020-03-31

Family

ID=68694095

Family Applications (1)

Application Number Title Priority Date Filing Date
US15/993,893 Active 2038-09-05 US10606956B2 (en) 2018-05-31 2018-05-31 Semantic textual similarity system

Country Status (1)

Country Link
US (1) US10606956B2 (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111198949A (en) * 2020-04-10 2020-05-26 支付宝(杭州)信息技术有限公司 Text label determination method and system
CN112035607A (en) * 2020-08-19 2020-12-04 中南大学 MG-LSTM-based citation difference matching method, device and storage medium

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20050060643A1 (en) * 2003-08-25 2005-03-17 Miavia, Inc. Document similarity detection and classification system
US8065307B2 (en) * 2006-12-20 2011-11-22 Microsoft Corporation Parsing, analysis and scoring of document content
WO2014169334A1 (en) * 2013-04-15 2014-10-23 Contextual Systems Pty Ltd Methods and systems for improved document comparison
US9443513B2 (en) * 2014-03-24 2016-09-13 Educational Testing Service System and method for automated detection of plagiarized spoken responses

Also Published As

Publication number Publication date
US10606956B2 (en) 2020-03-31

Similar Documents

Publication Publication Date Title
US11216620B1 (en) Methods and apparatuses for training service model and determining text classification category
CN110334354B (en) Chinese relation extraction method
US11321671B2 (en) Job skill taxonomy
CN111222305B (en) Information structuring method and device
CN104699763B (en) The text similarity gauging system of multiple features fusion
US11568132B2 (en) Phrase generation relationship estimation model learning device, phrase generation device, method, and program
CN106021223A (en) Sentence similarity calculation method and system
US11481560B2 (en) Information processing device, information processing method, and program
CN112052684A (en) Named entity identification method, device, equipment and storage medium for power metering
US11651166B2 (en) Learning device of phrase generation model, phrase generation device, method, and program
CN113011689B (en) Evaluation method and device for software development workload and computing equipment
CN110472062B (en) Method and device for identifying named entity
CN111143569A (en) Data processing method and device and computer readable storage medium
CN111401928A (en) Method and device for determining semantic similarity of text based on graph data
CN112686025A (en) Chinese choice question interference item generation method based on free text
US10606956B2 (en) Semantic textual similarity system
CN109885745A (en) A kind of user draws a portrait method, apparatus, readable storage medium storing program for executing and terminal device
Schicchi et al. Machine learning models for measuring syntax complexity of english text
CN113779249B (en) Cross-domain text emotion classification method and device, storage medium and electronic equipment
CN111259147A (en) Sentence-level emotion prediction method and system based on adaptive attention mechanism
CN113220900B (en) Modeling Method of Entity Disambiguation Model and Entity Disambiguation Prediction Method
CN111198949B (en) Text label determination method and system
CN111241848A (en) Article reading comprehension answer retrieval system and device based on machine learning
Tashu Off-topic essay detection using C-BGRU siamese
CN115934948A (en) Knowledge enhancement-based drug entity relationship combined extraction method and system

Legal Events

Date Code Title Description
FEPP Fee payment procedure

Free format text: ENTITY STATUS SET TO UNDISCOUNTED (ORIGINAL EVENT CODE: BIG.); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY

STPP Information on status: patent application and granting procedure in general

Free format text: NOTICE OF ALLOWANCE MAILED -- APPLICATION RECEIVED IN OFFICE OF PUBLICATIONS

AS Assignment

Owner name: SIEMENS AKTIENGESELLSCHAFT, GERMANY

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:ANDRASSY, BERNT;GUPTA, PANKAJ;SIGNING DATES FROM 20191126 TO 20191209;REEL/FRAME:051260/0410

STPP Information on status: patent application and granting procedure in general

Free format text: PUBLICATIONS -- ISSUE FEE PAYMENT VERIFIED

STCF Information on status: patent grant

Free format text: PATENTED CASE

MAFP Maintenance fee payment

Free format text: PAYMENT OF MAINTENANCE FEE, 4TH YEAR, LARGE ENTITY (ORIGINAL EVENT CODE: M1551); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY

Year of fee payment: 4