WO2020131004A1 - Domain-independent automated processing of free-form text - Google Patents

Domain-independent automated processing of free-form text

Info

Publication number
WO2020131004A1
WO2020131004A1 · PCT/US2017/068911 · US2017068911W
Authority
WO
WIPO (PCT)
Prior art keywords
documents
candidate key phrase
textual input
document
Prior art date
Application number
PCT/US2017/068911
Other languages
English (en)
Inventor
Ahmet AKYAMAC
Rajarshi BHOWMIK
Original Assignee
Nokia Technologies Oy
Nokia Usa Inc.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nokia Technologies Oy and Nokia Usa Inc.
Priority to PCT/US2017/068911
Publication of WO2020131004A1

Links

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 - Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/40 - Document-oriented image-based pattern recognition
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/23 - Clustering techniques
    • G06F18/231 - Hierarchical techniques, i.e. dividing or merging pattern sets so as to obtain a dendrogram
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/28 - Determining representative reference patterns, e.g. by averaging or distorting; Generating dictionaries
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 - Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 - Character recognition
    • G06V30/24 - Character recognition characterised by the processing or recognition method
    • G06V30/242 - Division of the character sequences into groups prior to recognition; Selection of dictionaries
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 - Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 - Character recognition
    • G06V30/26 - Techniques for post-processing, e.g. correcting the recognition result
    • G06V30/262 - Techniques for post-processing, e.g. correcting the recognition result using context analysis, e.g. lexical, syntactic or semantic context
    • G06V30/274 - Syntactic or semantic context, e.g. balancing

Definitions

  • the present invention relates generally to data analytics, and, more particularly, to a text processing pipeline that is domain-independent, automated, and unsupervised.
  • Prior text processing and categorization techniques can be broadly classified into two categories: rule-based approaches and semantic/linguistic property based approaches.
  • Rule-based approaches require significant domain knowledge to extract key information from text.
  • a regular expression based pattern extraction technique for log analysis was proposed in "Semi-automatic Discovery of Extraction Patterns for Log Analysis" (white paper by Carasso, David, 2007, available from Splunk, Inc. at www.splunk.com), where the natural language processing (NLP) employs a natural language toolkit (NLTK) supporting part-of-speech (POS) pattern matching using regular expressions.
  • Categorization of unstructured text data depends heavily on the extraction of keywords or phrases that entail identifying the key concept of the text.
  • automatic key-phrase extraction is an established line of approaches which can be broadly categorized into two classes: supervised and unsupervised.
  • for supervised approaches see: Riloff, Ellen, and Wendy Lehnert, 1994, "Information extraction as a basis for high-precision text classification", ACM Transactions on Information Systems (TOIS), 296-333; and Turney, Peter D.
  • unsupervised approaches for key phrase extraction can be categorized into several groups (see, Hasan, Kazi Saidul, and Vincent Ng, 2014, "Automatic Keyphrase Extraction: A Survey of the State of the Art", Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, Baltimore, Maryland: Association for Computational Linguistics, 1262-1273).
  • a ranking of candidate phrases is obtained based on intra-document word co-occurrence statistics (see, Mihalcea, Rada, and Paul Tarau, 2004, "TextRank: Bringing Order into Text", EMNLP, 404-411), which uses a graph-based ranking algorithm to assign ranking scores to candidate key phrases.
  • Topical Key Phrase Ranking is more suitable for longer text documents, where each document is typically a mixture of multiple topics. For short text documents that usually cover a single topic (e.g., a customer care ticket describes one particular problem; one line of a machine log reports one particular event), the quality of the extracted key phrases is very poor. Another drawback of Topical Key Phrase Ranking is that less-represented topics in the corpus are difficult to discover.
  • a method and apparatus are provided for improved free-form text processing that is domain agnostic and automatically identifies key topics for textual data which does not require prior training or supervision, pre-labelling or annotation of the data, or domain expertise.
  • a multi-stage process (i.e., a text processing pipeline) is utilized that is domain-independent, automated, and unsupervised.
  • the text processing pipeline leverages both intra-document and corpus-wide (i.e., inter-document) word occurrence statistics for high-quality key phrase extraction.
  • Automatic feature representation is performed using the extracted key phrases and dimensionality reduction is applied that minimizes the distance between near-similar documents in the vector space.
  • the text processing pipeline assigns either a hierarchical category or a topic identified by topical key words.
  • the free-form text processing achieved by the embodiments is: (i) domain-independent: the technique is applicable to a diverse set of free-form text data; (ii) automated: automatically extracted key phrases and their corresponding ranking scores are used as features for each document, and there is no requirement of any human intervention (for data cleaning, for example) and/or any domain expertise (for feature engineering, for example); (iii) unsupervised: all stages in the text processing pipeline are unsupervised and do not require any annotated or labeled training data (thereby eliminating the need to obtain such training data, which in many cases is difficult to reliably obtain); (iv) leveraging both intra-document and corpus-wide occurrence statistics; and (v) improving decision making: the text processing pipeline is able to decide whether a text document is categorical or not based on the density of the clusters generated by a hierarchical clustering of the text documents.
  • the text processing pipeline assigns a categorical label to a text document if it is a part of a very dense cluster, otherwise, if the cluster density is below a threshold value, the text processing pipeline assigns a topic to the text document which is identified by a set of topical key words.
  • a combination of intra-document and corpus-wide co-occurrence statistics is leveraged to identify key phrases of variable length.
  • a single-document key phrase extraction technique is applied to identify possible key phrases from each text document, based solely on the intra-document word co-occurrence statistics.
  • a list of common stop words, lexical markers and punctuations is provided as phrase delimiters.
  • based on such phrase delimiters, a set of candidate phrases is generated.
  • the candidate phrases are then split, illustratively, into words for constructing the word co-occurrence matrix.
  • the candidate key phrases are then scored and ranked using this single document word co-occurrence matrix.
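  • for illustration only, the following is a minimal Python sketch of the candidate-phrase generation just described; the abbreviated stop-word list and the tokenization are assumptions, not part of the described pipeline:

```python
import re

# Hypothetical, abbreviated delimiter list; an actual pipeline would use a
# full stop-word list (e.g., NLTK's) plus lexical markers and punctuation.
STOP_WORDS = {"the", "a", "an", "of", "to", "is", "and", "in", "on", "for", "not"}

def candidate_phrases(text):
    """Split free-form text into candidate key phrases, treating stop words
    and punctuation as phrase delimiters."""
    tokens = re.findall(r"[a-z0-9']+|[^\w\s]", text.lower())
    phrases, current = [], []
    for tok in tokens:
        if tok in STOP_WORDS or not tok[0].isalnum():
            if current:                      # a delimiter closes the phrase
                phrases.append(" ".join(current))
                current = []
        else:
            current.append(tok)
    if current:
        phrases.append(" ".join(current))
    return phrases

print(candidate_phrases("Local team reported a service outage on the boot screen."))
# -> ['local team reported', 'service outage', 'boot screen']
```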
  • each text document is represented as a sparse, high-dimensional vector using the extracted relevant key phrases and their respective ranking scores as features, and a dimensionality reduction is performed using a locality-sensitive hashing technique.
  • an agglomerative hierarchical clustering technique is applied to cluster similar text documents to facilitate the assignment of both topics and sub-topics to enhance the clustering results, and the density of the generated clusters is measured. If the cluster density is higher than a pre-defined threshold value, the text documents are annotated as belonging to that cluster, and the cluster is considered to be categorical. Otherwise, the cluster is not considered categorical; instead, the top-K most frequent phrases from all the documents within the cluster are selected as representatives of the topic for the cluster, and the text documents are annotated as being part of this topic.
  • FIG. 1 shows a flowchart of illustrative operations for automated domain-independent free-form text processing in accordance with an embodiment;
  • FIG. 2 shows a plate notation model for a Latent Dirichlet Allocation (LDA) variant in accordance with an embodiment;
  • FIG. 3 shows a flowchart of illustrative operations for word generation for text documents in accordance with an embodiment
  • FIGs. 4A-4I show an illustrative scenario and exemplary results in applying automated domain-independent free-form text processing in a technical service ticket/customer care domain in accordance with an embodiment; and
  • FIG. 5 is a high-level block diagram of an exemplary computer in accordance with an embodiment.
  • a method and apparatus are provided for improved free-form text processing that is domain agnostic and automatically identifies key topics for textual data which does not require prior training or supervision, pre-labelling or annotation of the data, or domain expertise.
  • a text processing pipeline is utilized that is domain-independent, automated, and unsupervised.
  • the text processing pipeline leverages both intra-document and corpus-wide word occurrence statistics for high-quality key phrase extraction. Automatic feature representation is performed using the extracted key phrases and dimensionality reduction is provided that minimizes the distance between near-similar documents in the vector space.
  • FIG. 1 shows a flowchart of illustrative operations 100 for automated domain-independent free-form text processing in accordance with an embodiment.
  • the processing of the free-form text occurs, illustratively, in a multi-stage text processing pipeline which will now be discussed in more detail.
  • the textual data input that is to be processed is received, presented, or otherwise provided (e.g., textual free-form, unstructured and semi-structured data from customer care tickets, surveys, social media, machine logs, alarm and alerting systems, and diagnostics, to name just a few input types).
  • the textual data input may be in the form of one or more so-called text documents (e.g., a technical/customer care service ticket).
  • key-phrase extraction is applied to each text document as each document is essentially a collection of words.
  • RAKE (Rapid Automatic Keyword Extraction) is applied to identify and extract key phrases from each text document, at step 115.
  • Steps 110 and 115 illustratively comprise a single-document key-phrase extraction stage (i.e., intra-document processing; Stage 1).
  • a list of common stop words, lexical markers and punctuations is provided as phrase delimiters.
  • a set of candidate key phrases is generated using any number of well-known automated keyword extraction techniques such as RAKE. For example, in a customer care application these key phrases might be "service outage", "technician dispatched" or "instruction manual not adhered to", to name just a few.
  • the candidate key phrases are then split, illustratively, into words for constructing the so-called word co-occurrence matrix.
  • RAKE uses two measures, the frequency and the degree of a word, to compute the score of a candidate key phrase.
  • the degree of a word is measured as the sum of its frequency in the document and the number of its co-occurrences with other words across all candidate key phrases of the text document. If a candidate phrase has multiple words, the scores of all the words are summed up.
  • the candidate phrases are ranked in descending order of score, and some percentage thereof (illustratively, the top seventy percent (70%)) is selected as the extracted key phrases; a sketch of this scoring and ranking appears below.
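  • the following minimal Python sketch illustrates the frequency/degree scoring just described; the deg(w)/freq(w) word score is taken from the original RAKE formulation and is an assumption here, since the text above does not fix the exact combination:

```python
from collections import defaultdict

def rake_scores(candidate_phrases):
    """Score candidate phrases: freq(w) counts occurrences of w, and deg(w)
    adds freq(w) to w's co-occurrences with other words inside phrases."""
    freq, degree = defaultdict(int), defaultdict(int)
    for phrase in candidate_phrases:
        words = phrase.split()
        for w in words:
            freq[w] += 1
            degree[w] += len(words) - 1     # co-occurrences within this phrase
    for w in freq:
        degree[w] += freq[w]                # deg(w) = co-occurrences + freq(w)
    # A multi-word phrase's score is the sum of its word scores deg(w)/freq(w).
    scored = {p: sum(degree[w] / freq[w] for w in p.split())
              for p in candidate_phrases}
    return sorted(scored.items(), key=lambda kv: -kv[1])

ranked = rake_scores(["service outage", "boot screen", "local team", "service ticket"])
top_70_percent = ranked[: max(1, int(0.7 * len(ranked)))]   # extracted key phrases
```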
  • the extracted key phrases from Stage 1 (i.e., the intra-document processing) are passed to Stage 2 (i.e., the inter-document processing); these key words and key phrases are akin to a generated "dictionary" for further use in the automated processing of the free-text input documents.
  • a word distribution for each topic and a topic distribution for the documents are obtained. Illustratively, these are obtained using a variant of Latent Dirichlet Allocation (LDA).
  • Steps 125-130 represent a topical key word ranking stage (i.e., Stage 2).
  • LDA is a well- known generative probabilistic model for collections of discrete data such as text corpora.
  • LDA is a three-level hierarchical Bayesian model, in which each item of a collection is modeled as a finite mixture over an underlying set of topics.
  • in FIG. 2, a variant of LDA is shown (as specially configured for the embodiments herein) in plate notation model 200. With plate notation, the dependencies among the many variables can be captured concisely. The boxes are so-called "plates" representing replicates. The outer plate represents documents, while the inner plate represents the repeated choice of topics and words (e.g., word w_{m,n} 255) within a document.
  • M 235 denotes the number of documents and N 240 the number of words in a document, in accordance with the embodiment shown.
  • the domain specific jargons are assumed to have a Dirichlet prior φ_D 205 parameterized by β 210;
  • the topic distribution τ 215 is assumed to have a Dirichlet prior parameterized by α 220;
  • π 250 is the Dirichlet prior with parameter γ 220 for choosing between topic words and domain specific jargons;
  • for each topic, a word distribution is assumed with a Dirichlet prior φ_k parameterized by β;
  • z_{d_m} 225 denotes the topic of the m-th text document in the corpus, and y_{m,n} 230 is a binary variable indicating whether the n-th word of the m-th document is a topic word or a domain specific jargon.
  • plate notation model 200 is a specially configured LDA variant for use in the embodiments herein. Additionally, there are certain domain specific jargons which are not associated with a specific topic. However, such words cannot be removed from the text as they bear significant semantic meaning. In the LDA variant embodiment shown in FIG. 2, an assumption is made that each word in each text document is either a domain specific jargon or a topic word.
  • FIG. 3 is a flowchart of operations 300 detailing the word generation process for each text document.
  • the domain specific jargons are assumed to have a Dirichlet prior φ_D 245 parameterized by β 210;
  • the topic distribution τ is assumed to have a Dirichlet prior parameterized by α;
  • π is the Dirichlet prior with parameter γ for choosing between topic words and domain specific jargons;
  • the Dirichlet prior φ_D 245 is chosen;
  • at steps 320 and 330, for each topic k, a word distribution with a Dirichlet prior φ_k parameterized by β is assumed;
  • Step 330 provides details of the generation process of each document d_m;
  • K is the number of possible topics;
  • the topic z_{d_m} of a document is sampled from the categorical distribution of topics with parameter τ, illustratively Categorical(τ);
  • for each word w_{m,n}, a binary random variable y_{m,n} is sampled from the Bernoulli distribution parameterized by π. If the value of the random variable y_{m,n} is 0, then the word w_{m,n} is identified to be a domain specific jargon and, hence, is sampled from the categorical distribution of domain specific jargons with the Dirichlet prior φ_D;
  • otherwise, the word w_{m,n} is identified to be a topical word for the previously chosen topic of the document and is sampled from the categorical distribution of the chosen topic with the Dirichlet prior φ_k. (A short generative sketch follows.)
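  • to make the generative process above concrete, the following Python (NumPy) sketch samples a toy corpus under this model; the dimensions and hyperparameter values are illustrative assumptions, and a Beta draw for π stands in for the two-component Dirichlet prior described above:

```python
import numpy as np

rng = np.random.default_rng(0)
K, V, M, N = 5, 1000, 100, 20              # topics, vocabulary, documents, words/doc
alpha, beta, gamma = 0.1, 0.01, 1.0        # illustrative prior hyperparameters

tau = rng.dirichlet([alpha] * K)           # topic distribution tau
phi_D = rng.dirichlet([beta] * V)          # word distribution of domain jargon, phi_D
phi_k = rng.dirichlet([beta] * V, size=K)  # per-topic word distributions phi_k
pi = rng.beta(gamma, gamma)                # choice probability pi; Beta assumed here,
                                           # i.e. a 2-component Dirichlet with parameter gamma

corpus = []
for m in range(M):
    z = rng.choice(K, p=tau)               # topic z_{d_m} of the m-th document
    words = []
    for n in range(N):
        y = rng.binomial(1, pi)            # y_{m,n}: 0 -> domain jargon, 1 -> topic word
        dist = phi_k[z] if y == 1 else phi_D
        words.append(rng.choice(V, p=dist))  # sample word w_{m,n}
    corpus.append(words)
```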
  • topical key phrase ranking is performed at step 135 (i.e., Stage 3).
  • the topical key word ranking obtained in the topical key word ranking stage (Stage 2) is used to rank the candidate key phrases generated in Stage 1.
  • Any automatic key phrase extraction technique may be utilized at step 135; illustratively, the technique described in Liu, Zhiyuan, Wenyi Huang, Yabin Zheng, and Maosong Sun, 2010, "Automatic Keyphrase Extraction via Topic Decomposition", EMNLP. (A minimal re-ranking sketch follows.)
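  • as a sketch of this step, candidate phrases can be re-ranked by aggregating the topical word scores of their constituent words; summing the word scores is an assumption in the spirit of the referenced approach, and the scores dict below is hypothetical:

```python
def rank_phrases_by_topical_words(candidate_phrases, topical_word_score):
    """Rank Stage 1 candidate phrases using Stage 2 topical word scores;
    words absent from the topical ranking contribute zero."""
    def phrase_score(phrase):
        return sum(topical_word_score.get(w, 0.0) for w in phrase.split())
    return sorted(candidate_phrases, key=phrase_score, reverse=True)

scores = {"disk": 0.9, "system": 0.7, "array": 0.4}   # hypothetical Stage 2 output
print(rank_phrases_by_topical_words(["disk array", "boot screen", "system"], scores))
# -> ['disk array', 'system', 'boot screen']
```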
  • each text document is represented as a sparse, high-dimensional vector (i.e., a collection of features) using the extracted key phrases and their respective ranking scores as features, and at step 145 (i.e., Stage 5) dimensionality reduction is performed using a locality-sensitive hashing technique (for example, the so-called "SimHash" technique described in Charikar, Moses S., 2002, "Similarity Estimation Techniques from Rounding Algorithms", Proceedings of the 34th Annual ACM Symposium on Theory of Computing, ACM, 380-388).
  • step 145 may be optional; however, the compression it yields improves overall speed performance. (A compact SimHash sketch is shown below.)
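  • the following is a compact sketch of SimHash-style reduction, using 32 output bits to match the example of FIG. 4G; the MD5-based bit source is an illustrative stand-in for whatever hash family an implementation would choose:

```python
import hashlib

def simhash(features, bits=32):
    """Collapse weighted features (phrase -> score) into a bits-wide fingerprint;
    each feature votes on every bit position, weighted by its score."""
    vote = [0.0] * bits
    for phrase, weight in features.items():
        digest = int(hashlib.md5(phrase.encode()).hexdigest(), 16)
        for i in range(bits):
            vote[i] += weight if (digest >> i) & 1 else -weight
    return sum(1 << i for i in range(bits) if vote[i] > 0)

doc_a = simhash({"disk array": 2.5, "boot screen": 1.2})
doc_b = simhash({"disk array": 2.4, "boot screen": 1.0})
hamming = bin(doc_a ^ doc_b).count("1")   # near-similar documents -> small distance
```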
  • an agglomerative hierarchical clustering technique, such as SciPy's hierarchical clustering (scipy.cluster.hierarchy), is applied to cluster similar text documents. In this way, hierarchical clustering enables the automated assignment of both topics and sub-topics to enhance the clustering results, such that the documents are clustered with each cluster being assigned specific key words and/or key phrases.
  • at step 155, categories (and/or groups) of the text documents are generated by topic, facilitated by measuring the density of the clusters generated in Stage 6. If the cluster density is higher than a pre-defined threshold value, the text documents are annotated as belonging to that cluster, and the cluster is considered to be categorical. Otherwise, the cluster is not considered categorical; instead, the top-K most frequent phrases from all the documents within the cluster are selected as representatives of the topic for the cluster, and the text documents are annotated as being part of this topic. (A combined sketch of Stages 6-7 follows.)
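  • a combined sketch of the clustering and the category-versus-topic decision might look as follows; scipy.cluster.hierarchy is the standard SciPy API, while the density measure and both thresholds are illustrative assumptions, since the text does not fix a specific formula:

```python
from collections import Counter

import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

def categorize(vectors, doc_phrases, distance_threshold=0.5,
               density_threshold=0.8, top_k=10):
    """Cluster document fingerprints; label dense clusters as categories and
    sparse clusters by their top-K most frequent phrases (topics)."""
    Z = linkage(vectors, method="average", metric="hamming")
    labels = fcluster(Z, t=distance_threshold, criterion="distance")
    results = {}
    for c in set(labels):
        idx = np.where(labels == c)[0]
        members = vectors[idx]
        # Illustrative density: inverse of the mean spread around the centroid.
        spread = float(np.mean(np.abs(members - members.mean(axis=0))))
        density = 1.0 / (1.0 + spread)
        if density >= density_threshold:
            results[c] = ("category", f"cluster-{c}")   # categorical label
        else:
            phrases = Counter(p for i in idx for p in doc_phrases[i])
            results[c] = ("topic", [p for p, _ in phrases.most_common(top_k)])
    return labels, results
```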
  • the embodiments facilitate free-form text processing that is automated, domain-independent, topic-independent, and document-size-independent, at a high quality level.
  • the embodiments herein may be utilized for a wide variety of use cases. Some examples include, but are not limited to: (i) categorization of unstructured machine log messages; (ii) understanding groups and main topics of problems faced by customers as described in free-form text in customer care tickets; (iii) revealing groups of topics with positive, neutral and negative sentiments from free-form customer survey verbatims; and (iv) preparing unstructured textual information by either categorizing it, or extracting topical key phrases to be used in lieu of the full text, as part of the pre-processing steps of automated machine learning.
  • FIGs. 4A-4I show an illustrative scenario and exemplary results in applying automated domain-independent free-form text processing in a technical service ticket/customer care domain in accordance with an embodiment.
  • FIG. 4A shows service ticket 400, which is textual data input (see step 105 of FIG. 1) in accordance with the embodiment.
  • Service ticket 400 has original text 405, which is representative of a textual excerpt from an illustrative network operations trouble ticket (i.e., ticket i in a set of n tickets, 1 ... n).
  • FIG. 4C shows such results for service ticket 400, where a plurality of candidate key phrases 425 have been extracted, which include the following exemplary key phrases: "procedure integrating emc vnx storage array" 425-1, "boot screen" 425-2, and "local team" 425-3, to highlight just a few.
  • FIG. 4D shows results 430 from the completion of Stage 2 processing (as described herein above; see FIG. 1 ) for topical key word 430-1 and normalized ranking 430-2 having a plurality of key words 430-3 by normalized rank.
  • the top ranked key phrases may be identified in Stage 3 (as described herein above; see FIG. 1), with FIG. 4E showing results 435 for the top twelve (12) ranked key phrases (i.e., ranked key phrase 435-1 through 435-11 in descending order) out of a total of 79 key phrases for ticket 400 (i.e., ticket i).
  • feature vector 440 for ticket 400 (i.e., ticket i) has 10,852 dimensions (i.e., dimension 440-1 through 440-DM), each dimension representing a respective one of the 10,852 key phrases.
  • FIG. 4F shows the high-dimensional representation (i.e., the feature vector) of ticket 400 (i.e., ticket i); the shaded boxes in FIG. 4F indicate key phrases that are found in ticket 400.
  • This feature vector is sparse, wastes memory, and increases computational requirements. As such, in accordance with the embodiments herein, dimensionality reduction is applied in Stage 5 (as previously described herein above; see FIG. 1 ).
  • results 445 shown in FIG. 4G are a compact feature representation for ticket 400 after dimensionality reduction is applied in accordance with the embodiment.
  • the previous 10,852 dimensions (i.e., dimension 440-1 through 440-DM) have been significantly reduced to a set of thirty-two (32) dimensions (i.e., dimension 445-1 through 445-DM), each dimension represented by bit 445-1 and value 445-3.
  • FIG. 4H shows results 450, which are the clusters (i.e., cluster 450-1 through 450-12) resulting from using hierarchical agglomerative clustering in Stage 6 (as described herein above; see FIG. 1).
  • ticket 400 is in cluster 450-10.
  • Stage 7 (as described herein above; see FIG. 1 ) processing results in a category or topic assignment.
  • the density of cluster 450-10 was found to be lower than the pre-defined threshold value.
  • results 460 show no category (i.e., category 460-1) being assigned and topic assignment 465 having the assignment of topics 465-1, 465-2, 465-3, 465-4, 465-5, 465-6, 465-7, 465-8, 465-9, and 465-10 for the cluster in which ticket 400 resides (i.e., cluster 450-10).
  • the text processing pipeline automatically assigns either a hierarchical category or a topic identified by topical key words (for example, "disk" or "system" as shown in results 460).
  • FIG. 5 is a high-level block diagram of an exemplary computer 500 that may be used for implementing a method for domain-independent automated processing of free-form text in accordance with the various embodiments herein.
  • Computer 500 comprises a processor 510 operatively coupled to a data storage device 520 and a memory 530.
  • Processor 510 controls the overall operation of computer 500 by executing computer program instructions that define such operations.
  • Communications bus 560 facilitates the coupling and communication between the various components of computer 500.
  • computer 500 may be any type of computing device such as a computer, tablet, server, mobile device, or smart phone, to name just a few.
  • the computer program instructions may be stored in data storage device 520, or a non-transitory computer readable medium, and loaded into memory 530 when execution of the computer program instructions is desired.
  • the steps of the disclosed method can be defined by the computer program instructions stored in memory 530 and/or data storage device 520 and controlled by processor 510 executing the computer program instructions.
  • the computer program instructions can be implemented as computer executable code programmed by one skilled in the art to perform the illustrative operations defined by the disclosed method.
  • processor 510 executes an algorithm defined by the disclosed method.
  • Computer 500 also includes one or more communication interfaces 550 for communicating with other devices via a network (e.g., a wireless communications network) or communications protocol (e.g., Bluetooth®).
  • Computer 500 also includes one or more input/output devices 540 that enable user interaction with the user device (e.g., camera, display, keyboard, mouse, speakers, microphone, buttons, etc.).
  • Processor 510 may include both general and special purpose microprocessors, and may be the sole processor or one of multiple processors of computer 500.
  • Processor 510 may comprise one or more central processing units (CPUs), for example.
  • Processor 510, data storage device 520, and/or memory 530 may include, be supplemented by, or incorporated in, one or more application-specific integrated circuits (ASICs) and/or one or more field programmable gate arrays (FPGAs).
  • Data storage device 520 and memory 530 each comprise a tangible non-transitory computer readable storage medium.
  • Data storage device 520, and memory 530 may each include high-speed random access memory, such as dynamic random access memory (DRAM), static random access memory (SRAM), double data rate synchronous dynamic random access memory (DDR RAM), or other random access solid state memory devices, and may include non-volatile memory, such as one or more magnetic disk storage devices such as internal hard disks and removable disks, magneto-optical disk storage devices, optical disk storage devices, flash memory devices, semiconductor memory devices, such as erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), compact disc read-only memory (CD-ROM), digital versatile disc read-only memory (DVD-ROM) disks, or other non-volatile solid state storage devices.
  • Input/output devices 540 may include peripherals, such as a camera, printer, scanner, display screen, etc.
  • input/output devices 540 may include a display device such as a cathode ray tube (CRT), plasma or liquid crystal display (LCD) monitor for displaying information to the user, a keyboard, and a pointing device such as a mouse or a trackball by which the user can provide input to computer 500.
  • any flowcharts, flow diagrams, state transition diagrams, pseudo code, program code and the like represent various processes which may be substantially represented in computer readable medium and so executed by a computer, machine or processor, whether or not such computer, machine or processor is explicitly shown.
  • One skilled in the art will recognize that an implementation of an actual computer or computer system may have other structures and may contain other components as well, and that a high level representation of some of the components of such a computer is for illustrative purposes.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A method and apparatus are provided for improved free-form text processing that is domain-agnostic and automatically identifies key topics for textual data, without requiring prior training or supervision, pre-labelling or annotation of the data, or domain expertise. A multi-stage process (i.e., a text processing pipeline) is utilized that is domain-independent, automated, and unsupervised. Given a set of free-form text documents, the text processing pipeline leverages both intra-document and corpus-wide (i.e., inter-document) word occurrence statistics for high-quality key phrase extraction. Automatic feature representation is performed using the extracted key phrases, and dimensionality reduction is applied to minimize the distance between near-similar documents in the vector space. For each text document, the text processing pipeline assigns either a hierarchical category or a topic identified by topical key words.
PCT/US2017/068911 2017-12-29 2017-12-29 Domain-independent automated processing of free-form text WO2020131004A1 (fr)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/US2017/068911 WO2020131004A1 (fr) 2017-12-29 2017-12-29 Domain-independent automated processing of free-form text

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/US2017/068911 WO2020131004A1 (fr) 2017-12-29 2017-12-29 Domain-independent automated processing of free-form text

Publications (1)

Publication Number Publication Date
WO2020131004A1 true WO2020131004A1 (fr) 2020-06-25

Family

ID=71102291

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2017/068911 WO2020131004A1 (fr) 2017-12-29 2017-12-29 Domain-independent automated processing of free-form text

Country Status (1)

Country Link
WO (1) WO2020131004A1 (fr)

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20020131644A1 (en) * 2001-01-31 2002-09-19 Hiroaki Takebe Pattern recognition apparatus and method using probability density function
US20140304267A1 (en) * 2008-05-07 2014-10-09 City University Of Hong Kong Suffix tree similarity measure for document clustering
US8131735B2 (en) * 2009-07-02 2012-03-06 Battelle Memorial Institute Rapid automatic keyword extraction for information retrieval and analysis
US20130311505A1 (en) * 2011-08-31 2013-11-21 Daniel A. McCallum Methods and Apparatus for Automated Keyword Refinement
WO2016085409A1 (fr) * 2014-11-24 2016-06-02 Agency For Science, Technology And Research Procédé et système de classification de sentiments et de classification d'émotions
US20160350404A1 (en) * 2015-05-29 2016-12-01 Intel Corporation Technologies for dynamic automated content discovery
US20170351676A1 (en) * 2016-06-02 2017-12-07 International Business Machines Corporation Sentiment normalization using personality characteristics
US20170364587A1 (en) * 2016-06-20 2017-12-21 International Business Machines Corporation System and Method for Automatic, Unsupervised Contextualized Content Summarization of Single and Multiple Documents

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112464656A (zh) * 2020-11-30 2021-03-09 科大讯飞股份有限公司 Keyword extraction method and apparatus, electronic device, and storage medium
CN112464656B (zh) * 2020-11-30 2024-02-13 中国科学技术大学 Keyword extraction method and apparatus, electronic device, and storage medium
CN113157908A (zh) * 2021-03-22 2021-07-23 北京邮电大学 Text visualization method for displaying hot sub-topics on social media
CN115630160A (zh) * 2022-12-08 2023-01-20 四川大学 Dispute focus clustering method and system based on a semi-supervised co-occurrence graph model
CN115630160B (zh) * 2022-12-08 2023-07-07 四川大学 Dispute focus clustering method and system based on a semi-supervised co-occurrence graph model
CN116522901A (zh) * 2023-06-29 2023-08-01 金锐同创(北京)科技股份有限公司 Method, apparatus, device and medium for analyzing attention information of an IT community
CN116522901B (zh) * 2023-06-29 2023-09-15 金锐同创(北京)科技股份有限公司 Method, apparatus, device and medium for analyzing attention information of an IT community

Similar Documents

Publication Publication Date Title
US11475319B2 (en) Extracting facts from unstructured information
US8108413B2 (en) Method and apparatus for automatically discovering features in free form heterogeneous data
Sawyer et al. Shallow knowledge as an aid to deep understanding in early phase requirements engineering
Inzalkar et al. A survey on text mining-techniques and application
JP2023052502A (ja) Systems and methods for rapidly building, managing, and sharing machine learning models
US8630989B2 (en) Systems and methods for information extraction using contextual pattern discovery
WO2020131004A1 (fr) Traitement automatisé indépendant du domaine de texte en forme libre
JP2015505082A (ja) Generation of a natural language processing model for an information domain
US9460071B2 (en) Rule development for natural language processing of text
US20170109358A1 (en) Method and system of determining enterprise content specific taxonomies and surrogate tags
WO2022126944A1 Text clustering method, electronic device, and storage medium
Forman et al. Pragmatic text mining: minimizing human effort to quantify many issues in call logs
CN111353045B Method for constructing a text classification system
Sangodiah et al. Taxonomy Based Features in Question Classification Using Support Vector Machine.
Bollegala et al. ClassiNet--Predicting missing features for short-text classification
Alsaidi et al. English poems categorization using text mining and rough set theory
CN114239828A Causality-based method for constructing a supply chain event logic graph
Li et al. Tagdeeprec: tag recommendation for software information sites using attention-based bi-lstm
Sharma et al. Bug Report Triaging Using Textual, Categorical and Contextual Features Using Latent Dirichlet Allocation
Uskenbayeva et al. Creation of Data Classification System for Local Administration
Punitha et al. Partition document clustering using ontology approach
Makinist et al. Preparation of improved Turkish dataset for sentiment analysis in social media
Mishra et al. Fault log text classification using natural language processing and machine learning for decision support
Violos et al. Clustering documents using the 3-gram graph representation model
Bhowmik et al. Domain-independent automated processing of free-form text data in telecom

Legal Events

Date Code Title Description
NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 17936996

Country of ref document: EP

Kind code of ref document: A1