WO2020131004A1 - Domain-independent automated processing of free-form text - Google Patents
Domain-independent automated processing of free-form text
- Publication number
- WO2020131004A1 (PCT/US2017/068911)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- documents
- candidate key
- phrase
- textual input
- document
- Prior art date
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V30/00—Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
- G06V30/40—Document-oriented image-based pattern recognition
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/23—Clustering techniques
- G06F18/231—Hierarchical techniques, i.e. dividing or merging pattern sets so as to obtain a dendrogram
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/28—Determining representative reference patterns, e.g. by averaging or distorting; Generating dictionaries
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V30/00—Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
- G06V30/10—Character recognition
- G06V30/24—Character recognition characterised by the processing or recognition method
- G06V30/242—Division of the character sequences into groups prior to recognition; Selection of dictionaries
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V30/00—Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
- G06V30/10—Character recognition
- G06V30/26—Techniques for post-processing, e.g. correcting the recognition result
- G06V30/262—Techniques for post-processing, e.g. correcting the recognition result using context analysis, e.g. lexical, syntactic or semantic context
- G06V30/274—Syntactic or semantic context, e.g. balancing
Definitions
- The present invention relates generally to data analytics and, more particularly, to a text processing pipeline that is domain-independent, automated, and unsupervised.
- Prior text processing and categorization techniques can be broadly classified into two categories: rule-based approaches and semantic/linguistic-property-based approaches.
- Rule-based approaches require significant domain knowledge to extract key information from text.
- A regular-expression-based pattern extraction technique for log analysis was proposed in "Semi-automatic Discovery of Extraction Patterns for Log Analysis" (white paper by Carasso, David, 2007, available from Splunk, Inc. at www.splunk.com), where the natural language processing (NLP) employs the Natural Language Toolkit (NLTK) to support part-of-speech (POS) pattern matching using regular expressions, as sketched below.
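The following is a minimal sketch of such POS pattern matching with NLTK's RegexpParser; the noun-phrase grammar, the sample sentence, and the required NLTK data packages ('punkt', 'averaged_perceptron_tagger') are illustrative assumptions, not details from the cited white paper.

```python
import nltk

# Illustrative input; any free-form log or ticket text could be used.
text = "Unexpected service outage reported on the storage array"
tokens = nltk.word_tokenize(text)   # requires the 'punkt' data package
tagged = nltk.pos_tag(tokens)       # requires 'averaged_perceptron_tagger'

# Hypothetical POS pattern: optional adjectives followed by one or more nouns.
grammar = "NP: {<JJ>*<NN.*>+}"
parser = nltk.RegexpParser(grammar)
tree = parser.parse(tagged)

# Print every matched noun phrase.
for subtree in tree.subtrees(filter=lambda t: t.label() == "NP"):
    print(" ".join(word for word, tag in subtree.leaves()))
```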
- Categorization of unstructured text data depends heavily on the extraction of keywords or phrases that capture the key concept of the text.
- Automatic key-phrase extraction is an established line of approaches that can be broadly categorized into two classes: supervised and unsupervised.
- For supervised approaches, see: Riloff, Ellen, and Wendy Lehnert, 1994, "Information Extraction as a Basis for High-Precision Text Classification", ACM Transactions on Information Systems (TOIS), 296-333; and Turney, Peter D.
- Unsupervised approaches for key phrase extraction can be categorized into several groups (see Hasan, Kazi Saidul, and Vincent Ng, 2014, "Automatic Keyphrase Extraction: A Survey of the State of the Art", Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, Baltimore, Maryland: Association for Computational Linguistics, 1262-1273).
- In one group, a ranking of candidate phrases is obtained based on intra-document word co-occurrence statistics (see Mihalcea, Rada, and Paul Tarau, 2004, "TextRank: Bringing Order into Text", EMNLP, 404-411), which uses a graph-based ranking algorithm to assign ranking scores to candidate key phrases, as sketched below.
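A minimal sketch of such a graph-based ranking, assuming a sliding-window co-occurrence graph and NetworkX's PageRank implementation; the window size and tokenization are illustrative, not the exact TextRank configuration:

```python
import networkx as nx

# Build an undirected word co-occurrence graph over a sliding window and
# rank the words with PageRank, in the spirit of TextRank.
def rank_words(tokens, window=2):
    graph = nx.Graph()
    for i, w in enumerate(tokens):
        for u in tokens[i + 1:i + 1 + window]:
            if u != w:
                graph.add_edge(w, u)
    return nx.pagerank(graph)  # {word: ranking score}

tokens = "boot screen error storage array error boot array".split()
scores = rank_words(tokens)
print(sorted(scores, key=scores.get, reverse=True))
```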
- Topical key phrase ranking is more suitable for longer text documents, where each document is typically a mixture of multiple topics. For short text documents that usually cover a single topic (e.g., a customer care ticket describes one particular problem; one line of a machine log reports one particular event), the quality of the extracted key phrases is very poor. Another drawback of topical key phrase ranking is that less-represented topics in the corpus are difficult to discover.
- A method and apparatus are provided for improved free-form text processing that is domain-agnostic and automatically identifies key topics for textual data, without requiring prior training or supervision, pre-labelling or annotation of the data, or domain expertise.
- A multi-stage process (i.e., a text processing pipeline) is utilized that is domain-independent, automated, and unsupervised.
- The text processing pipeline leverages both intra-document and corpus-wide (i.e., inter-document) word occurrence statistics for high-quality key phrase extraction.
- Automatic feature representation is performed using the extracted key phrases, and dimensionality reduction is applied that minimizes the distance between near-similar documents in the vector space.
- For each text document, the text processing pipeline assigns either a hierarchical category or a topic identified by topical key words.
- The free-form text processing achieved by the embodiments is: (i) domain-independent: the technique is applicable to a diverse set of free-form text data; (ii) automated: automatically extracted key phrases and their corresponding ranking scores are used as features for each document, and there is no requirement of any human intervention (for data cleaning, for example) and/or any domain expertise (for feature engineering, for example); (iii) unsupervised: all stages in the text processing pipeline are unsupervised and do not require any annotated or labeled training data (thereby eliminating the need to obtain such training data, which in many cases is difficult to reliably obtain); (iv) leveraging both intra-document and corpus-wide occurrence statistics; and (v) improving decision making: the text processing pipeline is able to decide whether a text document is categorical or not based on the density of the clusters generated by a hierarchical clustering of the text documents.
- The text processing pipeline assigns a categorical label to a text document if it is part of a very dense cluster; otherwise, if the cluster density is below a threshold value, the text processing pipeline assigns a topic to the text document, identified by a set of topical key words.
- A combination of intra-document and corpus-wide co-occurrence statistics is leveraged to identify key phrases of variable length.
- A single-document key phrase extraction technique is applied to identify possible key phrases from each text document solely based on the intra-document word co-occurrence statistics.
- A list of common stop words, lexical markers, and punctuation is provided as phrase delimiters.
- Based on such phrase delimiters, a set of candidate phrases is generated.
- The candidate phrases are then split, illustratively, into words for constructing the word co-occurrence matrix.
- The candidate key phrases are then scored and ranked using this single-document word co-occurrence matrix.
- Each text document is represented as a sparse, high-dimensional vector using the extracted relevant key phrases and their respective ranking scores as features, and a dimensionality reduction is performed using a locality-sensitive hashing technique.
- An agglomerative hierarchical clustering technique is applied to cluster similar text documents to facilitate the assignment of both topics and sub-topics to enhance the clustering results, and a measure of the density of the generated clusters is made. If the cluster density is higher than a pre-defined threshold value, the text documents are annotated as belonging to that cluster, and the cluster is considered to be categorical. Otherwise, the cluster is not considered categorical. Instead, the top-K most frequent phrases from all the documents within the cluster are selected as representatives of the topic for the cluster, and the text documents are annotated as part of this topic.
- FIG. 1 shows a flowchart of illustrative operations for automated domain-independent free-form text processing in accordance with an embodiment;
- FIG. 2 shows a plate notation model for a Latent Dirichlet Allocation (LDA) variant in accordance with an embodiment;
- FIG. 3 shows a flowchart of illustrative operations for word generation for text documents in accordance with an embodiment;
- FIGS. 4A-4I show an illustrative scenario and exemplary results in applying automated domain-independent free-form text processing in a technical service ticket/customer care domain in accordance with an embodiment; and
- FIG. 5 is a high-level block diagram of an exemplary computer in accordance with an embodiment.
- A method and apparatus are provided for improved free-form text processing that is domain-agnostic and automatically identifies key topics for textual data, without requiring prior training or supervision, pre-labelling or annotation of the data, or domain expertise.
- A text processing pipeline is utilized that is domain-independent, automated, and unsupervised.
- The text processing pipeline leverages both intra-document and corpus-wide word occurrence statistics for high-quality key phrase extraction. Automatic feature representation is performed using the extracted key phrases, and dimensionality reduction is provided that minimizes the distance between near-similar documents in the vector space.
- FIG. 1 shows a flowchart of illustrative operations 100 for automated domain-independent free-form text processing in accordance with an embodiment.
- The processing of the free-form text occurs, illustratively, in a multi-stage text processing pipeline, which will now be discussed in more detail.
- At step 105, the textual data input that is to be processed is received, presented, or otherwise provided (e.g., textual free-form, unstructured, and semi-structured data from customer care tickets, surveys, social media, machine logs, alarm and alerting systems, and diagnostics, to name just a few input types).
- The textual data input may be in the form of one or more so-called text documents (e.g., a technical/customer care service ticket).
- Key-phrase extraction is applied to each text document, as each document is essentially a collection of words.
- Rapid Automatic Keyword Extraction (RAKE) is applied, at step 115, to identify the extracted key phrases from each text document.
- Steps 110 and 115 illustratively comprise a single-document key-phrase extraction stage (i.e., intra-document processing; Stage 1).
- A list of common stop words, lexical markers, and punctuation is provided as phrase delimiters.
- A set of candidate key phrases is generated using any number of well-known automated keyword extraction techniques such as RAKE. For example, in a customer care application these key phrases might be "service outage", "technician dispatched", or "instruction manual not adhered to", to name just a few.
- The candidate key phrases are then split, illustratively, into words for constructing the so-called word co-occurrence matrix.
- RAKE uses two measures, the frequency and the degree of a word, to compute the score of a candidate key phrase.
- The degree of a word is measured as the sum of its frequency in the document and the number of its co-occurrences with other words across all candidate key phrases of the text document. If a candidate phrase has multiple words, the scores of all the words are summed up.
- A descending-order ranking is obtained for the candidate phrases, and some percentage thereof (illustratively, the top seventy percent (70%)) is selected as the extracted key phrases. This candidate generation and scoring is sketched below.
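A minimal sketch of the candidate generation and scoring just described; the stop-word list is illustrative, and the per-word degree/frequency score follows the published RAKE heuristic rather than any formula claimed here.

```python
import re
from collections import defaultdict

STOP_WORDS = {"a", "an", "and", "is", "not", "of", "on", "the", "to"}  # illustrative

def rake_key_phrases(document, top_fraction=0.7):
    candidates = []
    # Punctuation delimits phrase runs; stop words split runs into candidates.
    for chunk in re.split(r"[.,;:!?()\n]", document.lower()):
        phrase = []
        for word in re.findall(r"[a-z']+", chunk):
            if word in STOP_WORDS:
                if phrase:
                    candidates.append(tuple(phrase))
                phrase = []
            else:
                phrase.append(word)
        if phrase:
            candidates.append(tuple(phrase))

    # Word frequency, and degree (co-occurrence within candidate phrases).
    freq, degree = defaultdict(int), defaultdict(int)
    for cand in candidates:
        for word in cand:
            freq[word] += 1
            degree[word] += len(cand)

    # Phrase score = sum of member word scores (degree/frequency heuristic).
    scores = {c: sum(degree[w] / freq[w] for w in c) for c in set(candidates)}
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[: max(1, int(len(ranked) * top_fraction))]  # top 70% by default

print(rake_key_phrases("The boot screen is not shown; the local team restarted the storage array."))
```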
- After Stage 1 (i.e., the intra-document processing), the extracted key phrases are passed to Stage 2 (i.e., the inter-document processing).
- These key words and key phrases are akin to a generated "dictionary" for further use in the automated processing of the free-text input documents.
- A word distribution of topics and a topic distribution of the documents are obtained; illustratively, these are obtained using a variant of Latent Dirichlet Allocation (LDA).
- Steps 125-130 represent a topical key word ranking stage (i.e., Stage 2).
- LDA is a well-known generative probabilistic model for collections of discrete data such as text corpora.
- LDA is a three-level hierarchical Bayesian model, in which each item of a collection is modeled as a finite mixture over an underlying set of topics.
- In FIG. 2, a variant of LDA is shown (as specially configured for the embodiments herein) in plate notation model 200. With plate notation, the dependencies among the many variables can be captured concisely. The boxes are so-called "plates" representing replicates. The outer plate represents documents, while the inner plate represents the repeated choice of topics and words (e.g., word w_m,n 255) within a document.
- M 235 denotes the number of documents and N 240 the number of words in a document, in accordance with the embodiment shown.
- The domain-specific jargons are assumed to have a Dirichlet prior φ_D 205 parameterized by β 210.
- The topic distribution τ 215 is assumed to have a Dirichlet prior parameterized by α 220.
- π 250 is the Dirichlet prior with parameter γ 220 for choosing between topic words and domain-specific jargons.
- For each topic k, a word distribution is assumed with a Dirichlet prior φ_k parameterized by β.
- z_dm 225 denotes the topic of the m-th text document in the corpus, and y_m,n 230 is a binary variable indicating whether the n-th word of the m-th document is a topic word or a domain-specific jargon.
- Plate notation model 200 is a specially configured LDA variant for use in the embodiments herein. Additionally, there are certain domain-specific jargons which are not associated with a specific topic. However, such words cannot be removed from the text, as they bear significant semantic meaning. In the LDA variant embodiment shown in FIG. 2, an assumption is made that each word in each text document is either a domain-specific jargon or a topic word.
- FIG. 3 is a flowchart of operations 300 detailing the word generation process for each text document.
- As above, the domain-specific jargons are assumed to have a Dirichlet prior φ_D 245 parameterized by β 210, the topic distribution τ is assumed to have a Dirichlet prior parameterized by α, and π is the Dirichlet prior with parameter γ for choosing between topic words and domain-specific jargons.
- The Dirichlet prior φ_D 245 is chosen.
- At steps 320 and 330, for each topic k, a word distribution is assumed with a Dirichlet prior φ_k parameterized by β.
- Step 330 provides details of the generation process of each document d_m.
- K is the number of possible topics.
- The topic of a document is drawn from the categorical distribution of topics with parameter τ, illustratively, Categorical(τ).
- For each word w_m,n, a binary random variable y_m,n is sampled from the Bernoulli distribution parameterized by π. If the value of the random variable y_m,n is 0, then the word w_m,n is identified to be a domain-specific jargon and, hence, is sampled from the categorical distribution of domain-specific jargons with the Dirichlet prior φ_D.
- Otherwise, the word w_m,n is identified to be a topical word for the previously chosen topic of the document and is sampled from the categorical distribution of the chosen topic with the Dirichlet prior φ_k. A generative-process sketch follows.
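The generative process just described can be summarized in the following sketch; the hyperparameter values, vocabulary size, and corpus dimensions are illustrative assumptions, not values from the embodiments.

```python
import numpy as np

rng = np.random.default_rng(0)
V, K, M, N = 1000, 10, 5, 20             # vocabulary, topics, documents, words/doc
alpha, beta, gamma = 0.1, 0.01, 1.0      # illustrative hyperparameters

tau = rng.dirichlet(alpha * np.ones(K))    # topic distribution tau ~ Dir(alpha)
phi_D = rng.dirichlet(beta * np.ones(V))   # domain-jargon word distribution phi_D
phi = rng.dirichlet(beta * np.ones(V), K)  # per-topic word distributions phi_k
pi = rng.dirichlet(gamma * np.ones(2))     # pi ~ Dir(gamma): P(jargon) vs P(topic word)

docs = []
for m in range(M):
    z_m = rng.choice(K, p=tau)             # one topic per (short) document
    words = []
    for n in range(N):
        y_mn = rng.choice(2, p=pi)         # Bernoulli choice: 0 = jargon, 1 = topic word
        dist = phi_D if y_mn == 0 else phi[z_m]
        words.append(rng.choice(V, p=dist))  # sample the word index w_m,n
    docs.append(words)
```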
- Topical key phrase ranking is performed at step 135 (i.e., Stage 3).
- The topical key word ranking obtained in the topical word ranking stage is used to obtain a ranking of the candidate key phrases generated in Stage 1, as sketched below.
- Any automatic key phrase extraction technique may be utilized at step 135, illustratively, the technique described in Liu, Zhiyuan, Wenyi Huang, Yabin Zheng, and Maosong Sun, 2010.
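A minimal sketch of this Stage-3 step under a simple assumption: each Stage-1 candidate phrase is scored by summing the normalized topical word ranks of its constituent words. The actual technique in the cited work is more involved; this only illustrates how the two stages connect.

```python
def rank_phrases(candidate_phrases, topical_word_rank):
    # Sum the (normalized) topical ranks of each phrase's words.
    scores = {
        phrase: sum(topical_word_rank.get(word, 0.0) for word in phrase.split())
        for phrase in candidate_phrases
    }
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical Stage-2 output and Stage-1 candidates.
topical_rank = {"storage": 0.9, "array": 0.7, "boot": 0.4, "screen": 0.3}
print(rank_phrases(["storage array", "boot screen", "local team"], topical_rank))
```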
- In the next stage (i.e., Stage 4), each text document is represented as a sparse, high-dimensional vector (i.e., a collection of features) using the extracted key phrases and their respective ranking scores as features, and at step 145 (i.e., Stage 5) dimensionality reduction is performed using a locality-sensitive hashing technique (for example, the so-called "SimHash" technique described in Charikar, Moses S., 2002, "Similarity Estimation Techniques from Rounding Algorithms", Proceedings of the 34th Annual ACM Symposium on Theory of Computing, ACM, 380-388).
- Step 145 may be optional, but the resulting compression improves overall speed performance. A minimal SimHash sketch follows.
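A minimal SimHash-style sketch in the spirit of Charikar (2002): each weighted key-phrase feature votes on the bits of a fixed-width fingerprint, so near-similar documents land at small Hamming distance. The 32-bit width mirrors FIG. 4G; the underlying hash function and sample features are illustrative choices.

```python
import hashlib

def simhash(features, bits=32):
    votes = [0.0] * bits
    for phrase, weight in features.items():
        # 32-bit hash of the feature name, derived from MD5 for illustration.
        h = int.from_bytes(hashlib.md5(phrase.encode()).digest()[:4], "big")
        for i in range(bits):
            votes[i] += weight if (h >> i) & 1 else -weight
    return sum(1 << i for i in range(bits) if votes[i] > 0)

# Hypothetical key-phrase features (phrase -> ranking score) for two tickets.
doc_a = {"storage array": 2.5, "boot screen": 1.2, "local team": 0.6}
doc_b = {"storage array": 2.4, "local team": 0.8}

# Near-similar documents differ in only a few fingerprint bits.
print(bin(simhash(doc_a) ^ simhash(doc_b)).count("1"))
```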
- In Stage 6, an agglomerative hierarchical clustering technique (such as that provided by SciPy's scipy.cluster.hierarchy module) is applied to cluster similar text documents. In this way, hierarchical clustering enables the automated assignment of both topics and sub-topics to enhance the clustering results, such that the documents are clustered with each cluster being assigned specific key words and/or key phrases.
- At step 155, there is a generation of categories (and/or groups) of the text documents by topic that is facilitated by measuring the density of the clusters generated in Stage 6. If the cluster density is higher than a pre-defined threshold value, the text documents are annotated as belonging to that cluster, and the cluster is considered to be categorical. Otherwise, the cluster is not considered categorical. Instead, the top-K most frequent phrases from all the documents within the cluster are selected as representatives of the topic for the cluster, and the text documents are annotated as part of this topic. These two stages are sketched below.
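A minimal sketch of Stages 6 and 7 combined, assuming 32-bit fingerprints as input; the density measure (one minus the mean pairwise Hamming distance) and the threshold value are illustrative assumptions, as the embodiments do not fix a particular formula.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(40, 32))            # hypothetical 32-bit fingerprints

# Agglomerative (hierarchical) clustering with average linkage.
Z = linkage(X, method="average", metric="hamming")
labels = fcluster(Z, t=8, criterion="maxclust")  # cut the dendrogram into 8 clusters

DENSITY_THRESHOLD = 0.8                          # illustrative pre-defined threshold
for c in np.unique(labels):
    members = X[labels == c]
    if len(members) < 2:
        continue
    density = 1.0 - pdist(members, metric="hamming").mean()
    if density >= DENSITY_THRESHOLD:
        print(f"cluster {c}: categorical (density {density:.2f})")
    else:
        print(f"cluster {c}: topic cluster; assign top-K frequent phrases")
```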
- Advantageously, the embodiments facilitate free-form text processing that is automated, domain-independent, topic-independent, and document-size-independent, at a high quality level.
- The embodiments herein may be utilized for a wide variety of use cases. Some examples include, but are not limited to: (i) categorization of unstructured machine log messages; (ii) understanding groups and main topics of problems faced by customers as described in free-form text in customer care tickets; (iii) revealing groups of topics with positive, neutral, and negative sentiments from free-form customer survey verbatims; and (iv) preparing unstructured textual information by either categorizing it or extracting topical key phrases to be used in lieu of the full text as part of the pre-processing steps of automated machine learning.
- FIGS. 4A-4I show an illustrative scenario and exemplary results in applying automated domain-independent free-form text processing in a technical service ticket/customer care domain in accordance with an embodiment.
- FIG. 4A shows service ticket 400, which is textual data input (see step 105 of FIG. 1) in accordance with the embodiment.
- Service ticket 400 has original text 405, which is representative of a textual excerpt from an illustrative network operations trouble ticket (i.e., ticket i in a set of n tickets, 1...n).
- FIG. 4C shows such results for service ticket 400, where a plurality of candidate key phrases 425 have been extracted, including the following exemplary key phrases: "procedure integrating emc vnx storage array" 425-1, "boot screen" 425-2, and "local team" 425-3, to highlight just a few.
- FIG. 4D shows results 430 from the completion of Stage 2 processing (as described hereinabove; see FIG. 1), showing topical key word 430-1 and normalized ranking 430-2 for a plurality of key words 430-3 by normalized rank.
- The top-ranked key phrases may be identified in Stage 3 (as described hereinabove; see FIG. 1), with FIG. 4E showing results 435 for the top twelve (12) ranked key phrases (i.e., ranked key phrases 435-1 through 435-11 in descending order) out of a total of 79 key phrases for ticket 400 (i.e., ticket i).
- As shown in FIG. 4F, feature vector 440 for ticket 400 (i.e., ticket i) has 10,852 dimensions (i.e., dimension 440-1 through 440-DM), each dimension representing a respective one of the 10,852 key phrases.
- FIG. 4F thus shows the high-dimensional representation of ticket 400; the shaded boxes in FIG. 4F indicate key phrases that are found in ticket 400.
- This feature vector is sparse, wastes memory, and increases computational requirements. As such, in accordance with the embodiments herein, dimensionality reduction is applied in Stage 5 (as previously described herein above; see FIG. 1 ).
- Results 445 shown in FIG. 4G are a compact feature representation for ticket 400 after dimensionality reduction is applied in accordance with the embodiment.
- The previous 10,852 dimensions (i.e., dimension 440-1 through 440-DM) have been significantly reduced to a set of thirty-two (32) dimensions (i.e., dimension 445-1 through 445-DM), each dimension represented by bit 445-1 and value 445-3.
- FIG. 4H shows results 450, which are the clusters (i.e., cluster 450-1 through 450-12) resulting from using hierarchical agglomerative clustering in Stage 6 (as described hereinabove; see FIG. 1).
- Ticket 400 is in cluster 450-10.
- Stage 7 (as described hereinabove; see FIG. 1) processing results in a category or topic assignment.
- Here, the density of cluster 450-10 was found to be lower than the pre-defined threshold value.
- As such, results 460 show no category (i.e., category 460-1) being assigned, and topic assignment 465 shows the assignment of topics 465-1, 465-2, 465-3, 465-4, 465-5, 465-6, 465-7, 465-8, 465-9, and 465-10 for the cluster in which ticket 400 resides (i.e., cluster 450-10).
- In this way, the text processing pipeline automatically assigns either a hierarchical category or a topic identified by topical key words (for example, "disk" or "system" as shown in results 460).
- FIG. 5 is a high-level block diagram of an exemplary computer 500 that may be used for implementing a method for domain-independent automated processing of free-form text in accordance with the various embodiments herein.
- Computer 500 comprises a processor 510 operatively coupled to a data storage device 520 and a memory 530.
- Processor 510 controls the overall operation of computer 500 by executing computer program instructions that define such operations.
- Communications bus 560 facilitates the coupling and communication between the various components of computer 500.
- Computer 500 may be any type of computing device, such as a computer, tablet, server, mobile device, or smart phone, to name just a few.
- The computer program instructions may be stored in data storage device 520, or a non-transitory computer readable medium, and loaded into memory 530 when execution of the computer program instructions is desired.
- The steps of the disclosed method can be defined by the computer program instructions stored in memory 530 and/or data storage device 520 and controlled by processor 510 executing the computer program instructions.
- The computer program instructions can be implemented as computer executable code programmed by one skilled in the art to perform the illustrative operations defined by the disclosed method.
- Accordingly, processor 510 executes an algorithm defined by the disclosed method.
- Computer 500 also includes one or more communication interfaces 550 for communicating with other devices via a network (e.g., a wireless communications network) or communications protocol (e.g., Bluetooth®).
- Computer 500 also includes one or more input/output devices 540 that enable user interaction with the user device (e.g., camera, display, keyboard, mouse, speakers, microphone, buttons, etc.).
- Processor 510 may include both general and special purpose microprocessors, and may be the sole processor or one of multiple processors of computer 500.
- Processor 510 may comprise one or more central processing units (CPUs), for example.
- Processor 510, data storage device 520, and/or memory 530 may include, be supplemented by, or incorporated in, one or more application-specific integrated circuits (ASICs) and/or one or more field programmable gate arrays (FPGAs).
- Data storage device 520 and memory 530 each comprise a tangible non-transitory computer readable storage medium.
- Data storage device 520, and memory 530 may each include high-speed random access memory, such as dynamic random access memory (DRAM), static random access memory (SRAM), double data rate synchronous dynamic random access memory (DDR RAM), or other random access solid state memory devices, and may include non-volatile memory, such as one or more magnetic disk storage devices such as internal hard disks and removable disks, magneto-optical disk storage devices, optical disk storage devices, flash memory devices, semiconductor memory devices, such as erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), compact disc read-only memory (CD-ROM), digital versatile disc read-only memory (DVD-ROM) disks, or other non-volatile solid state storage devices.
- Input/output devices 540 may include peripherals, such as a camera, printer, scanner, display screen, etc.
- input/output devices 540 may include a display device such as a cathode ray tube (CRT), plasma or liquid crystal display (LCD) monitor for displaying information to the user, a keyboard, and a pointing device such as a mouse or a trackball by which the user can provide input to computer 500.
- Any flowcharts, flow diagrams, state transition diagrams, pseudo code, program code, and the like represent various processes which may be substantially represented in computer readable medium and so executed by a computer, machine, or processor, whether or not such computer, machine, or processor is explicitly shown.
- One skilled in the art will recognize that an implementation of an actual computer or computer system may have other structures and may contain other components as well, and that a high level representation of some of the components of such a computer is for illustrative purposes.
Abstract
A method and apparatus are provided for improved free-form text processing that is domain-agnostic and automatically identifies key topics for textual data, without requiring prior training or supervision, pre-labelling or annotation of the data, or domain expertise. A multi-stage process (i.e., a text processing pipeline) is utilized that is domain-independent, automated, and unsupervised. Given a set of free-form text documents, the text processing pipeline leverages both intra-document and corpus-wide (i.e., inter-document) word occurrence statistics for high-quality key phrase extraction. Automatic feature representation is performed using the extracted key phrases, and dimensionality reduction is applied to minimize the distance between near-similar documents in the vector space. For each text document, the text processing pipeline assigns either a hierarchical category or a topic identified by topical key words.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/US2017/068911 WO2020131004A1 (fr) | 2017-12-29 | 2017-12-29 | Domain-independent automated processing of free-form text
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/US2017/068911 WO2020131004A1 (fr) | 2017-12-29 | 2017-12-29 | Domain-independent automated processing of free-form text
Publications (1)
Publication Number | Publication Date |
---|---|
WO2020131004A1 (fr) | 2020-06-25 |
Family
ID=71102291
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/US2017/068911 WO2020131004A1 (fr) | 2017-12-29 | 2017-12-29 | Domain-independent automated processing of free-form text
Country Status (1)
Country | Link |
---|---|
WO (1) | WO2020131004A1 (fr) |
- 2017-12-29: WO PCT/US2017/068911 filed as WO2020131004A1 (active Application Filing)
Patent Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20020131644A1 (en) * | 2001-01-31 | 2002-09-19 | Hiroaki Takebe | Pattern recognition apparatus and method using probability density function |
US20140304267A1 (en) * | 2008-05-07 | 2014-10-09 | City University Of Hong Kong | Suffix tree similarity measure for document clustering |
US8131735B2 (en) * | 2009-07-02 | 2012-03-06 | Battelle Memorial Institute | Rapid automatic keyword extraction for information retrieval and analysis |
US20130311505A1 (en) * | 2011-08-31 | 2013-11-21 | Daniel A. McCallum | Methods and Apparatus for Automated Keyword Refinement |
- WO2016085409A1 (fr) * | 2014-11-24 | 2016-06-02 | Agency For Science, Technology And Research | Method and system for sentiment classification and emotion classification |
US20160350404A1 (en) * | 2015-05-29 | 2016-12-01 | Intel Corporation | Technologies for dynamic automated content discovery |
US20170351676A1 (en) * | 2016-06-02 | 2017-12-07 | International Business Machines Corporation | Sentiment normalization using personality characteristics |
US20170364587A1 (en) * | 2016-06-20 | 2017-12-21 | International Business Machines Corporation | System and Method for Automatic, Unsupervised Contextualized Content Summarization of Single and Multiple Documents |
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
- CN112464656A (zh) * | 2020-11-30 | 2021-03-09 | 科大讯飞股份有限公司 | Keyword extraction method and apparatus, electronic device, and storage medium |
- CN112464656B (zh) * | 2020-11-30 | 2024-02-13 | 中国科学技术大学 | Keyword extraction method and apparatus, electronic device, and storage medium |
- CN113157908A (zh) * | 2021-03-22 | 2021-07-23 | 北京邮电大学 | Text visualization method for displaying hot sub-topics of social media |
- CN115630160A (zh) * | 2022-12-08 | 2023-01-20 | 四川大学 | Dispute focus clustering method and system based on a semi-supervised co-occurrence graph model |
- CN115630160B (zh) * | 2022-12-08 | 2023-07-07 | 四川大学 | Dispute focus clustering method and system based on a semi-supervised co-occurrence graph model |
- CN116522901A (zh) * | 2023-06-29 | 2023-08-01 | 金锐同创(北京)科技股份有限公司 | Method, apparatus, device, and medium for analyzing attention information of an IT community |
- CN116522901B (zh) * | 2023-06-29 | 2023-09-15 | 金锐同创(北京)科技股份有限公司 | Method, apparatus, device, and medium for analyzing attention information of an IT community |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US11475319B2 (en) | Extracting facts from unstructured information | |
US8108413B2 (en) | Method and apparatus for automatically discovering features in free form heterogeneous data | |
Sawyer et al. | Shallow knowledge as an aid to deep understanding in early phase requirements engineering | |
Inzalkar et al. | A survey on text mining-techniques and application | |
- JP2023052502A (ja) | Systems and methods for rapidly building, managing, and sharing machine learning models | |
US8630989B2 (en) | Systems and methods for information extraction using contextual pattern discovery | |
- WO2020131004A1 (fr) | Domain-independent automated processing of free-form text | |
- JP2015505082A (ja) | Generation of natural language processing models for an information domain | |
US9460071B2 (en) | Rule development for natural language processing of text | |
US20170109358A1 (en) | Method and system of determining enterprise content specific taxonomies and surrogate tags | |
- WO2022126944A1 (fr) | Text clustering method, electronic device, and storage medium | |
Forman et al. | Pragmatic text mining: minimizing human effort to quantify many issues in call logs | |
CN111353045B (zh) | 构建文本分类体系的方法 | |
Sangodiah et al. | Taxonomy Based Features in Question Classification Using Support Vector Machine. | |
Bollegala et al. | ClassiNet--Predicting missing features for short-text classification | |
Alsaidi et al. | English poems categorization using text mining and rough set theory | |
CN114239828A (zh) | 一种基于因果关系的供应链事理图谱构建方法 | |
Li et al. | Tagdeeprec: tag recommendation for software information sites using attention-based bi-lstm | |
Sharma et al. | Bug Report Triaging Using Textual, Categorical and Contextual Features Using Latent Dirichlet Allocation | |
Uskenbayeva et al. | Creation of Data Classification System for Local Administration | |
Punitha et al. | Partition document clustering using ontology approach | |
Makinist et al. | Preparation of improved Turkish dataset for sentiment analysis in social media | |
Mishra et al. | Fault log text classification using natural language processing and machine learning for decision support | |
Violos et al. | Clustering documents using the 3-gram graph representation model | |
Bhowmik et al. | Domain-independent automated processing of free-form text data in telecom |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
NENP | Non-entry into the national phase |
Ref country code: DE |
|
122 | Ep: pct application non-entry in european phase |
Ref document number: 17936996 Country of ref document: EP Kind code of ref document: A1 |