WO2019232645A1 - Unsupervised classification of documents using a labeled data set of other documents - Google Patents
Unsupervised classification of documents using a labeled data set of other documents
- Publication number
- WO2019232645A1 (application PCT/CA2019/050806)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- grouping
- subject document
- documents
- matching
- data
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/088—Non-supervised learning, e.g. competitive learning
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/50—Information retrieval; Database structures therefor; File system structures therefor of still image data
- G06F16/55—Clustering; Classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
Definitions
- the present invention relates to the classification of documents. More specifically, the present invention relates to systems and methods for unsupervised classification of documents using other previously classified documents.
- unsupervised classification referring to document classification that is not overseen by a human.
- Such techniques are typically computationally expensive and time-intensive.
- Current state-of-the-art techniques for unsupervised classification require access to the entire dataset at all times, meaning that enormous datasets may need to be searched for each operation.
- the present invention provides systems and methods for associating an unknown subject document with other documents based on known features of the other documents.
- the subject document is passed through a feature extraction module, which represents the features of the subject document as a numeric vector having n dimensions.
- a matching module receives that vector and reference data.
- the reference data is pre-divided into n groupings, with each grouping corresponding to at least one specific feature.
- the matching module compares the features of the subject document to features of the reference data and determines a matching grouping for the subject document.
- the subject document is then associated with that matching grouping.
- the present invention provides a method for determining other documents to be associated with a subject document, the method comprising: (a) passing said subject document through a feature extraction module to thereby produce a numeric vector representation of features of said subject document, said vector representation having n dimensions;
- said n-dimensional space contains a plurality of reference points, wherein each of said other documents corresponds to a single one of said plurality of reference points, and wherein said plurality of reference points is divided into a plurality of groupings, each grouping corresponding to at least one specific feature of said other documents;
- the present invention provides a system for determining other documents to be associated with a subject document, the system comprising: a feature extraction module for producing a numeric vector representation of features of said subject document;
- reference data comprising numeric vectors, wherein each of said other documents corresponds to a single one of said numeric vectors, and wherein said reference data is grouped into a plurality of groupings, each grouping corresponding to at least one specific feature of said other documents; a matching module for determining a matching grouping from said plurality of groupings for said subject document, said matching grouping being determined based on at least one predetermined criterion,
- the present invention provides non-transitory computer-readable media having stored thereon computer-readable and computer-executable instructions that, when executed, implement a method for determining other documents to be associated with a subject document, the method comprising:
- said n-dimensional space contains a plurality of reference points, wherein each of said other documents corresponds to a single one of said plurality of reference points, and wherein said plurality of reference points is divided into a plurality of groupings, each grouping corresponding to at least one specific feature of said other documents;
- Figure 1 is a schematic diagram detailing one aspect of the present invention.
- Figure 2A is a representative image that may be used by the present invention, in some embodiments.
- Figure 2B is another representative image that may be used by the present invention, in some embodiments.
- Figure 2C is another representative image that may be used by the present invention, in some embodiments.
- Figure 3 is a flowchart detailing the steps in a method according to another aspect of the present invention.
- Referring to FIG. 1, a schematic diagram illustrating one aspect of the present invention is presented.
- a subject document 20 is passed through a feature extraction module 30.
- the feature extraction module 30 is associated with a matching module 50.
- Reference data 40 is also associated with the matching module 50.
- the feature extraction module 30 extracts features of the subject document 20 and produces a numeric vector representation of the subject document 20 based on those features, with the vector having n dimensions. The feature extraction module 30 then passes that vector representation to a matching module 50.
- the matching module 50 also receives reference data 40, representing other
- the reference data 40 is previously divided into groupings, with each grouping corresponding to at least one specific feature of the documents within that grouping.
- the matching module 50 compares the vector representation of the features of the subject document 20 to features of the reference data 40 to determine a matching grouping for the subject document 20.
- the subject document 20 is then associated with that matching grouping.
- a neural network can be used as the feature extraction module
- Such a neural network would be trained to extract specific features that the user wishes to identify. Note, therefore, that the features to be extracted may vary depending on context. For instance, a large set of news articles broadly grouped as “sports news” may be classified using keywords such as “hockey”, “basketball”, and “soccer”. As another example, where the present invention is used to classify images, important features and themes may include “face” or “tree”.
- the identified features may be thought of as “keywords”, “subjects”, “topics”, “themes”, “subthemes”, “aspects”, or any equivalent term suitable for the context. References herein to any of such terms should be taken to include all such terms.
- the subject document 20 can be any kind of document with any number of dimensions.
- one-dimensional “documents” may include text, time series data, and/or sounds.
- two-dimensional documents may comprise natural images, spectrograms, and/or satellite images.
- Subject documents in three dimensions may include videos and/or medical imaging volumes.
- four-dimensional subject documents may include videos of medical imaging volumes, as well as video game data.
- the neural network, or other feature extraction module 30, outputs a numeric vector representation of the subject document.
- Each coordinate within that numeric vector representation corresponds to one of the possible features.
- each coordinate can be a numeric value indicating the probability that the subject document has that specific feature.
- the coordinates may be bounded between, for instance, 0 and 1, or 0 and 100. In other implementations, however, each coordinate may reflect a non-probabilistic correspondence to its associated feature.
- a subject article discussing a hockey game might be represented as a numeric vector such as [0.8, 0.1, 0.1], in a three-dimensional space defined by the coordinate system [hockey, basketball, soccer]. This vector suggests that there is an 80% chance that the article relates to hockey and only a 10% chance each that it relates to basketball or to soccer.
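To make the coordinate convention concrete, the following minimal sketch pairs each coordinate of the example vector with its feature name and reads off the most probable feature. The names `FEATURES` and `describe` are illustrative assumptions and do not appear in the source.

```python
# Illustrative only: the coordinate system and vector come from the hockey
# example; the function and variable names are hypothetical.
FEATURES = ["hockey", "basketball", "soccer"]


def describe(vector, features=FEATURES):
    """Pair each coordinate with its feature; return (best_feature, mapping)."""
    if len(vector) != len(features):
        raise ValueError("vector must have exactly one coordinate per feature")
    mapping = dict(zip(features, vector))
    best = max(mapping, key=mapping.get)
    return best, mapping


best, mapping = describe([0.8, 0.1, 0.1])
print(best, mapping)
```

Keeping the feature names next to the coordinates makes it harder to confuse the coordinate order when vectors are later compared with reference data.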
- outlier documents in the data set are sent to a human
- outlier documents are documents that do not match well with any known features.
- the system will provide an outlier document and its closest feature matches to the human reviewer, who can select a better feature match if necessary.
- the results of this human review can be fed back to the system and incorporated into later classifications.
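The outlier routing described above might be sketched as follows. This assumes probability-style coordinates and a simple cut-off on the strongest coordinate; the threshold value, the function name `review_candidates`, and the `top_k` parameter are hypothetical and are not specified in the source.

```python
# Hypothetical cut-off: a document whose strongest feature score falls below
# this value is treated as an outlier. The source does not give a threshold.
OUTLIER_THRESHOLD = 0.5


def review_candidates(vector, features, threshold=OUTLIER_THRESHOLD, top_k=2):
    """Return None when the document matches some feature well.

    Otherwise return the top_k closest feature matches, strongest first,
    so that they can be shown to a human reviewer.
    """
    if max(vector) >= threshold:
        return None
    ranked = sorted(zip(features, vector), key=lambda fv: fv[1], reverse=True)
    return ranked[:top_k]


features = ["hockey", "basketball", "soccer"]
confident = review_candidates([0.8, 0.1, 0.1], features)
uncertain = review_candidates([0.35, 0.4, 0.25], features)
print(confident, uncertain)
```

The reviewer's choice could then be stored as a new labeled reference point, which is one plausible way to feed the review back into later classifications.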
- the feature extraction module 30 can be initially trained on a similar or higher-level classification problem than the classification problem to be solved.
- the reference data points 40 are already populated and grouped when they are passed to the feature extraction module 30.
- the numeric vector is then passed to a matching module 50.
- the matching module 50 compares the vector to a pre-existing set of reference data 40.
- the reference data points are based on reference documents with known features.
- the reference data points 40 received by the matching module 50 are divided into a plurality of groupings, such that each grouping corresponds to at least one specific feature. In the “sports articles” example above, reference data would be divided into three or more groupings, including “hockey”, “basketball”, and “soccer”.
- “approximate nearest neighbour algorithms” can be used to divide the reference data into groupings.
- Various approximate nearest neighbour algorithms are well-known in computer science and data analysis (see, for instance, Indyk & Motwani, “Approximate Nearest Neighbors: Towards Removing the Curse of Dimensionality”, Proceedings of the thirtieth annual ACM symposium on Theory of Computing (1998), the complete text of which is hereby incorporated by reference).
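As one illustration of the approximate nearest neighbour family, the sketch below uses random-hyperplane locality-sensitive hashing: vectors that fall on the same side of every random hyperplane share a bucket, and a query is compared only against its own bucket. This technique is swapped in for illustration and is not necessarily the algorithm the source intends; all function names and parameters are assumptions.

```python
import math
import random
from collections import defaultdict


def make_planes(n_planes, dim, seed=0):
    """Draw n_planes random hyperplane normals (seeded for repeatability)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_planes)]


def signature(vec, planes):
    """One boolean per hyperplane: which side of the plane the vector lies on."""
    return tuple(sum(p * v for p, v in zip(plane, vec)) > 0 for plane in planes)


def build_index(points, planes):
    """Group point indices into buckets keyed by their signature."""
    buckets = defaultdict(list)
    for idx, point in enumerate(points):
        buckets[signature(point, planes)].append(idx)
    return buckets


def approx_nearest(vec, points, planes, buckets):
    """Index of the closest point within vec's bucket, or None if it is empty."""
    candidates = buckets.get(signature(vec, planes), [])
    if not candidates:
        return None
    return min(candidates, key=lambda i: math.dist(vec, points[i]))


points = [[0.8, 0.1, 0.1], [0.1, 0.8, 0.1], [0.1, 0.1, 0.8]]
planes = make_planes(n_planes=4, dim=3)
buckets = build_index(points, planes)
print(approx_nearest(points[1], points, planes, buckets))
```

Because these hyperplanes pass through the origin, an implementation working with all-positive probability vectors would likely centre the data first so that the buckets separate the groupings more evenly.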
- the matching module 50 may consider multiple factors when determining a
- the matching module 50 can consider the distance between the subject document’s numeric vector representation and the centroids of reference groupings.
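A minimal sketch of centroid-based matching follows, assuming Euclidean distance and a small hand-made reference set. The reference vectors and the names `centroid` and `match_grouping` are illustrative assumptions; the source does not fix a particular distance measure.

```python
import math


def centroid(points):
    """Coordinate-wise mean of a list of equal-length vectors."""
    n = len(points)
    return [sum(coords) / n for coords in zip(*points)]


def match_grouping(vector, groupings):
    """Return the name of the grouping whose centroid is nearest to vector."""
    centroids = {name: centroid(pts) for name, pts in groupings.items()}
    return min(centroids, key=lambda name: math.dist(vector, centroids[name]))


# Hypothetical reference data in the [hockey, basketball, soccer] space.
reference = {
    "hockey": [[0.9, 0.05, 0.05], [0.7, 0.2, 0.1]],
    "basketball": [[0.1, 0.8, 0.1], [0.2, 0.7, 0.1]],
    "soccer": [[0.1, 0.1, 0.8], [0.05, 0.15, 0.8]],
}
matched = match_grouping([0.8, 0.1, 0.1], reference)
print(matched)
```

Since only one centroid per grouping is compared, the cost of a match grows with the number of groupings rather than with the number of reference documents.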
- the matching module 50 may also consider factors beyond the distance to grouping centroids. Such other factors include, for example, publication dates or date ranges: more recent news articles dealing with sports or politics may be grouped separately from older articles on the same topics. Other factors (such as, for instance, author, region, publication, etc.) may also be used by the matching module 50 in determining a matching grouping for the subject document, according to the context. Any variable that is continuously available in the data set (i.e., separately available for each document in the data set) can be used to modify or weight the results of the feature extraction module.
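One way such a variable could weight the match is sketched below, using document age as the extra factor. The exponential half-life decay, the scoring formula, and every name here are assumptions chosen for illustration; the source does not prescribe a weighting scheme.

```python
import math
from collections import defaultdict

HALF_LIFE_DAYS = 30.0  # hypothetical: a reference's weight halves every 30 days


def recency_weight(age_days, half_life=HALF_LIFE_DAYS):
    """Exponential decay: 1.0 for a brand-new document, 0.5 after one half-life."""
    return 0.5 ** (age_days / half_life)


def weighted_match(vector, references, k=3):
    """Vote among the k nearest references, favouring close and recent ones.

    references is a list of (vector, grouping, age_days) tuples.
    """
    nearest = sorted(references, key=lambda r: math.dist(vector, r[0]))[:k]
    votes = defaultdict(float)
    for ref_vec, grouping, age_days in nearest:
        closeness = 1.0 / (1.0 + math.dist(vector, ref_vec))
        votes[grouping] += recency_weight(age_days) * closeness
    return max(votes, key=votes.get)


# Hypothetical reference points: one old archive article, two recent ones.
references = [
    ([0.8, 0.1, 0.1], "hockey-archive", 365),
    ([0.75, 0.15, 0.1], "hockey-recent", 2),
    ([0.7, 0.2, 0.1], "hockey-recent", 5),
    ([0.1, 0.8, 0.1], "basketball", 1),
]
result = weighted_match([0.8, 0.1, 0.1], references)
print(result)
```

In this toy data the year-old article is the closest point, yet the recent grouping wins because the age factor shrinks the old article's vote to almost nothing.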
- the present invention preferably uses a reference data set containing around 1,000 reference data points.
- the present invention results in significant time-savings compared to manual and/or typical text-analysis document classification methods, and additionally provides greater classification accuracy than the prior methods.
- each “theme”, “subtheme”, and “region” was a separate extracted feature. It is predicted that the present invention could classify a set of 10,000,000 subject documents within 100 milliseconds.
- a further advantage of the present invention over prior art methods arises when the present invention is used to classify images.
- Artificially intelligent image classification is typically performed in a pixel-space. That is, typical machine classifiers for images produce a numeric vector representation wherein the vector coordinates correspond to pixels, or pixel regions, of each image.
- pixel-space representations can misassociate images that are substantively similar but positionally distinct.
- images can be generalized to any kind of multi-dimensional input data.
- Figure 2A shows a stylized face in the top left of the image, and nothing in the bottom right.
- Figure 2B shows a circle in the same location as Figure 2A’s stylized face, but shows a pair of triangles within the circle, rather than that face.
- Figure 2C, lastly, has the same stylized face as Figure 2A, but here the face is shown in the bottom right of the image, and the top left of the image is empty. Because typical pixel-space classifiers merely consider positional information, a typical classifier would conclude that Figures 2A and 2B are more similar to each other than are Figures 2A and 2C, notwithstanding the visibly distinct subject matter.
- the present invention compares images based on substance rather than pixel density.
- the vector representing Figure 2A would then have comparatively high values in all three coordinates, as Figure 2A has a circle, (stylized) eyes and a (stylized) mouth.
- Figure 2C, the new subject document in this example, would then be passed through the feature extraction module.
- the vector representation of Figure 2C, like that of Figure 2A, would have comparatively high values for all three features.
- when the matching module receives that vector and the reference vectors, the matching module would determine that the vector representation of Figure 2C is more similar to that of Figure 2A than to that of Figure 2B, and would thus associate Figure 2C with Figure 2A.
- Both images containing a face would be grouped together in the “face” grouping, and only Figure 2B would remain in the “not face” grouping.
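The contrast between pixel-space and feature-space comparison in the Figure 2 example can be reproduced with toy data. In the sketch below the bitmaps are hand-drawn stand-ins for Figures 2A to 2C, and the feature vectors over [circle, eyes, mouth] are hand-assigned rather than produced by a trained extractor; all names and values are assumptions.

```python
import math

# 7x7 stylized patches (1 = ink): a ring containing a face, and the same
# ring containing two small triangles.
FACE = [
    [1, 1, 1, 1, 1, 1, 1],
    [1, 0, 0, 0, 0, 0, 1],
    [1, 0, 1, 0, 1, 0, 1],
    [1, 0, 0, 0, 0, 0, 1],
    [1, 0, 1, 1, 1, 0, 1],
    [1, 0, 0, 0, 0, 0, 1],
    [1, 1, 1, 1, 1, 1, 1],
]
TRIANGLES = [
    [1, 1, 1, 1, 1, 1, 1],
    [1, 0, 0, 0, 0, 0, 1],
    [1, 0, 1, 0, 0, 0, 1],
    [1, 1, 1, 1, 0, 0, 1],
    [1, 0, 0, 0, 1, 0, 1],
    [1, 0, 0, 1, 1, 1, 1],
    [1, 1, 1, 1, 1, 1, 1],
]


def place(patch, size, top, left):
    """Draw patch onto an empty size x size canvas at (top, left)."""
    canvas = [[0] * size for _ in range(size)]
    for r, row in enumerate(patch):
        for c, value in enumerate(row):
            canvas[top + r][left + c] = value
    return canvas


def pixel_distance(a, b):
    """Euclidean distance between two images compared pixel by pixel."""
    return math.sqrt(sum((x - y) ** 2
                         for row_a, row_b in zip(a, b)
                         for x, y in zip(row_a, row_b)))


fig_a = place(FACE, 14, 0, 0)        # face, top left
fig_b = place(TRIANGLES, 14, 0, 0)   # triangles in a circle, top left
fig_c = place(FACE, 14, 7, 7)        # same face, bottom right

# Hand-assigned feature vectors over [circle, eyes, mouth].
feat_a, feat_b, feat_c = [0.9, 0.9, 0.9], [0.9, 0.1, 0.1], [0.9, 0.9, 0.9]

print(pixel_distance(fig_a, fig_b) < pixel_distance(fig_a, fig_c))
print(math.dist(feat_a, feat_c) < math.dist(feat_a, feat_b))
```

Both comparisons print `True`: pixel by pixel, the face looks closer to the triangles that share its position, whereas in the feature space the two faces coincide regardless of where they sit in the image.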
- Other applications of the present invention include reverse searches.
- the system may return a high-level grouping or a more granular grouping, or even, in some implementations, a specific document.
- FIG. 3 is a flowchart detailing the steps in a method according to one aspect of the invention.
- the features of a subject document are extracted by a feature extraction module, resulting in a numeric vector representation of the subject document. That numeric vector representation and the grouped reference data 40 are passed to the matching module.
- the matching module determines the matching grouping for the subject document, and at step 330, the subject document is associated with that matching grouping. As discussed above, the matching module typically determines the matching grouping based on a distance between the new vector representation and a centroid of each grouping. This matching process is performed for every new subject document.
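The three stages of the method (extract, match, associate) can be strung together as in the sketch below. The feature extractor here is a toy keyword counter standing in for the trained neural network the source describes, and the centroids are hypothetical; none of the names are taken from the source.

```python
import math

FEATURES = ["hockey", "basketball", "soccer"]

# Hypothetical grouping centroids in the [hockey, basketball, soccer] space.
CENTROIDS = {
    "hockey": [0.8, 0.1, 0.1],
    "basketball": [0.1, 0.8, 0.1],
    "soccer": [0.1, 0.1, 0.8],
}


def extract_features(text, features=FEATURES):
    """Toy stand-in for the feature extraction module.

    Returns normalized keyword frequencies; a real system would use a
    trained model instead.
    """
    words = text.lower().split()
    counts = [words.count(feature) for feature in features]
    total = sum(counts)
    if total == 0:
        return [1.0 / len(features)] * len(features)
    return [count / total for count in counts]


def match(vector, centroids=CENTROIDS):
    """Matching module: pick the grouping with the nearest centroid."""
    return min(centroids, key=lambda name: math.dist(vector, centroids[name]))


def classify(text, assignments):
    """Extract, match, then associate the document with its grouping."""
    vector = extract_features(text)
    grouping = match(vector)
    assignments.setdefault(grouping, []).append(text)  # association (step 330)
    return grouping


assignments = {}
article = "the hockey team won the hockey final after a soccer friendly"
label = classify(article, assignments)
print(label, assignments)
```

Each new subject document would go through `classify` independently, so the reference data only needs to be summarized into centroids once.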
- the present invention can automatically classify large groups of subject documents without human intervention.
- the present invention can be seen as the use of a neural encoder with a proxy task related to the task one seeks to complete.
- the result is a fast unsupervised classification technique that takes into account the entire past and which uses post processing methods to refine the results.
- one aspect of the invention uses an already existing fast nearest-neighbours retrieval technique to perform classification. The results are then refined by weighting the contribution of each neighbour using non-parametric methods. As one example, the contribution of neighbours is weighted with respect to the recency of the document. In one variant, a neural network may be used to output a weight for the examples.
- In another aspect, the present invention uses supervised training to learn
- a closely related problem, which is higher-level than the problem sought to be solved, is used to build the neural encoder that will yield a suitable feature space.
- any references herein to 'image' or to 'images' refer to a digital image or to digital images, comprising pixels or picture cells.
- any references to an 'audio file' or to 'audio files' refer to digital audio files, unless otherwise specified.
- 'Video', 'video files', 'data objects', 'data files' and all other such terms should be taken to mean digital files and/or data objects, unless otherwise specified.
- the present invention may thus take the form of computer executable instructions that, when executed, implement various software modules with predefined functions.
- the embodiments of the invention may be executed by a computer processor or similar device programmed in the manner of method steps, or may be executed by an electronic system which is provided with means for executing these steps.
- an electronic memory means such as computer diskettes, CD-ROMs, Random Access Memory (RAM), Read Only Memory (ROM) or similar computer software storage media known in the art, may be programmed to execute such method steps.
- electronic signals representing these method steps may also be transmitted via a communication network.
- Embodiments of the invention may be implemented in any conventional computer programming language.
- preferred embodiments may be implemented in a procedural programming language (e.g., "C") or an object-oriented language (e.g., "C++", "Java", "PHP", "Python" or "C#").
- Alternative embodiments of the invention may be implemented as pre-programmed hardware elements, other related components, or as a combination of hardware and software components.
- Embodiments can be implemented as a computer program product for use with a computer system.
- Such implementations may include a series of computer instructions fixed either on a tangible medium, such as a computer readable medium (e.g., a diskette, CD-ROM, ROM, or fixed disk) or transmittable to a computer system, via a modem or other interface device, such as a communications adapter connected to a network over a medium.
- the medium may be either a tangible medium (e.g., optical or electrical communications lines) or a medium implemented with wireless techniques (e.g., microwave, infrared or other transmission techniques).
- the series of computer instructions embodies all or part of the functionality previously described herein. Those skilled in the art should appreciate that such computer instructions can be written in a number of programming languages for use with many computer architectures or operating systems.
- Such instructions may be stored in any memory device, such as semiconductor, magnetic, optical or other memory devices, and may be transmitted using any communications technology, such as optical, infrared, microwave, or other transmission technologies.
- a computer program product may be distributed as a removable medium with accompanying printed or electronic documentation (e.g., shrink-wrapped software), preloaded with a computer system (e.g., on system ROM or fixed disk), or distributed from a server over a network (e.g., the Internet or World Wide Web).
- some embodiments of the invention may be implemented as a combination of both software (e.g., a computer program product) and hardware. Still other embodiments of the invention may be implemented as entirely hardware, or entirely software (e.g., a computer program product).
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Health & Medical Sciences (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Life Sciences & Earth Sciences (AREA)
- Databases & Information Systems (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Biomedical Technology (AREA)
- Biophysics (AREA)
- Evolutionary Computation (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention relates to systems and methods for associating an unknown subject document with other documents based on known features of the other documents. The subject document is passed through a feature extraction module, which represents the features of the subject document as a numeric vector having n dimensions. A matching module receives that vector and reference data. The reference data is pre-divided into n groupings, with each grouping corresponding to at least one specific feature. The matching module compares the features of the subject document to features of the reference data and determines a matching grouping for the subject document. The subject document is then associated with that matching grouping.
Applications Claiming Priority (4)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CA3007547A CA3007547A1 (fr) | 2018-06-07 | 2018-06-07 | Unsupervised classification of documents using a labeled data set of other documents |
US16/002,811 US20190377823A1 (en) | 2018-06-07 | 2018-06-07 | Unsupervised classification of documents using a labeled data set of other documents |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2019232645A1 (fr) | 2019-12-12 |
Family
ID=68769684
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CA2019/050806 WO2019232645A1 (fr) | 2018-06-07 | 2019-06-07 | Unsupervised classification of documents using a labeled data set of other documents |
Country Status (1)
Country | Link |
---|---|
WO (1) | WO2019232645A1 (fr) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111079442A (zh) * | 2019-12-20 | 2020-04-28 | 北京百度网讯科技有限公司 | Vectorized representation method and apparatus for documents, and computer device |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20050234952A1 (en) * | 2004-04-15 | 2005-10-20 | Microsoft Corporation | Content propagation for enhanced document retrieval |
US20180046649A1 (en) * | 2016-08-12 | 2018-02-15 | Aquifi, Inc. | Systems and methods for automatically generating metadata for media documents |
- 2019-06-07: WO application PCT/CA2019/050806, published as WO2019232645A1 (fr); status: active, Application Filing
Non-Patent Citations (2)
Title |
---|
INDYK, P. ET AL.: "Approximate Nearest Neighbors: Towards Removing the Curse of Dimensionality", PROCEEDINGS OF THE THIRTIETH ANNUAL ACM SYMPOSIUM ON THEORY OF COMPUTING, 23 May 1998 (1998-05-23), pages 604 - 613, XP058190106 * |
PAULOVICH, F.V. ET AL.: "HiPP: A Novel Hierarchical Point Placement Strategy and its Application to the Exploration of Document Collections", IEEE TRANSACTIONS ON VISUALIZATION AND COMPUTER GRAPHICS, vol. 14, no. 6, 19 October 2008 (2008-10-19), pages 1229 - 1236, XP011324177, DOI: 10.1109/TVCG.2008.138 * |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
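CN111079442A (zh) * | 2019-12-20 | 2020-04-28 | 北京百度网讯科技有限公司 | Vectorized representation method and apparatus for documents, and computer device |
CN111079442B (zh) * | 2019-12-20 | 2021-05-18 | 北京百度网讯科技有限公司 | Vectorized representation method and apparatus for documents, and computer device |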
US11403468B2 (en) | 2019-12-20 | 2022-08-02 | Beijing Baidu Netcom Science And Technology Co., Ltd. | Method and apparatus for generating vector representation of text, and related computer device |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US10949702B2 (en) | System and a method for semantic level image retrieval | |
US10482146B2 (en) | Systems and methods for automatic customization of content filtering | |
US9569698B2 (en) | Method of classifying a multimodal object | |
Karthikeyan et al. | Probability based document clustering and image clustering using content-based image retrieval | |
US20240273134A1 (en) | Image encoder training method and apparatus, device, and medium | |
US20190377823A1 (en) | Unsupervised classification of documents using a labeled data set of other documents | |
CN115098690B (zh) | Multi-data document classification method and system based on cluster analysis | |
Etezadifar et al. | Scalable video summarization via sparse dictionary learning and selection simultaneously | |
US20160026656A1 (en) | Retrieving/storing images associated with events | |
JP2011070244A (ja) | Image search device, image search method, and program | |
CN113408282B (zh) | Topic model training and topic prediction method, apparatus, device, and storage medium | |
WO2019232645A1 (fr) | Unsupervised classification of documents using a labeled data set of other documents | |
Banerjee et al. | A novel centroid based sentence classification approach for extractive summarization of COVID-19 news reports | |
KR102590388B1 (ko) | Apparatus and method for recommending video content | |
Tejaswi Nayak et al. | Video retrieval using residual networks | |
Jakhar et al. | Classification and Measuring Accuracy of Lenses Using Inception Model V3 | |
Inoue et al. | Few-shot adaptation for multimedia semantic indexing | |
CA3007547A1 (fr) | Unsupervised classification of documents using a labeled data set of other documents | |
Sankar et al. | Robust Feature Extraction Using Embedded Gan in Image Retrieval Systems for SEMI-Supervised Data | |
Hatem et al. | Exploring feature dimensionality reduction methods for enhancing automatic sport image annotation | |
Kim | Efficient histogram dictionary learning for text/image modeling and classification | |
Selvakumar et al. | Healthcare Multimedia Data Analysis Algorithms Tools and Applications | |
Rodríguez et al. | Meaningful bags of words for medical image classification and retrieval | |
Bommisetty et al. | Content based video retrieval—methods, techniques and applications | |
Ramya et al. | XML based approach for object oriented medical video retrieval using neural networks |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 19815352; Country of ref document: EP; Kind code of ref document: A1 |
| NENP | Non-entry into the national phase | Ref country code: DE |
| 122 | Ep: pct application non-entry in european phase | Ref document number: 19815352; Country of ref document: EP; Kind code of ref document: A1 |