CN112084338B - Automatic document classification method, system, computer equipment and storage medium - Google Patents

Automatic document classification method, system, computer equipment and storage medium

Info

Publication number
CN112084338B
Authority
CN
China
Prior art keywords
semantic
vector
text
training
language model
Prior art date
Legal status
Active
Application number
CN202010983960.XA
Other languages
Chinese (zh)
Other versions
CN112084338A (en)
Inventor
侯聪
陈运文
纪达麒
韩伟
白良俊
文敏
Current Assignee
Daguan Data Chengdu Co ltd
Original Assignee
Daguan Data Chengdu Co ltd
Priority date
Filing date
Publication date
Application filed by Daguan Data Chengdu Co ltd
Priority to CN202010983960.XA
Publication of CN112084338A
Application granted
Publication of CN112084338B
Legal status: Active

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 - Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 - Clustering; Classification
    • G06F16/353 - Clustering; Classification into predefined classes
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 - Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/24 - Classification techniques
    • G06F18/241 - Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2413 - Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on distances to training or reference patterns
    • G06F18/24147 - Distances to closest patterns, e.g. nearest neighbour classification
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 - Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention discloses an automatic document classification method, system, computer device and storage medium. The method first trains a language model from similar text data: pre-training proceeds in two stages, first training a language model on unlabeled text data, then training a semantic encoder on top of the language model using labeled data, namely similar text data. It then archives and classifies documents based on the semantic encoder: using the nearest-neighbor idea, texts are classified on a small dataset with an unsupervised method. Because the general-purpose semantic encoder is trained on a large amount of general-domain data, it encodes semantics effectively, requires no additional training on the very small dataset of a new practical scenario, and avoids the overfitting that would otherwise degrade generalization. Added or deleted documents and modifications to the classification taxonomy take effect simply by updating the affected documents or categories, without retraining the model, so timeliness is good.

Description

Automatic document classification method, system, computer equipment and storage medium
Technical Field
The present invention relates to the field of natural language processing technologies, and in particular, to a method, a system, a computer device, and a storage medium for automatically classifying documents.
Background
A document classification system is a text classification application in which a model automatically assigns input text to a category. Document classification is typically applied in specific, relatively narrow professional domains, and therefore faces two problems:
1. Labeled data is scarce. In many application domains, public data is hard to collect owing to high specialization or confidentiality requirements; in deployment, some categories may contain only a handful of documents;
2. In practical use, users dynamically add and delete data, and may even change the classification taxonomy over time.
Common supervised text classifiers, such as FastText, TextCNN, and BERT-based text classification applications, all follow the same paradigm: training data is collected for a given taxonomy, a model is trained, and document classification predictions can be made once training finishes. This scheme works well when fully annotated data is available; however, applying it directly to document classification in professional domains with sparse labeled data has the following drawbacks:
1. With little data, the model overfits easily, reducing its generalization capability and degrading prediction quality;
2. When data or the taxonomy changes dynamically, the model must be retrained frequently, placing heavy demands on hardware.
Disclosure of Invention
To solve these problems, the invention provides an automatic document classification method, system, computer device and storage medium, in which a semantic model pre-trained on a large amount of data, and therefore with strong generalization capability, is applied to small-data scenarios so that overfitting is avoided; meanwhile, the nearest-neighbor idea over cluster centers is used to retrieve the matching category, avoiding the frequent model retraining required in practical applications.
The invention discloses an automatic document classification method, comprising the following steps:
S1, train a language model from similar text data: pre-training proceeds in two stages, first training a language model on unlabeled text data, then training a semantic encoder on top of the language model using labeled data, namely similar text data;
S2, archive and classify based on the semantic encoder: using the nearest-neighbor idea, texts are classified on a small dataset with an unsupervised method built on the semantic encoder.
Further, step S1 comprises the sub-steps of:
s101, training by adopting unlabeled text data: selecting a language model based on a self-attention framework, and training the language model on unlabeled text data to enable the language model to learn the common sense of a target language;
s102, training by adopting similar text data: obtaining similar text data in the general field to form similar text pairs comprising anchor texts and similar texts, and randomly obtaining a dissimilar text in a corpus aiming at each similar text pair to form training data comprising the anchor texts, the similar texts and the dissimilar texts, so that the anchor texts and the similar texts are semantically related and are semantically unrelated; training a plurality of pieces of training data based on the language model, respectively inputting anchor point text, similar text and dissimilar text into the same language model, and respectively obtaining vectors V representing respective semantics a ,V p ,V n And then, calculating a ternary loss function to obtain loss, and retraining the language model to obtain the semantic encoder.
Further, the expression of the ternary loss function is as follows:
loss = max{ ||V_a - V_p||_2 - ||V_a - V_n||_2 + margin, 0 }
wherein loss is the loss value, ||V_a - V_p||_2 denotes the spatial distance between V_a and V_p, ||V_a - V_n||_2 denotes the spatial distance between V_a and V_n, and margin is a constant representing a desired spatial separation; the ternary loss function pulls the anchor text and the similar text closer together while pushing the anchor text and the dissimilar text farther apart.
Further, the semantic encoder thereby learns the ability to semantically encode text: the more similar two texts are, the closer in space their semantic vectors produced by the semantic encoder lie, and conversely, the less similar the texts, the farther apart the vectors lie.
Further, obtaining similar text data in the general domain includes crawling similar-text recommendation information from websites with a web crawler.
Further, step S2 comprises the sub-steps of:
s201, constructing a classification system and uploading a plurality of documents to each classification: uploading the documents to the semantic encoder according to classification, and encoding each document by the semantic encoder to obtain a semantic vector and storing the semantic vector into a vector database according to a classification system; the semantic vector under each category forms a vector set, and the cluster center of the vector set is calculated to be used as the feature vector of the category;
s202, classifying new documents: uploading a document to be classified by a user, and carrying out semantic vector coding by the semantic coder to obtain a semantic vector of the document to be classified; searching a feature vector closest to the semantic vector space of the document to be classified in the vector database, and classifying the document to be classified into a class corresponding to the feature vector.
Further, the user can modify the classification system or the classified document, and the feature vector of the modified classification changes accordingly.
The invention relates to an automatic document classification system, which comprises a semantic encoder, a vector database and a vector retrieval module;
the semantic encoder is trained on the basis of a language model according to unlabeled text data and similar text data, and is used for storing semantic vectors obtained by encoding classified documents into the vector database according to a classification system and outputting the semantic vectors obtained by encoding the documents to be classified to the vector retrieval module;
in the vector database, semantic vectors under each category form a vector set, and the cluster center of the vector set is used as the characteristic vector of the category;
the vector retrieval module is used for searching the feature vector closest to the semantic vector space of the document to be classified in the vector database, and classifying the document to be classified into the category corresponding to the feature vector.
Further, a computer device comprises a memory and a processor, the memory storing a computer program; when the processor executes the program, the steps of the above automatic document classification method are implemented.
Further, a computer-readable storage medium stores a computer program which, when executed by a processor, implements the steps of the above automatic document classification method.
The invention has the beneficial effects that:
1. In the automatic document classification method, the first step runs offline: two-stage pre-training yields a semantic encoder that fully represents the semantic features of text, keeping the bulk of model-training computation offline. The second step runs online: texts are classified on the small dataset with an unsupervised nearest-neighbor method, and because the online part is unsupervised, pressure on hardware is relieved.
2. The general-purpose semantic encoder is trained on a large amount of general-domain data, so it encodes semantics effectively; no additional training is needed on the very small dataset of a new practical scenario, which avoids overfitting and the poor generalization it would cause.
3. Changes take effect simply by updating the affected documents or categories, with no model retraining, giving good timeliness and low hardware requirements.
Drawings
FIG. 1 is a schematic diagram of training data based on a language model in an embodiment of the present invention;
FIG. 2 is a schematic diagram of archive categorization based on semantic encoders in an embodiment of the invention.
Detailed Description
Specific embodiments of the present invention will now be described in order to provide a clearer understanding of the technical features, objects and effects of the present invention. It should be understood that the particular embodiments described herein are illustrative only and are not intended to limit the invention, i.e., the embodiments described are merely some, but not all, of the embodiments of the invention. All other embodiments, which can be made by a person skilled in the art without making any inventive effort, are intended to be within the scope of the present invention.
Example 1
This embodiment provides an automatic document classification method, comprising the following steps:
S1, train a language model from similar text data: pre-training proceeds in two stages, first training a language model on unlabeled text data, then training a semantic encoder on top of the language model using labeled data, namely similar text data;
S2, archive and classify based on the semantic encoder: using the nearest-neighbor idea, texts are classified on a small dataset with an unsupervised method built on the semantic encoder.
Step S1 is performed offline: two-stage pre-training yields a semantic encoder capable of fully representing the semantic features of text, so the bulk of model-training computation stays offline. Step S2 is performed online: texts are classified on the small dataset with an unsupervised nearest-neighbor method, and because the online part is unsupervised, pressure on hardware is relieved.
Specifically, step S1 includes the following sub-steps:
s101, training by adopting unlabeled text data: selecting a language model based on a self-attention framework, and training the language model on unlabeled text data to enable the language model to learn the common sense of a target language;
s102, training by adopting similar text data: obtaining similar text data in the general field (for example, crawling similar text recommendation information of a website through a crawler) to form similar text pairs comprising anchor texts and similar texts, and randomly fetching a dissimilar text in a corpus for each similar text pair to form training data comprising the anchor texts, the similar texts and the dissimilar texts, so that the anchor texts and the similar texts are semantically related and semantically uncorrelated; training several pieces of training data based on language model, as shown in figure 1, respectively inputting anchor text, similar text and dissimilar text into the same language model (meaning encoder in figure 1), and respectively obtaining vectors V representing respective semantics a ,V p ,V n And then, calculating a ternary loss function to obtain loss, and retraining a language model to obtain the semantic encoder.
For example, take a similar text pair A, B, where A and B are semantically similar, and randomly draw a text C such that C has no semantic-similarity relationship with either A or B. A then serves as the anchor text: during training, the ternary loss function pulls the encoding vector of B (similar in meaning to A) close to the vector of A, and pushes the encoding vector of C (dissimilar in meaning to A) away from A. In this process, text A acts as the "anchor", serving as the reference point. Similarly, text B may also be used as an anchor text.
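The patent does not fix a concrete sampling procedure; the following is a minimal sketch of how such triples might be assembled, with illustrative function names and the assumption that the corpus is much larger than any one pair.

```python
import random

def sample_negative(rng, corpus, exclude):
    """Draw a random corpus text outside `exclude` to serve as the
    dissimilar (negative) member of a triple."""
    negative = rng.choice(corpus)
    while negative in exclude:  # assumes the corpus dwarfs `exclude`
        negative = rng.choice(corpus)
    return negative

def build_triplets(similar_pairs, corpus, seed=42):
    """Assemble (anchor, similar, dissimilar) training triples from
    similar-text pairs plus randomly drawn negatives."""
    rng = random.Random(seed)
    triplets = []
    for a, b in similar_pairs:
        # A as anchor, B as the similar text ...
        triplets.append((a, b, sample_negative(rng, corpus, (a, b))))
        # ... and, as noted above, B may equally serve as the anchor.
        triplets.append((b, a, sample_negative(rng, corpus, (a, b))))
    return triplets
```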
More specifically, the ternary loss function is given by the following expression:
loss = max{ ||V_a - V_p||_2 - ||V_a - V_n||_2 + margin, 0 }
where loss is the loss value; a denotes the anchor text, p denotes a text similar to the anchor (positive), and n denotes a text dissimilar to the anchor (negative); V_a, V_p, V_n are the semantic vectors of the three texts obtained after passing through the encoder. ||·||_2 denotes the 2-norm of a vector, so ||V_a - V_p||_2 is the spatial distance between V_a and V_p, and ||V_a - V_n||_2 is the spatial distance between V_a and V_n; margin is a constant representing a desired spatial separation. Optimizing the ternary loss function means making the distance between V_a and V_n exceed the distance between V_a and V_p, preferably by more than margin.
The ternary loss function pulls the anchor text and similar text closer together while pushing the anchor text and dissimilar text farther apart. The semantic encoder thereby learns the ability to semantically encode text: the more similar two texts are, the closer their semantic vectors lie in space, and conversely, the less similar they are, the farther apart the vectors lie.
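For reference, the ternary loss can be written in a few lines of PyTorch; this is a minimal sketch assuming batched encoder outputs, not the patent's implementation (torch.nn.TripletMarginLoss offers equivalent behavior out of the box).

```python
import torch
import torch.nn.functional as F

def ternary_loss(v_a, v_p, v_n, margin=1.0):
    """loss = max(||V_a - V_p||_2 - ||V_a - V_n||_2 + margin, 0),
    averaged over a batch of (anchor, similar, dissimilar) vectors,
    each of shape (batch, dim)."""
    d_pos = F.pairwise_distance(v_a, v_p, p=2)  # ||V_a - V_p||_2
    d_neg = F.pairwise_distance(v_a, v_n, p=2)  # ||V_a - V_n||_2
    return torch.clamp(d_pos - d_neg + margin, min=0.0).mean()
```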
Specifically, as shown in FIG. 2, step S2 includes the following sub-steps:
s201, constructing a classification system and uploading a plurality of documents to each classification: uploading the documents to a semantic encoder according to classification, encoding each document by the semantic encoder to obtain semantic vectors, and storing the semantic vectors into a vector database according to a classification system; the semantic vector under each category forms a vector set, and the cluster center of the vector set is calculated to be used as the feature vector of the category;
s202, classifying new documents: uploading a document to be classified by a user, and carrying out semantic vector coding by a semantic coder to obtain a semantic vector of the document to be classified; searching a feature vector closest to the semantic vector space of the document to be classified in the vector database, and classifying the document to be classified into a class corresponding to the feature vector.
Alternatively, the user can modify the taxonomy or the archived documents, and the feature vectors of the modified categories change accordingly.
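Sub-steps S201 and S202 reduce to a mean vector per category plus a nearest-neighbor lookup. A minimal sketch, assuming documents are already encoded as numpy vectors and taking the arithmetic mean as the cluster center (the patent says "cluster center" without fixing the algorithm):

```python
import numpy as np

def feature_vector(semantic_vectors):
    """S201: the cluster center of one category's vector set, used as the
    category's feature vector; here the element-wise mean."""
    return np.mean(np.stack(semantic_vectors), axis=0)

def classify(doc_vector, feature_vectors):
    """S202: return the category whose feature vector is spatially closest
    (Euclidean distance) to the new document's semantic vector."""
    return min(feature_vectors,
               key=lambda c: np.linalg.norm(doc_vector - feature_vectors[c]))
```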
In addition, the embodiment provides an automatic document classification system, which comprises a semantic encoder, a vector database and a vector retrieval module, wherein:
the semantic encoder is trained on the basis of a language model according to unlabeled text data and similar text data, and is used for storing semantic vectors obtained by encoding classified documents into a vector database according to a classification system and outputting the semantic vectors obtained by encoding the documents to be classified to a vector retrieval module;
in the vector database, the semantic vector under each category forms a vector set, and the cluster center of the vector set is used as the characteristic vector of the category;
and the vector retrieval module is used for searching the feature vector closest to the semantic vector space of the document to be classified in the vector database and classifying the document to be classified into the category corresponding to the feature vector.
This embodiment also provides a computer device comprising a memory and a processor, the memory storing a computer program; when the processor executes the program, the steps of the above automatic document classification method are implemented.
This embodiment also provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the steps of the above automatic document classification method.
Example 2
This example builds on Example 1:
the automatic document classification method of the embodiment comprises the following two stages:
1. system preparation phase
1. Crawl a large amount of text from the web and train a language model based on a self-attention architecture (one possible setup is sketched after this list).
2. Crawl a large volume of related-question and related-document data from various websites, and form similar-text training data by random sampling.
3. Train the semantic encoder on top of the language model using the large body of similar-text training data.
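The patent specifies only "a language model based on a self-attention architecture". As a hedged illustration, masked-language-model pretraining of a BERT-style model with the Hugging Face transformers library is one common way to realize step 1; the corpus file name and hyperparameters below are assumptions, not values from the patent.

```python
from datasets import load_dataset
from transformers import (BertConfig, BertForMaskedLM, BertTokenizerFast,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

# Hypothetical file of crawled, unlabeled text, one passage per line.
raw = load_dataset("text", data_files={"train": "crawled_corpus.txt"})

tokenizer = BertTokenizerFast.from_pretrained("bert-base-chinese")
tokenized = raw["train"].map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=128),
    batched=True, remove_columns=["text"])

# A freshly initialized self-attention (Transformer) language model.
model = BertForMaskedLM(BertConfig(vocab_size=tokenizer.vocab_size))
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True,
                                           mlm_probability=0.15)

Trainer(model=model,
        args=TrainingArguments(output_dir="lm_checkpoint",
                               num_train_epochs=1,
                               per_device_train_batch_size=16),
        train_dataset=tokenized,
        data_collator=collator).train()
```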
2. Implementation stage
In the user's scenario, documents fall into two categories, financial and personnel, but each category has only three sample documents. The implementation proceeds as follows:
1. The system converts the three financial documents into semantic vectors through the semantic encoder, computes the cluster-center vector vector_accounting of the three vectors, and stores it in the vector database. The same operation on the personnel documents yields their representative vector vector_hr. After this step, the classification taxonomy is established.
2. The user enters a new document D and uploads it to the system. D is converted into a semantic vector by the semantic encoder; if vector retrieval finds that the nearest category vector is vector_hr, the system classifies document D as a personnel document.
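This two-class scenario can be walked through the DocumentClassificationSystem sketch above; the toy encoder and document strings below are placeholders for the trained semantic encoder and real documents.

```python
import numpy as np

def toy_encoder(text):
    """Stand-in for the trained semantic encoder: a hash-seeded random
    embedding, deterministic within one run, just to exercise the flow."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.standard_normal(16)

clf = DocumentClassificationSystem(toy_encoder)
# Three samples per class establish the taxonomy (vector_accounting, vector_hr).
clf.archive("finance", ["Q3 budget report", "invoice ledger", "audit summary"])
clf.archive("hr", ["onboarding checklist", "leave policy", "payroll roster"])

# An archived text's own vector pulls its class centroid toward it, so this
# very likely resolves to "finance".
print(clf.classify("audit summary"))
```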
Alternatively, if the user introduces a new document category, such as reports, with two sample documents, the implementation proceeds as follows (see the snippet below):
After the user creates the new category and uploads the samples, the semantic encoder converts the two samples into semantic vectors, computes their cluster center vector_report, and stores it in the vector database. When the user subsequently uploads a document to be classified, the report category is included in the candidate set during classification.
Optionally, if the user adds documents to or deletes documents from an existing category, the implementation proceeds as follows:
1. The user deletes one of the financial documents; the system automatically recomputes the cluster center of the remaining financial documents' semantic vectors to obtain a new vector_accounting and stores it in the vector database. When new documents are subsequently classified, the financial category is represented by the new vector_accounting.
2. The user adds a financial document; the semantic encoder converts the new document into a semantic vector, and the system automatically recomputes the cluster center over the semantic vectors of all financial documents, including the new one, storing the result in the vector database as the new vector_accounting. When new documents are subsequently classified, the financial category is represented by the new vector_accounting.
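The add and delete behavior just described amounts to re-computing the affected cluster center. A sketch extending DocumentClassificationSystem from above, with hypothetical method names; no retraining of the encoder is involved.

```python
import numpy as np

class DynamicDocumentClassificationSystem(DocumentClassificationSystem):
    """Adding or deleting an archived document only refreshes that
    category's cluster center (e.g. the new vector_accounting)."""

    def add_document(self, category, document):
        self.vectors.setdefault(category, []).append(self.encoder(document))
        self._refresh(category)

    def delete_document(self, category, index):
        self.vectors[category].pop(index)
        self._refresh(category)

    def _refresh(self, category):
        vecs = self.vectors[category]
        if vecs:   # recompute the category's feature vector
            self.feature_vectors[category] = np.mean(np.stack(vecs), axis=0)
        else:      # an emptied category drops out of classification
            del self.vectors[category]
            self.feature_vectors.pop(category, None)
```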
The foregoing is merely a preferred embodiment of the invention. It is to be understood that the invention is not limited to the forms disclosed herein and is not to be construed as excluding other embodiments; it may be used in various other combinations, modifications and environments, and may be altered within the scope of the inventive concept described herein, whether through the above teachings or through the skill and knowledge of the relevant art. Modifications and variations made by those skilled in the art that do not depart from the spirit and scope of the invention are intended to fall within the scope of the appended claims.

Claims (6)

1. An automatic document classification method, characterized by comprising the following steps:
S1, training a language model offline from similar text data: pre-training proceeds in two stages, first training a language model on unlabeled text data, then training a semantic encoder on top of the language model using labeled data, namely similar text data;
S2, archiving and classifying online based on the semantic encoder: using the nearest-neighbor idea, classifying texts on a small dataset with an unsupervised method built on the semantic encoder;
step S1 comprises the following sub-steps:
s101, training by adopting unlabeled text data: selecting a language model based on a self-attention framework, and training the language model on unlabeled text data to enable the language model to learn the common sense of a target language;
s102, training by using similar text data: obtaining similar text data in the general field to form similar text pairs comprising anchor texts and similar texts, and randomly obtaining a dissimilar text in a corpus aiming at each similar text pair to form training data comprising the anchor texts, the similar texts and the dissimilar texts, so that the anchor texts and the similar texts are semantically related and are semantically unrelated; training a plurality of pieces of training data based on the language model, respectively inputting anchor point text, similar text and dissimilar text into the same language model, and respectively obtaining vectors representing respective semanticsV a , V p , V n Then, calculating a ternary loss function to obtain loss, and retraining the language model to obtain a semantic encoder;
the expression of the ternary loss function is as follows:
wherein,lossto be lost, ||V a - V p || 2 Representation ofV a AndV p distance in space, |V a - V n || 2 Representation ofV a AndV n the distance in the space is such that,marginis constant and represents a desired spatial distance; the ternary loss function can shorten the distance between the anchor point text and the similar text, and the distance between the anchor point text and the dissimilar text is shortened;
step S2 comprises the following sub-steps:
s201, constructing a classification system and uploading a plurality of documents to each classification: uploading the documents to the semantic encoder according to classification, and encoding each document by the semantic encoder to obtain a semantic vector and storing the semantic vector into a vector database according to a classification system; the semantic vector under each category forms a vector set, and the cluster center of the vector set is calculated to be used as the feature vector of the category;
s202, classifying new documents: uploading a document to be classified by a user, and carrying out semantic vector coding by the semantic coder to obtain a semantic vector of the document to be classified; searching a feature vector closest to the semantic vector space of the document to be classified in the vector database, and classifying the document to be classified into a class corresponding to the feature vector;
the user can modify the classification system or the classified documents, and the feature vectors of the modified classification change accordingly.
2. The automatic document classification method according to claim 1, wherein the semantic encoder learns the ability to semantically encode text, i.e., the more similar two texts are, the closer in space their semantic vectors produced by the semantic encoder lie, and conversely the farther apart they lie.
3. The automatic document classification method according to claim 1, wherein obtaining similar text data in the general domain comprises: crawling similar-text recommendation information from websites with a web crawler.
4. An automatic document classification system based on the automatic document classification method according to claim 1, comprising a semantic encoder, a vector database and a vector retrieval module;
the semantic encoder is trained on the basis of a language model according to unlabeled text data and similar text data, and is used for storing semantic vectors obtained by encoding classified documents into the vector database according to a classification system and outputting the semantic vectors obtained by encoding the documents to be classified to the vector retrieval module;
in the vector database, semantic vectors under each category form a vector set, and the cluster center of the vector set is used as the characteristic vector of the category;
the vector retrieval module is used for searching the feature vector closest to the semantic vector space of the document to be classified in the vector database, and classifying the document to be classified into the category corresponding to the feature vector.
5. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor implements the steps of the method of any of claims 1-3 when the computer program is executed.
6. A computer readable storage medium storing a computer program, characterized in that the computer program when executed by a processor implements the steps of the method of any of claims 1-3.
CN202010983960.XA 2020-09-18 2020-09-18 Automatic document classification method, system, computer equipment and storage medium Active CN112084338B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010983960.XA CN112084338B (en) 2020-09-18 2020-09-18 Automatic document classification method, system, computer equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010983960.XA CN112084338B (en) 2020-09-18 2020-09-18 Automatic document classification method, system, computer equipment and storage medium

Publications (2)

Publication Number Publication Date
CN112084338A CN112084338A (en) 2020-12-15
CN112084338B (en) 2024-02-06

Family

ID=73736568

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010983960.XA Active CN112084338B (en) 2020-09-18 2020-09-18 Automatic document classification method, system, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN112084338B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116233304B (en) * 2022-11-30 2024-04-05 荣耀终端有限公司 Schedule-based equipment state synchronization system, method and device
CN116910275B (en) * 2023-09-12 2023-12-15 无锡容智技术有限公司 Form generation method and system based on large language model

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107220311A (en) * 2017-05-12 2017-09-29 北京理工大学 A kind of document representation method of utilization locally embedding topic modeling
CN108595632A (en) * 2018-04-24 2018-09-28 福州大学 A kind of hybrid neural networks file classification method of fusion abstract and body feature
CN109003625A (en) * 2018-07-27 2018-12-14 中国科学院自动化研究所 Speech-emotion recognition method and system based on ternary loss
CN110457475A (en) * 2019-07-25 2019-11-15 阿里巴巴集团控股有限公司 A kind of method and system expanded for text classification system construction and mark corpus
CN110889443A (en) * 2019-11-21 2020-03-17 成都数联铭品科技有限公司 Unsupervised text classification system and unsupervised text classification method
CN111259850A (en) * 2020-01-23 2020-06-09 同济大学 Pedestrian re-identification method integrating random batch mask and multi-scale representation learning
CN111428026A (en) * 2020-02-20 2020-07-17 西安电子科技大学 Multi-label text classification processing method and system and information data processing terminal

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11087088B2 (en) * 2018-09-25 2021-08-10 Accenture Global Solutions Limited Automated and optimal encoding of text data features for machine learning models

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107220311A (en) * 2017-05-12 2017-09-29 北京理工大学 A kind of document representation method of utilization locally embedding topic modeling
CN108595632A (en) * 2018-04-24 2018-09-28 福州大学 A kind of hybrid neural networks file classification method of fusion abstract and body feature
CN109003625A (en) * 2018-07-27 2018-12-14 中国科学院自动化研究所 Speech-emotion recognition method and system based on ternary loss
CN110457475A (en) * 2019-07-25 2019-11-15 阿里巴巴集团控股有限公司 A kind of method and system expanded for text classification system construction and mark corpus
CN110889443A (en) * 2019-11-21 2020-03-17 成都数联铭品科技有限公司 Unsupervised text classification system and unsupervised text classification method
CN111259850A (en) * 2020-01-23 2020-06-09 同济大学 Pedestrian re-identification method integrating random batch mask and multi-scale representation learning
CN111428026A (en) * 2020-02-20 2020-07-17 西安电子科技大学 Multi-label text classification processing method and system and information data processing terminal

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
Comparison and Classification of Documents Based on Layout Similarity; Jianying Hu et al.; Information Retrieval; 227-243 *
Semi-supervised triplet loss based learning of ambient audio embeddings; N. Turpault et al.; ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP); 760-764 *
Research on visual semantic embedding based on deep learning and word embedding; 杨战波; China Master's Theses Full-text Database, Information Science and Technology; I138-165 *
Microblog topic discovery combining word vectors and keyword extraction; 王立平 et al.; Modern Computer; 3-9 *

Also Published As

Publication number Publication date
CN112084338A (en) 2020-12-15

Similar Documents

Publication Publication Date Title
CN111753060B (en) Information retrieval method, apparatus, device and computer readable storage medium
US11640494B1 (en) Systems and methods for construction, maintenance, and improvement of knowledge representations
AU2016256753B2 (en) Image captioning using weak supervision and semantic natural language vector space
US9792534B2 (en) Semantic natural language vector space
US11238211B2 (en) Automatic hyperlinking of documents
US20190325342A1 (en) Embedding multimodal content in a common non-euclidean geometric space
Li et al. The automatic text classification method based on bert and feature union
GB2546360A (en) Image captioning with weak supervision
WO2023065211A1 (en) Information acquisition method and apparatus
Li et al. Image sentiment prediction based on textual descriptions with adjective noun pairs
CN111625715B (en) Information extraction method and device, electronic equipment and storage medium
CN112084338B (en) Automatic document classification method, system, computer equipment and storage medium
Zhu et al. Multi-attention based semantic deep hashing for cross-modal retrieval
CN115689672A (en) Chat type commodity shopping guide method and device, equipment and medium thereof
Tian et al. Sequential deep learning for disaster-related video classification
Wang et al. A text classification method based on LSTM and graph attention network
TW201243627A (en) Multi-label text categorization based on fuzzy similarity and k nearest neighbors
US20230162518A1 (en) Systems for Generating Indications of Relationships between Electronic Documents
CN116977701A (en) Video classification model training method, video classification method and device
US20230237093A1 (en) Video recommender system by knowledge based multi-modal graph neural networks
CN114329181A (en) Question recommendation method and device and electronic equipment
CN111666452A (en) Method and device for clustering videos
Zuo et al. Cross-modality earth mover’s distance-driven convolutional neural network for different-modality data
Sangeetha et al. Sentiment Analysis on Movie Reviews: A Comparative Analysis
Kothari et al. Multimodal context extraction for virtual meetings

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant