CN112000807A - Method for accurately classifying proposals - Google Patents

Method for accurately classifying proposals

Info

Publication number
CN112000807A
CN112000807A (application CN202010927607.XA)
Authority
CN
China
Prior art keywords
text
feature
texts
classification
proposal
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010927607.XA
Other languages
Chinese (zh)
Inventor
恒晓楠
刘永亮
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Liaoning Guonuo Technology Co ltd
Original Assignee
Liaoning Guonuo Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Liaoning Guonuo Technology Co ltd filed Critical Liaoning Guonuo Technology Co ltd
Priority to CN202010927607.XA priority Critical patent/CN112000807A/en
Publication of CN112000807A publication Critical patent/CN112000807A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/35 Information retrieval of unstructured textual data: Clustering; Classification
    • G06F16/215 Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
    • G06F16/23 Updating (databases of structured data)
    • G06F40/216 Parsing using statistical methods (natural language analysis)
    • G06F40/242 Dictionaries (lexical tools)

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Quality & Reliability (AREA)
  • Probability & Statistics with Applications (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to a method for accurately classifying proposals, which comprises the following steps: S1, obtaining proposal text samples; S2, establishing a text representation model; S3, performing text feature extraction on the text representation model and calculating feature weights, to obtain the set of text samples to be classified; S4, constructing a classifier model, automatically classifying the text samples, and finally obtaining the proposal classification result. By constructing a text representation model, a text feature extraction model, and a classifier model, the method automatically obtains a large amount of useful information from massive unstructured texts, improves the efficiency and accuracy of proposal classification, and provides technical support for follow-up work such as managing, counting, querying, and analyzing proposals.

Description

Method for accurately classifying proposals
Technical Field
The invention relates to the technical field of proposal classification algorithms, and in particular to a method for accurately classifying proposals.
Background
In current work on deputies' proposals, information-based management has largely been achieved for each working link of proposal handling, and for a time this played a positive role in improving the efficiency of proposal processing. As proposal work continues to deepen, there is an urgent need for new informatization means to improve, by scientific methods, the efficiency of business-processing and data-analysis links under the traditional management model, such as proposal review and proposal classification. Text classification methods in the prior art have low accuracy, which results in low working efficiency.
Disclosure of Invention
In order to solve the above problems, an object of the present invention is to provide a method for accurately classifying proposals. By constructing a text representation model, a text feature extraction model, and a classifier model, the method automatically obtains a large amount of useful information from massive unstructured texts, improves the efficiency and accuracy of proposal classification, and provides technical support for follow-up work such as managing, counting, querying, and analyzing proposals.
The above object of the present invention is achieved by the following technical solutions:
A method for accurately classifying proposals, which specifically comprises the following steps:
S1, obtaining proposal text samples;
S2, establishing a text representation model;
S3, performing text feature extraction on the text representation model and calculating feature weights, to obtain the set of text samples to be classified;
S4, constructing a classifier model, automatically classifying the text samples, and finally obtaining the proposal classification result.
Further, step S1 is specifically:
acquiring proposal text samples, and performing data cleaning and noise-data removal on the text samples to obtain clean text samples.
Further, step S2 specifically comprises:
in the vector space model, each text d in a text set D is represented as a vector
d = ((t_1:w_1), (t_2:w_2), …, (t_k:w_k), …, (t_n:w_n))    (1)
where t_k (k = 1, …, n) is a feature of the document space and w_k is the weight of t_k; the text set D can then be regarded as a vector space spanned by a set of orthogonal terms, forming the text representation model.
Further, step S3 specifically comprises:
suppose a text segment s in the text set D consists of n ordered words, denoted as the word sequence w_1, w_2, …, w_n. Under the bag-of-words representation, the words are treated as mutually independent one-dimensional features, so the feature set of the text segment s can be represented as {w_1, w_2, …, w_n}. The weight w_i of feature i in the text set D is then:
w_i = tf_i × idf_i = tf_i × log(N / df_i)    (2)
where tf_i is the number of times feature i appears in the text set D, idf_i is the inverse document frequency of feature i over all documents, N is the total number of documents, and df_i is the number of documents containing feature i;
formula (2) reflects the distribution of feature i over all documents in all classes, but cannot express class-specific information about feature i; therefore, the texts in the set D are randomly divided into two training classes, the IDF is computed separately within each class to localize it, and the two values are subtracted, giving the weight of feature i in the text set D as:
w_i = tf_i × [log(N_1 / df_{i,1}) − log(N_2 / df_{i,2})] = tf_i × log((N_1 × df_{i,2}) / (N_2 × df_{i,1}))    (3)
where N_1 and N_2 are the total numbers of documents in the two training classes, df_{i,1} and df_{i,2} are the numbers of documents containing feature i in each class, and tf_i is the number of times feature i appears in the text set D;
introducing the BM25 weighting scheme, w_i is modeled as:
w_i = ((k_1 + 1) × tf_i) / (K + tf_i) × log((N_1 × df_{i,2}) / (N_2 × df_{i,1}))    (4)
where
K = k_1 × ((1 − b) + b × dl / avg_dl)
and k_1 and b take their default values (k_1 = 1.2, b = 0.95), dl is the length of the document, and avg_dl is the average document length over the whole collection.
Further, step S4 specifically comprises the following steps:
First, the text samples processed in step S3 are split at punctuation marks into text segments to obtain a preprocessed text set, and a subject-word dictionary containing an initial subject-word class and a non-subject-word class is established;
Second, all texts of the preprocessed text set are scored and classified to obtain a determined classification set and an uncertain classification set, specifically:
(1) a score is computed for each text of the preprocessed text set using formula (4); texts with positive scores are labeled positive and texts with negative scores are labeled negative;
(2) Cmin = min(Cpositive, Cnegative) is computed, i.e., the texts labeled positive and the texts labeled negative are counted, and the smaller count is taken as the number of texts to place in the determined classification set, where Cpositive is the number of texts labeled positive and Cnegative is the number of texts labeled negative;
(3) at the same time, all texts of the preprocessed text set are sorted in descending order of the scores computed in step (1);
(4) polarity labeling: according to the sorting result of step (3) and the count Cmin computed in step (2), the Cmin highest-scoring texts and the Cmin lowest-scoring texts are taken from the sorted preprocessed text set; the Cmin highest-scoring texts are labeled positive and the Cmin lowest-scoring texts are labeled negative, forming the determined classification set, while the remaining texts are labeled uncertain, forming the uncertain classification set;
Third, all feature words in the texts of the determined classification set whose absolute word frequency is greater than 2 are added to the subject-word dictionary as candidate feature words, updating the dictionary, where the absolute word frequency is computed as
F = |F_p − F_n|
where F_p is the number of documents in the subject-word class in which the feature word appears, and F_n is the number of documents in the non-subject-word class in which the feature word appears;
Fourth, the texts of the uncertain classification set undergo the next round of classification calculation, entering an iterative process; when the subject-word dictionary is no longer expanded and the classification results no longer change, the iteration ends, yielding the final classification set and the proposal classification result.
The invention has the beneficial effects that: the classification method not only achieves high classification accuracy on texts with a clear feature tendency, but also effectively classifies ambiguous texts, i.e., texts containing both positive and negative feature words: the portion classified with high accuracy is used as a training set, with which the remaining texts, those with intermediate scores and uncertain polarity, are effectively classified. Candidate feature words screened from the determined classification set are used to update the subject-word dictionary; the expanded dictionary helps classify more texts, and as the dictionary and the classification sets are updated repeatedly during iteration, the classification accuracy improves significantly.
By constructing a text representation model, a text feature extraction model, and a classifier model, the method automatically obtains a large amount of useful information from massive unstructured texts, improves the efficiency and accuracy of proposal classification, and provides technical support for follow-up work such as managing, counting, querying, and analyzing proposals.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the invention and together with the description serve to explain the invention without limiting the invention.
FIG. 1 is a flow chart of the present invention;
FIG. 2 is a schematic diagram of the data preprocessing of step S1;
FIG. 3 is a flowchart of the classifier of step S4.
Detailed Description
The details of the present invention are further described below with reference to the accompanying drawings and specific embodiments.
Examples
Referring to fig. 1, the present embodiment provides a method for accurately classifying proposals, which specifically comprises the following steps:
S1, obtaining proposal text samples;
S2, establishing a text representation model;
S3, performing text feature extraction on the text representation model and calculating feature weights, to obtain the set of text samples to be classified;
S4, constructing a classifier model, automatically classifying the text samples, and finally obtaining the proposal classification result.
Step S1 specifically comprises:
obtaining proposal text samples. The collected data may be unclassified, lack topic titles, contain garbled characters, and so on, all of which negatively affect the classification result, so the data needs to be cleaned and filtered. As shown in fig. 2, the text samples are cleaned and noise data are removed to obtain clean text samples.
Step S2 specifically comprises:
in the vector space model, each text d in a text set D is represented as a vector
d = ((t_1:w_1), (t_2:w_2), …, (t_k:w_k), …, (t_n:w_n))    (1)
where t_k (k = 1, …, n) is a feature of the document space and w_k is the weight of t_k; the text set D can then be regarded as a vector space spanned by a set of orthogonal terms, forming the text representation model.
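As a minimal sketch of the representation in formula (1), a text can be held as a sparse mapping from features t_k to weights w_k; here raw term counts stand in for the weights, which step S3 replaces with those of formulas (2)-(4). The function name and toy tokens are assumptions for illustration:

    from collections import Counter

    def to_vector(tokens):
        """Represent a tokenized text d as a sparse {feature t_k: weight w_k} map."""
        return dict(Counter(tokens))

    # Toy example with an already-tokenized text.
    d = to_vector(["road", "repair", "road", "budget"])
    # -> {'road': 2, 'repair': 1, 'budget': 1}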
the step S3 specifically includes:
extracting text characteristics of the text representation model formed in the step 2, converting the text into a mathematical model, wherein the text D is often a high-dimensional space, and features need to be selected to select more representative features so as to achieve the purpose of reducing dimensions; in addition, each feature in the text space has a different importance level in each text vector, and the text features also need to be weighted.
Suppose that a text segment s in the text set D is composed of n ordered words, and is denoted as a word sequence w1,w2,…,wnIn the text feature representation method, the word bag method is assumed to be mutually independent one-dimensional features among words, so that the feature set of a text segment s can be represented as { w1,w2,…wnExpressing the weight formula as the weight w of the characteristic i in the text set DiThe formula is as follows:
Figure BDA0002669006130000061
the tf represents the number of times of the feature i appearing in the text set D, idef represents the text frequency of the feature i appearing in all the documents, N represents the total number of all the documents, and df represents the number of the documents containing the feature i;
the formula (2) reflects the distribution of all documents of the feature i in all classes, and cannot represent the additional information of the feature i in a certain class, so that the text in the file set D is randomly divided into two training set classes, the IDFs in the two training set classes are respectively calculated to localize the IDFs, and then the two values are subtracted to obtain the weight of the feature i in the text set D, which can be expressed as follows:
Figure BDA0002669006130000062
wherein N is1And N2Total number of documents, df, in each of the two training set classesi,1And dfi,2Respectively indicating the total number of documents containing the characteristic i in the two training sets; tf isiRepresenting the number of times the feature i appears in the text set D;
introduction of BM25 mode, wiThe representative model of (a) is as follows:
Figure BDA0002669006130000063
wherein the content of the first and second substances,
Figure BDA0002669006130000064
(k1and b takes a default value, k11.2, b 0.95), dl is the length of the document, and avg _ dl is the average length of the entire document.
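Under the reconstruction of formulas (3) and (4) above, the per-document weight of a feature could be computed as sketched below; the +1 smoothing against zero document frequencies is an added assumption, and the function name is illustrative:

    import math

    def delta_bm25_weight(tf_i, dl, avg_dl, n1, n2, df_i1, df_i2, k1=1.2, b=0.95):
        """Weight of feature i in one document, per formulas (3)-(4).

        tf_i         -- occurrences of feature i in the document
        dl, avg_dl   -- document length and average document length
        n1, n2       -- document counts of the two random training classes
        df_i1, df_i2 -- documents containing feature i in each class
        """
        # BM25 length normalization, K = k_1 * ((1 - b) + b * dl / avg_dl).
        K = k1 * ((1.0 - b) + b * dl / avg_dl)
        tf_norm = (k1 + 1.0) * tf_i / (K + tf_i)
        # Delta IDF of formula (3), with +1 smoothing (an assumption).
        delta_idf = math.log((n1 * (df_i2 + 1.0)) / (n2 * (df_i1 + 1.0)))
        return tf_norm * delta_idf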
As shown in fig. 3, step S4 specifically comprises the following steps:
First, the text samples processed in step S3 are split at punctuation marks into text segments to obtain a preprocessed text set, and a subject-word dictionary containing an initial subject-word class and a non-subject-word class is established;
Second, all texts of the preprocessed text set are scored and classified to obtain a determined classification set and an uncertain classification set, specifically:
(1) a score is computed for each text of the preprocessed text set using formula (4); texts with positive scores are labeled positive and texts with negative scores are labeled negative;
(2) Cmin = min(Cpositive, Cnegative) is computed, i.e., the texts labeled positive and the texts labeled negative are counted, and the smaller count is taken as the number of texts to place in the determined classification set, where Cpositive is the number of texts labeled positive and Cnegative is the number of texts labeled negative;
(3) at the same time, all texts of the preprocessed text set are sorted in descending order of the scores computed in step (1);
(4) polarity labeling: according to the sorting result of step (3) and the count Cmin computed in step (2), the Cmin highest-scoring texts and the Cmin lowest-scoring texts are taken from the sorted preprocessed text set; the Cmin highest-scoring texts are labeled positive and the Cmin lowest-scoring texts are labeled negative, forming the determined classification set, while the remaining texts are labeled uncertain, forming the uncertain classification set;
Third, all feature words in the texts of the determined classification set whose absolute word frequency is greater than 2 are added to the subject-word dictionary as candidate feature words, updating the dictionary, where the absolute word frequency is computed as
F = |F_p − F_n|
where F_p is the number of documents in the subject-word class in which the feature word appears, and F_n is the number of documents in the non-subject-word class in which the feature word appears;
Fourth, the texts of the uncertain classification set undergo the next round of classification calculation, entering an iterative process; when the subject-word dictionary is no longer expanded and the classification results no longer change, the iteration ends, yielding the final classification set and the proposal classification result.
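The iterative loop of step S4 could be sketched as follows, assuming pre-tokenized, whitespace-separated text; the simple dictionary-hit score stands in for the formula (4) scoring, and all names are illustrative rather than the patent's code:

    from collections import Counter

    def expand_dictionary(topic_dict, determined):
        """Add candidate feature words with absolute word frequency |Fp - Fn| > 2."""
        fp, fn = Counter(), Counter()
        for text, label in determined.items():
            words = set(text.split())
            (fp if label == "positive" else fn).update(words)
        added = False
        for w in set(fp) | set(fn):
            if abs(fp[w] - fn[w]) > 2 and w not in topic_dict:
                topic_dict[w] = "positive" if fp[w] > fn[w] else "negative"
                added = True
        return added

    def iterative_classify(texts, topic_dict):
        """Step S4 sketch: label the Cmin extremes each round, grow the
        dictionary, and stop once neither dictionary nor labels change."""
        def score(text):  # stand-in for the formula (4) feature scores
            return sum({"positive": 1, "negative": -1}.get(topic_dict.get(w), 0)
                       for w in text.split())

        determined, uncertain = {}, list(texts)
        changed = True
        while changed and uncertain:
            changed = False
            ranked = sorted(uncertain, key=score, reverse=True)
            c_min = min(sum(score(t) > 0 for t in ranked),
                        sum(score(t) < 0 for t in ranked))
            if c_min > 0:
                for t in ranked[:c_min]:
                    determined[t] = "positive"   # Cmin highest-scoring texts
                for t in ranked[-c_min:]:
                    determined[t] = "negative"   # Cmin lowest-scoring texts
                labeled = set(ranked[:c_min]) | set(ranked[-c_min:])
                uncertain = [t for t in uncertain if t not in labeled]
                changed = True
            # Dictionary growth feeds the next round's scoring.
            changed = expand_dictionary(topic_dict, determined) or changed
        return determined, uncertain

Each round labels only the Cmin most confident texts at either extreme, so the training set grows from high-precision decisions before the ambiguous remainder is revisited, mirroring the description above.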
The classification method of the invention not only achieves high classification accuracy on texts with a clear feature tendency, but also effectively classifies ambiguous texts, i.e., texts containing both positive and negative feature words: the portion classified with high accuracy is used as a training set, with which the remaining texts, those with intermediate scores and uncertain polarity, are effectively classified. Candidate feature words screened from the determined classification set are used to update the subject-word dictionary; the expanded dictionary helps classify more texts, and as the dictionary and the classification sets are updated repeatedly during iteration, the classification accuracy improves significantly.
The above description is only a preferred embodiment of the present invention and is not intended to limit the invention; various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, improvement, and the like shall be included within the protection scope of the present invention.

Claims (5)

1. A method for accurately classifying proposals, which specifically comprises the following steps:
S1, obtaining proposal text samples;
S2, establishing a text representation model;
S3, performing text feature extraction on the text representation model and calculating feature weights, to obtain the set of text samples to be classified;
S4, constructing a classifier model, automatically classifying the text samples, and finally obtaining the proposal classification result.
2. The method for accurately classifying proposals according to claim 1, wherein step S1 is specifically:
acquiring proposal text samples, and performing data cleaning and noise-data removal on the text samples to obtain clean text samples.
3. The method for accurately classifying proposals according to claim 1, wherein step S2 specifically comprises:
in the vector space model, each text d in a text set D is represented as a vector
d = ((t_1:w_1), (t_2:w_2), …, (t_k:w_k), …, (t_n:w_n))    (1)
where t_k (k = 1, …, n) is a feature of the document space and w_k is the weight of t_k; the text set D can be regarded as a vector space spanned by a set of orthogonal terms, constituting the text representation model.
4. The method for accurately classifying proposals according to claim 1, wherein step S3 specifically comprises:
suppose a text segment s in the text set D consists of n ordered words, denoted as the word sequence w_1, w_2, …, w_n; under the bag-of-words representation, the words are treated as mutually independent one-dimensional features, so the feature set of the text segment s can be represented as {w_1, w_2, …, w_n}, and the weight w_i of feature i in the text set D is:
w_i = tf_i × idf_i = tf_i × log(N / df_i)    (2)
where tf_i is the number of times feature i appears in the text set D, idf_i is the inverse document frequency of feature i over all documents, N is the total number of documents, and df_i is the number of documents containing feature i;
formula (2) reflects the distribution of feature i over all documents in all classes, but cannot express class-specific information about feature i; therefore, the texts in the set D are randomly divided into two training classes, the IDF is computed separately within each class to localize it, and the two values are subtracted, giving the weight of feature i in the text set D as:
w_i = tf_i × [log(N_1 / df_{i,1}) − log(N_2 / df_{i,2})] = tf_i × log((N_1 × df_{i,2}) / (N_2 × df_{i,1}))    (3)
where N_1 and N_2 are the total numbers of documents in the two training classes, df_{i,1} and df_{i,2} are the numbers of documents containing feature i in each class, and tf_i is the number of times feature i appears in the text set D;
introducing the BM25 weighting scheme, w_i is modeled as:
w_i = ((k_1 + 1) × tf_i) / (K + tf_i) × log((N_1 × df_{i,2}) / (N_2 × df_{i,1}))    (4)
where
K = k_1 × ((1 − b) + b × dl / avg_dl)
and k_1 and b take their default values (k_1 = 1.2, b = 0.95), dl is the length of the document, and avg_dl is the average length over all documents.
5. The method for accurately classifying proposals according to claim 1, wherein step S4 specifically comprises the following steps:
first, the text samples processed in step S3 are split at punctuation marks into text segments to obtain a preprocessed text set, and a subject-word dictionary containing an initial subject-word class and a non-subject-word class is established;
second, all texts of the preprocessed text set are scored and classified to obtain a determined classification set and an uncertain classification set, specifically:
(1) a score is computed for each text of the preprocessed text set using formula (4); texts with positive scores are labeled positive and texts with negative scores are labeled negative;
(2) Cmin = min(Cpositive, Cnegative) is computed, i.e., the texts labeled positive and the texts labeled negative are counted, and the smaller count is taken as the number of texts to place in the determined classification set, where Cpositive is the number of texts labeled positive and Cnegative is the number of texts labeled negative;
(3) at the same time, all texts of the preprocessed text set are sorted in descending order of the scores computed in step (1);
(4) polarity labeling: according to the sorting result of step (3) and the count Cmin computed in step (2), the Cmin highest-scoring texts and the Cmin lowest-scoring texts are taken from the sorted preprocessed text set; the Cmin highest-scoring texts are labeled positive and the Cmin lowest-scoring texts are labeled negative, forming the determined classification set, while the remaining texts are labeled uncertain, forming the uncertain classification set;
third, all feature words in the texts of the determined classification set whose absolute word frequency is greater than 2 are added to the subject-word dictionary as candidate feature words, updating the dictionary, where the absolute word frequency is computed as
F = |F_p − F_n|
where F_p is the number of documents in the subject-word class in which the feature word appears, and F_n is the number of documents in the non-subject-word class in which the feature word appears;
fourth, the texts of the uncertain classification set undergo the next round of classification calculation, entering an iterative process; when the subject-word dictionary is no longer expanded and the classification results no longer change, the iteration ends, yielding the final classification set and the proposal classification result.
CN202010927607.XA 2020-09-07 2020-09-07 Method for accurately classifying proposal Pending CN112000807A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010927607.XA CN112000807A (en) 2020-09-07 2020-09-07 Method for accurately classifying proposal

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010927607.XA CN112000807A (en) 2020-09-07 2020-09-07 Method for accurately classifying proposal

Publications (1)

Publication Number Publication Date
CN112000807A true CN112000807A (en) 2020-11-27

Family

ID=73469083

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010927607.XA Pending CN112000807A (en) 2020-09-07 2020-09-07 Method for accurately classifying proposal

Country Status (1)

Country Link
CN (1) CN112000807A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113468326A (en) * 2021-06-16 2021-10-01 北京明略软件系统有限公司 Method and device for determining document classification
CN117093716A (en) * 2023-10-19 2023-11-21 湖南正宇软件技术开发有限公司 Proposed automatic classification method, device, computer equipment and storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104866606A (en) * 2015-06-02 2015-08-26 浙江师范大学 MapReduce parallel big data text classification method
CN105205124A (en) * 2015-09-11 2015-12-30 合肥工业大学 Semi-supervised text sentiment classification method based on random feature subspace
WO2017113232A1 (en) * 2015-12-30 2017-07-06 中国科学院深圳先进技术研究院 Product classification method and apparatus based on deep learning
CN111104510A (en) * 2019-11-15 2020-05-05 南京中新赛克科技有限责任公司 Word embedding-based text classification training sample expansion method

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104866606A (en) * 2015-06-02 2015-08-26 浙江师范大学 MapReduce parallel big data text classification method
CN105205124A (en) * 2015-09-11 2015-12-30 合肥工业大学 Semi-supervised text sentiment classification method based on random feature subspace
WO2017113232A1 (en) * 2015-12-30 2017-07-06 中国科学院深圳先进技术研究院 Product classification method and apparatus based on deep learning
CN111104510A (en) * 2019-11-15 2020-05-05 南京中新赛克科技有限责任公司 Word embedding-based text classification training sample expansion method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
王振浩: "Text Sentiment Classification Combining a Sentiment Dictionary with Machine Learning" (基于情感字典与机器学习相结合的文本情感分类), China Master's Theses Full-text Database, pages 138 - 2607 *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113468326A (en) * 2021-06-16 2021-10-01 北京明略软件系统有限公司 Method and device for determining document classification
CN117093716A (en) * 2023-10-19 2023-11-21 湖南正宇软件技术开发有限公司 Proposed automatic classification method, device, computer equipment and storage medium
CN117093716B (en) * 2023-10-19 2023-12-26 湖南正宇软件技术开发有限公司 Proposed automatic classification method, device, computer equipment and storage medium

Similar Documents

Publication Publication Date Title
CN109189942B (en) Construction method and device of patent data knowledge graph
CN107609121B (en) News text classification method based on LDA and word2vec algorithm
CN109783639B (en) Mediated case intelligent dispatching method and system based on feature extraction
CN108897784B (en) Emergency multidimensional analysis system based on social media
CN111414479A (en) Label extraction method based on short text clustering technology
CN104881458B (en) A kind of mask method and device of Web page subject
US20040267686A1 (en) News group clustering based on cross-post graph
CN106156372B (en) A kind of classification method and device of internet site
CN112395395B (en) Text keyword extraction method, device, equipment and storage medium
CN107145516B (en) Text clustering method and system
CN106909669B (en) Method and device for detecting promotion information
CN109902289A (en) A kind of news video topic division method towards fuzzy text mining
CN111767725A (en) Data processing method and device based on emotion polarity analysis model
CN105512333A (en) Product comment theme searching method based on emotional tendency
CN115796181A (en) Text relation extraction method for chemical field
CN106844482B (en) Search engine-based retrieval information matching method and device
CN112000807A (en) Method for accurately classifying proposal
CN111626050A (en) Microblog emotion analysis method based on expression dictionary and emotion common sense
CN111460158A (en) Microblog topic public emotion prediction method based on emotion analysis
Rochmawati et al. Opinion analysis on Rohingya using Twitter data
CN115544348A (en) Intelligent mass information searching system based on Internet big data
CN114757302A (en) Clustering method system for text processing
CN104866606A (en) MapReduce parallel big data text classification method
CN111475607B (en) Web data clustering method based on Mashup service function feature representation and density peak detection
CN106294689B (en) A kind of method and apparatus for selecting to carry out dimensionality reduction based on text category feature

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination