CN112000807A - Method for accurately classifying proposal - Google Patents
Method for accurately classifying proposal
- Publication number
- CN112000807A (application number CN202010927607.XA)
- Authority
- CN
- China
- Prior art keywords
- text
- feature
- texts
- classification
- proposal
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Images
Classifications
- G06F16/35 — Information retrieval of unstructured textual data; Clustering; Classification
- G06F16/215 — Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
- G06F16/23 — Updating (databases of structured data)
- G06F40/216 — Natural language analysis; Parsing using statistical methods
- G06F40/242 — Lexical tools; Dictionaries
Abstract
The invention relates to a method for accurately classifying proposals, which comprises the following steps: S1, obtaining proposal text samples; S2, establishing a text representation model; S3, performing text feature extraction on the text representation model and calculating feature weights to obtain the set of text samples to be classified; S4, constructing a classifier model and automatically classifying the text samples to obtain the final proposal classification result. By constructing a text representation model, a text feature extraction model and a classifier model, the method automatically extracts a large amount of useful information from massive unstructured texts, improves the efficiency and accuracy of proposal classification, and provides technical support for follow-up work such as managing, counting, querying and analyzing proposals.
Description
Technical Field
The invention relates to the technical field of proposal classification algorithms, and in particular to a method for accurately classifying proposals.
Background
In current work on deputies' proposals, information management of each work link has basically been realized, which has played an active role in improving the efficiency of proposal processing. As this work continues to deepen, new informatization means are urgently needed to improve, by scientific methods, the efficiency of work links such as business processing (e.g. proposal review and proposal classification) and data analysis under the traditional management mode. Text classification methods in the prior art have low accuracy, which results in low working efficiency.
Disclosure of Invention
In order to solve the above problems, an object of the present invention is to provide a method for accurately classifying proposals. By constructing a text representation model, a text feature extraction model and a classifier model, the method automatically extracts a large amount of useful information from massive unstructured texts, improves the efficiency and accuracy of proposal classification, and provides technical support for follow-up work such as managing, counting, querying and analyzing proposals.
The above object of the present invention is achieved by the following technical solutions:
A method for accurately classifying proposals, comprising the following steps:
S1, obtaining proposal text samples;
S2, establishing a text representation model;
S3, performing text feature extraction on the text representation model and calculating feature weights to obtain the set of text samples to be classified;
S4, constructing a classifier model and automatically classifying the text samples to obtain the final proposal classification result.
Further, the step S1 is specifically:
Proposal text samples are obtained, and data cleaning and noise removal are performed on the samples to obtain clean text samples.
Further, the step S2 specifically includes:
In the vector space model, each text d in a text set D is represented as a vector
d = ((t1:w1), (t2:w2), …, (tk:wk), …, (tn:wn)) (1)
where tk (k = 1, …, n) is a feature of the document space and wk is the weight of tk; the text set D can then be regarded as a vector space spanned by a set of orthogonal terms, which constitutes the text representation model.
further, the step S3 specifically includes:
Suppose a text segment s in the text set D is composed of n ordered words, denoted as the word sequence w1, w2, …, wn. Under the bag-of-words representation, the words are treated as mutually independent one-dimensional features, so the feature set of the text segment s can be represented as {w1, w2, …, wn}. The weight wi of a feature i in the text set D is calculated as:
wi = tfi × idfi = tfi × log(N / dfi) (2)
where tfi is the number of times feature i appears in the text set D, idfi is the inverse document frequency of feature i over all documents, N is the total number of documents, and dfi is the number of documents containing feature i;
Formula (2) reflects the distribution of feature i over the documents of all classes, but cannot capture the additional information feature i carries within a particular class. The texts in the set D are therefore randomly divided into two training subsets, the IDF is calculated separately within each subset to localize it, and the two values are subtracted to obtain the weight of feature i in the text set D:
wi = tfi × (log(N1 / dfi,1) - log(N2 / dfi,2)) (3)
where N1 and N2 are the total numbers of documents in the two training subsets, dfi,1 and dfi,2 are the numbers of documents in each subset that contain feature i, and tfi is the number of times feature i appears in the text set D;
Introducing the BM25 model, wi is represented as:
wi = idfi × tfi × (k1 + 1) / (tfi + K), with K = k1 × (1 - b + b × dl / avg_dl) (4)
where k1 and b take the default values k1 = 1.2 and b = 0.95, dl is the length of the document, and avg_dl is the average length over all documents.
Further, the step S4 specifically includes the following steps:
Firstly, the text samples processed in step S3 are split at punctuation marks into text segments to obtain a preprocessed text set, and a subject-word dictionary comprising an initial subject-word class and a non-subject-word class is established;
Secondly, classification calculation is performed on the preprocessed text set to obtain a determined classification set and an uncertain classification set, specifically:
(1) the feature score of each text in the preprocessed text set is calculated with formula (4); texts with a positive score are marked as positive, and texts with a negative score are marked as negative;
(2) Cmin = min(Cpositive, Cnegative) is calculated, where Cpositive is the number of texts marked positive and Cnegative is the number of texts marked negative; Cmin is taken as the number of texts per polarity to be placed in the determined classification set;
(3) meanwhile, all texts in the preprocessed text set are sorted in descending order of the feature scores calculated in step (1);
(4) feature polarity labelling: according to the score ranking of step (3) and the number Cmin calculated in step (2), the Cmin highest-scoring texts are taken from the sorted set and labelled positive, and the Cmin lowest-scoring texts are labelled negative, together forming the determined classification set; the remaining texts are labelled uncertain and form the uncertain classification set;
Thirdly, every feature word in the texts of the determined classification set whose absolute word frequency is greater than 2 is added to the subject-word dictionary as a candidate feature word, updating the dictionary; the absolute word frequency is calculated as F = |Fp - Fn|, where Fp is the number of documents of the subject-word class in which the feature word appears and Fn is the number of documents of the non-subject-word class in which it appears;
Fourthly, the next round of classification calculation is performed on the texts of the uncertain classification set, entering an iterative process; when the subject-word dictionary is no longer expanded and the classification result no longer changes, the iteration ends, yielding the final classification set and the proposal classification result.
The invention has the beneficial effects that: the classification method not only achieves high classification accuracy on texts with a clear feature tendency, but also effectively handles ambiguous texts, i.e. texts containing both positive and negative feature words: by taking the portion classified with high accuracy as a training set, the remaining texts with intermediate scores and uncertain feature polarity are classified effectively. Candidate feature words screened from the determined classification set are used to update the subject-word dictionary; the expanded dictionary helps classify more texts, and as the dictionary and the classification sets are repeatedly updated during iteration, the classification accuracy improves significantly.
By constructing a text representation model, a text feature extraction model and a classifier model, the method automatically extracts a large amount of useful information from massive unstructured texts, improves the efficiency and accuracy of proposal classification, and provides technical support for follow-up work such as managing, counting, querying and analyzing proposals.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the invention and together with the description serve to explain the invention without limiting the invention.
FIG. 1 is a flow chart of the present invention;
FIG. 2 is a schematic diagram of the data preprocessing of step S1;
FIG. 3 is a flowchart of the classifier of step S4.
Detailed Description
The details and embodiments of the present invention are further described with reference to the accompanying drawings and the following embodiments.
Examples
Referring to fig. 1, the present embodiment provides a method for accurately classifying proposals, which specifically includes the following steps:
S1, obtaining proposal text samples;
S2, establishing a text representation model;
S3, performing text feature extraction on the text representation model and calculating feature weights to obtain the set of text samples to be classified;
S4, constructing a classifier model and automatically classifying the text samples to obtain the final proposal classification result.
The step S1 specifically includes:
A proposal text sample set is obtained. The collected data may exhibit various problems, such as missing pre-classification, missing topic titles, or garbled characters, all of which negatively affect the classification effect, so the data must be cleaned and filtered. As shown in fig. 2, the text samples are cleaned and noise data is removed to obtain clean text samples.
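As a rough illustration of this cleaning step, the snippet below drops empty records and records dominated by garbled (replacement) characters, and collapses stray whitespace; the specific heuristics and the function name clean_samples are assumptions made for illustration, not the patent's exact rules:

```python
import re

def clean_samples(samples):
    """Filter proposal text samples: drop empty or garbled records and
    normalize whitespace (illustrative heuristics, not the patent's rules)."""
    cleaned = []
    for text in samples:
        text = text.strip()
        if not text:
            continue  # drop empty records
        # Drop records dominated by Unicode replacement (mojibake) characters.
        if text.count("\ufffd") / len(text) > 0.1:
            continue
        # Collapse runs of whitespace left over from extraction.
        cleaned.append(re.sub(r"\s+", " ", text))
    return cleaned
```

For example, `clean_samples(["  a  proposal ", "", "\ufffd\ufffd"])` keeps only the first record, normalized to single spaces.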
The step S2 specifically includes:
In the vector space model, each text d in a text set D is represented as a vector
d = ((t1:w1), (t2:w2), …, (tk:wk), …, (tn:wn)) (1)
where tk (k = 1, …, n) is a feature of the document space and wk is the weight of tk; the text set D can then be regarded as a vector space spanned by a set of orthogonal terms, which constitutes the text representation model.
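The representation of formula (1) can be sketched as follows; here raw term frequency stands in for the weight wk (step S3 later replaces it with the weights of formulas (2)-(4)), and the function name to_vector is illustrative:

```python
from collections import Counter

def to_vector(tokens):
    """Represent a tokenized text d as ((t1, w1), ..., (tn, wn)) per
    formula (1), using raw term frequency as a placeholder weight."""
    counts = Counter(tokens)
    # Sort features for a deterministic vector layout.
    return tuple(sorted(counts.items()))
```

E.g. the token list ["road", "bus", "road"] becomes (("bus", 1), ("road", 2)).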
the step S3 specifically includes:
Text features are extracted from the text representation model formed in step S2, converting the text into a mathematical model. The text set D is typically a high-dimensional space, so feature selection is needed to retain the more representative features and thereby reduce the dimensionality; in addition, since each feature has a different degree of importance in each text vector, the text features must also be weighted.
Suppose a text segment s in the text set D is composed of n ordered words, denoted as the word sequence w1, w2, …, wn. Under the bag-of-words representation, the words are treated as mutually independent one-dimensional features, so the feature set of the text segment s can be represented as {w1, w2, …, wn}. The weight wi of a feature i in the text set D is calculated as:
wi = tfi × idfi = tfi × log(N / dfi) (2)
where tfi is the number of times feature i appears in the text set D, idfi is the inverse document frequency of feature i over all documents, N is the total number of documents, and dfi is the number of documents containing feature i;
Formula (2) reflects the distribution of feature i over the documents of all classes, but cannot capture the additional information feature i carries within a particular class. The texts in the set D are therefore randomly divided into two training subsets, the IDF is calculated separately within each subset to localize it, and the two values are subtracted to obtain the weight of feature i in the text set D:
wi = tfi × (log(N1 / dfi,1) - log(N2 / dfi,2)) (3)
where N1 and N2 are the total numbers of documents in the two training subsets, dfi,1 and dfi,2 are the numbers of documents in each subset that contain feature i, and tfi is the number of times feature i appears in the text set D;
Introducing the BM25 model, wi is represented as:
wi = idfi × tfi × (k1 + 1) / (tfi + K), with K = k1 × (1 - b + b × dl / avg_dl) (4)
where k1 and b take the default values k1 = 1.2 and b = 0.95, dl is the length of the document, and avg_dl is the average length over all documents.
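The three weighting schemes above can be sketched directly from formulas (2)-(4). The function names are illustrative, and the plain log(N/df) form of the IDF is assumed here (some BM25 variants use a smoothed IDF instead):

```python
import math

def idf(N, df):
    # Inverse document frequency log(N / df), as in formula (2).
    return math.log(N / df)

def tfidf_weight(tf, N, df):
    # Formula (2): wi = tfi * log(N / dfi).
    return tf * idf(N, df)

def localized_weight(tf, N1, df1, N2, df2):
    # Formula (3): subtract the IDFs computed on two random training subsets.
    return tf * (idf(N1, df1) - idf(N2, df2))

def bm25_weight(tf, N, df, dl, avg_dl, k1=1.2, b=0.95):
    # Formula (4): BM25-style weight; k1 and b use the text's default values.
    K = k1 * (1 - b + b * dl / avg_dl)
    return idf(N, df) * tf * (k1 + 1) / (tf + K)
```

Note that for a document of average length (dl = avg_dl) the length normalization K reduces to k1, so the BM25 weight saturates smoothly in tf rather than growing linearly as in formula (2).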
As shown in fig. 3, the step S4 specifically includes the following steps:
Firstly, the text samples processed in step S3 are split at punctuation marks into text segments to obtain a preprocessed text set, and a subject-word dictionary comprising an initial subject-word class and a non-subject-word class is established;
Secondly, classification calculation is performed on the preprocessed text set to obtain a determined classification set and an uncertain classification set, specifically:
(1) the feature score of each text in the preprocessed text set is calculated with formula (4); texts with a positive score are marked as positive, and texts with a negative score are marked as negative;
(2) Cmin = min(Cpositive, Cnegative) is calculated, where Cpositive is the number of texts marked positive and Cnegative is the number of texts marked negative; Cmin is taken as the number of texts per polarity to be placed in the determined classification set;
(3) meanwhile, all texts in the preprocessed text set are sorted in descending order of the feature scores calculated in step (1);
(4) feature polarity labelling: according to the score ranking of step (3) and the number Cmin calculated in step (2), the Cmin highest-scoring texts are taken from the sorted set and labelled positive, and the Cmin lowest-scoring texts are labelled negative, together forming the determined classification set; the remaining texts are labelled uncertain and form the uncertain classification set;
Thirdly, every feature word in the texts of the determined classification set whose absolute word frequency is greater than 2 is added to the subject-word dictionary as a candidate feature word, updating the dictionary; the absolute word frequency is calculated as F = |Fp - Fn|, where Fp is the number of documents of the subject-word class in which the feature word appears and Fn is the number of documents of the non-subject-word class in which it appears;
Fourthly, the next round of classification calculation is performed on the texts of the uncertain classification set, entering an iterative process; when the subject-word dictionary is no longer expanded and the classification result no longer changes, the iteration ends, yielding the final classification set and the proposal classification result.
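The iterative procedure of step S4 can be sketched as follows. This is a simplified illustration under stated assumptions: texts are scored by counting dictionary hits rather than by formula (4), and each confidently labelled text contributes all of its words to the dictionary rather than only those with absolute word frequency greater than 2; all names are illustrative:

```python
def iterative_classify(texts, topic_words, non_topic_words, max_iter=20):
    """Sketch of the iterative self-training loop: score undecided texts
    against the dictionary, move the Cmin most confident texts per polarity
    into the determined set, expand the dictionary from them, and repeat."""
    topic, non_topic = set(topic_words), set(non_topic_words)
    labels = {}                       # text index -> "positive" / "negative"
    undecided = list(range(len(texts)))
    for _ in range(max_iter):
        # Score each undecided text by dictionary hits (simplified stand-in
        # for the formula (4) score).
        scores = []
        for i in undecided:
            words = texts[i].split()
            s = sum(w in topic for w in words) - sum(w in non_topic for w in words)
            scores.append((s, i))
        pos = sum(1 for s, _ in scores if s > 0)
        neg = sum(1 for s, _ in scores if s < 0)
        cmin = min(pos, neg)          # Cmin = min(Cpositive, Cnegative)
        if cmin == 0:
            break                     # nothing more can be decided confidently
        scores.sort(reverse=True)
        # The Cmin highest-scoring texts are positive, the Cmin lowest negative.
        decided = [(i, "positive") for _, i in scores[:cmin]]
        decided += [(i, "negative") for _, i in scores[-cmin:]]
        for i, label in decided:
            labels[i] = label
            undecided.remove(i)
            # Expand the dictionary from confidently labelled texts.
            target = topic if label == "positive" else non_topic
            target.update(texts[i].split())
    return labels
```

Starting from a single seed word per class, the first pass decides only the clearest texts; the expanded dictionary then lets later passes decide texts that initially scored zero, mirroring the iterative behaviour described above.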
The classification method not only achieves high classification accuracy on texts with a clear feature tendency, but also effectively handles ambiguous texts, i.e. texts containing both positive and negative feature words: by taking the portion classified with high accuracy as a training set, the remaining texts with intermediate scores and uncertain feature polarity are classified effectively. Candidate feature words screened from the determined classification set are used to update the subject-word dictionary; the expanded dictionary helps classify more texts, and as the dictionary and the classification sets are repeatedly updated during iteration, the classification accuracy improves significantly.
The above description is only a preferred embodiment of the present invention and is not intended to limit the invention; various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, improvement and the like shall be included in the protection scope of the present invention.
Claims (5)
1. A method for accurately classifying proposals, characterized by comprising the following steps:
S1, obtaining proposal text samples;
S2, establishing a text representation model;
S3, performing text feature extraction on the text representation model and calculating feature weights to obtain the set of text samples to be classified;
S4, constructing a classifier model and automatically classifying the text samples to obtain the final proposal classification result.
2. The method for accurately classifying proposals according to claim 1, wherein the step S1 specifically comprises:
obtaining proposal text samples, and performing data cleaning and noise removal on the samples to obtain clean text samples.
3. The method for accurately classifying proposals according to claim 1, wherein the step S2 specifically comprises:
in the vector space model, each text d in a text set D is represented as a vector
d = ((t1:w1), (t2:w2), …, (tk:wk), …, (tn:wn)) (1)
where tk (k = 1, …, n) is a feature of the document space and wk is the weight of tk; the text set D can be regarded as a vector space spanned by a set of orthogonal terms, constituting the text representation model.
4. The method for accurately classifying proposals according to claim 1, wherein the step S3 specifically comprises:
suppose a text segment s in the text set D is composed of n ordered words, denoted as the word sequence w1, w2, …, wn. Under the bag-of-words representation, the words are treated as mutually independent one-dimensional features, so the feature set of the text segment s can be represented as {w1, w2, …, wn}. The weight wi of a feature i in the text set D is calculated as:
wi = tfi × idfi = tfi × log(N / dfi) (2)
where tfi is the number of times feature i appears in the text set D, idfi is the inverse document frequency of feature i over all documents, N is the total number of documents, and dfi is the number of documents containing feature i;
formula (2) reflects the distribution of feature i over the documents of all classes, but cannot capture the additional information feature i carries within a particular class; the texts in the set D are therefore randomly divided into two training subsets, the IDF is calculated separately within each subset to localize it, and the two values are subtracted to obtain the weight of feature i in the text set D:
wi = tfi × (log(N1 / dfi,1) - log(N2 / dfi,2)) (3)
where N1 and N2 are the total numbers of documents in the two training subsets, dfi,1 and dfi,2 are the numbers of documents in each subset that contain feature i, and tfi is the number of times feature i appears in the text set D;
introducing the BM25 model, wi is represented as:
wi = idfi × tfi × (k1 + 1) / (tfi + K), with K = k1 × (1 - b + b × dl / avg_dl) (4)
where k1 = 1.2 and b = 0.95 by default, dl is the length of the document, and avg_dl is the average length over all documents.
5. The method for accurately classifying proposals according to claim 1, wherein the step S4 specifically comprises the following steps:
firstly, the text samples processed in step S3 are split at punctuation marks into text segments to obtain a preprocessed text set, and a subject-word dictionary comprising an initial subject-word class and a non-subject-word class is established;
secondly, classification calculation is performed on the preprocessed text set to obtain a determined classification set and an uncertain classification set, specifically:
(1) the feature score of each text in the preprocessed text set is calculated with formula (4); texts with a positive score are marked as positive, and texts with a negative score are marked as negative;
(2) Cmin = min(Cpositive, Cnegative) is calculated, where Cpositive is the number of texts marked positive and Cnegative is the number of texts marked negative; Cmin is taken as the number of texts per polarity to be placed in the determined classification set;
(3) meanwhile, all texts in the preprocessed text set are sorted in descending order of the feature scores calculated in step (1);
(4) feature polarity labelling: according to the score ranking of step (3) and the number Cmin calculated in step (2), the Cmin highest-scoring texts are taken from the sorted set and labelled positive, and the Cmin lowest-scoring texts are labelled negative, together forming the determined classification set; the remaining texts are labelled uncertain and form the uncertain classification set;
thirdly, every feature word in the texts of the determined classification set whose absolute word frequency is greater than 2 is added to the subject-word dictionary as a candidate feature word, updating the dictionary; the absolute word frequency is calculated as F = |Fp - Fn|, where Fp is the number of documents of the subject-word class in which the feature word appears and Fn is the number of documents of the non-subject-word class in which it appears;
fourthly, the next round of classification calculation is performed on the texts of the uncertain classification set, entering an iterative process; when the subject-word dictionary is no longer expanded and the classification result no longer changes, the iteration ends, yielding the final classification set and the proposal classification result.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010927607.XA CN112000807A (en) | 2020-09-07 | 2020-09-07 | Method for accurately classifying proposal |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010927607.XA CN112000807A (en) | 2020-09-07 | 2020-09-07 | Method for accurately classifying proposal |
Publications (1)
Publication Number | Publication Date |
---|---|
CN112000807A true CN112000807A (en) | 2020-11-27 |
Family
ID=73469083
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010927607.XA Pending CN112000807A (en) | 2020-09-07 | 2020-09-07 | Method for accurately classifying proposal |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112000807A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113468326A (en) * | 2021-06-16 | 2021-10-01 | 北京明略软件系统有限公司 | Method and device for determining document classification |
CN117093716A (en) * | 2023-10-19 | 2023-11-21 | 湖南正宇软件技术开发有限公司 | Proposed automatic classification method, device, computer equipment and storage medium |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104866606A (en) * | 2015-06-02 | 2015-08-26 | 浙江师范大学 | MapReduce parallel big data text classification method |
CN105205124A (en) * | 2015-09-11 | 2015-12-30 | 合肥工业大学 | Semi-supervised text sentiment classification method based on random feature subspace |
WO2017113232A1 (en) * | 2015-12-30 | 2017-07-06 | 中国科学院深圳先进技术研究院 | Product classification method and apparatus based on deep learning |
CN111104510A (en) * | 2019-11-15 | 2020-05-05 | 南京中新赛克科技有限责任公司 | Word embedding-based text classification training sample expansion method |
- 2020-09-07: application CN202010927607.XA filed; published as CN112000807A; status Pending
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104866606A (en) * | 2015-06-02 | 2015-08-26 | 浙江师范大学 | MapReduce parallel big data text classification method |
CN105205124A (en) * | 2015-09-11 | 2015-12-30 | 合肥工业大学 | Semi-supervised text sentiment classification method based on random feature subspace |
WO2017113232A1 (en) * | 2015-12-30 | 2017-07-06 | 中国科学院深圳先进技术研究院 | Product classification method and apparatus based on deep learning |
CN111104510A (en) * | 2019-11-15 | 2020-05-05 | 南京中新赛克科技有限责任公司 | Word embedding-based text classification training sample expansion method |
Non-Patent Citations (1)
Title |
---|
Wang Zhenhao: "Text sentiment classification combining a sentiment dictionary with machine learning", China Master's Theses Full-text Database, pages 138 - 2607 *
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113468326A (en) * | 2021-06-16 | 2021-10-01 | 北京明略软件系统有限公司 | Method and device for determining document classification |
CN117093716A (en) * | 2023-10-19 | 2023-11-21 | 湖南正宇软件技术开发有限公司 | Proposed automatic classification method, device, computer equipment and storage medium |
CN117093716B (en) * | 2023-10-19 | 2023-12-26 | 湖南正宇软件技术开发有限公司 | Proposed automatic classification method, device, computer equipment and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||