CN112000807A

CN112000807A - Method for accurately classifying proposal

Info

Publication number: CN112000807A
Application number: CN202010927607.XA
Authority: CN
Inventors: 恒晓楠; 刘永亮
Original assignee: Liaoning Guonuo Technology Co ltd
Current assignee: Liaoning Guonuo Technology Co ltd
Priority date: 2020-09-07
Filing date: 2020-09-07
Publication date: 2020-11-27

Abstract

The invention relates to a method for accurately classifying proposal proposals, which comprises the following steps: s1, obtaining a proposal text sample; s2, establishing a text representation model; s3, performing text feature extraction on the text representation model and calculating weight to obtain a text sample set to be classified; s4, constructing a classifier model, automatically classifying the text samples, and finally obtaining a classification result of the proposal; according to the method, a text representation model, a text feature extraction model and a classifier model are constructed, a large amount of useful information is automatically obtained from massive unstructured texts, the classification efficiency and the accuracy of the proposal are improved, and technical support is provided for the follow-up work of managing, counting, inquiring, analyzing and the like of the proposal.

Description

Method for accurately classifying proposal

Technical Field

The invention relates to the technical field of protocol classification algorithms, in particular to a method for accurately classifying proposal suggestions.

Background

In the current work of the representative proposal, the informatization management of each work link of the representative proposal is basically realized, and the method plays an active role in improving the efficiency of the proposal processing work in a certain period. With the continuous deepening of the work of the representative protocol, how to improve the work efficiency of the work links (such as protocol review, protocol classification, etc.) of business processing, data analysis, etc. in the traditional management mode by means of scientific means, a new informatization means is urgently needed for assistance in solution. The text classification method in the prior art has low accuracy, so that the working efficiency is low.

Disclosure of Invention

In order to solve the above problems, an object of the present invention is to provide a proposed precise classification method. According to the method, a text representation model, a text feature extraction model and a classifier model are constructed, a large amount of useful information is automatically obtained from massive unstructured texts, the classification efficiency and the accuracy of the proposal are improved, and technical support is provided for the follow-up work of managing, counting, inquiring, analyzing and the like of the proposal.

The above object of the present invention is achieved by the following technical solutions:

a proposed precise classification method specifically comprises the following steps:

s1, obtaining a proposal text sample;

s2, establishing a text representation model;

s3, performing text feature extraction on the text representation model and calculating weight to obtain a text sample set to be classified;

and S4, constructing a classifier model, automatically classifying the text samples, and finally obtaining a classification result of the proposed proposal.

Further, the step S1 is specifically:

and acquiring a suggested proposal text sample, and performing data cleaning and noise data removal on the text sample to obtain a clean text sample.

Further, the step S2 specifically includes:

in the vector space model, for a text set D, each text D is represented as a vector

d＝((t1:w1),(t2:w2),…,(tk:wk),…,(tn:wn)) (1)

Where tk (k is 1, …, n) is a feature of the document space, wk is a weight of tk, and the text set D may be regarded as a vector space formed by a set of orthogonal words to form a text representation model;

further, the step S3 specifically includes:

suppose that a text segment s in the text set D is composed of n ordered words, and is denoted as a word sequence w₁,w₂,…,w_nIn the text feature representation method, the word bag method is assumed to be mutually independent one-dimensional features among words, so that the feature set of a text segment s can be represented as { w₁,w₂,…w_nExpressing the weight formula as the weight w of the characteristic i in the text set D_iThe formula is as follows:

the tf represents the number of times of the feature i appearing in the text set D, idef represents the text frequency of the feature i appearing in all the documents, N represents the total number of all the documents, and df represents the number of the documents containing the feature i;

the formula (2) reflects the distribution of all documents of the feature i in all classes, and cannot represent the additional information of the feature i in a certain class, so that the text in the file set D is randomly divided into two training set classes, the IDFs in the two training set classes are respectively calculated to localize the IDFs, and then the two values are subtracted to obtain the weight of the feature i in the text set D, which can be expressed as follows:

wherein N is₁And N₂Total number of documents, df, in each of the two training set classes_i,1And df_i,2Respectively indicating the total number of documents containing the characteristic i in the two training sets; tf is_iRepresenting the appearance of a feature i in a text collection DThe number of times;

introduction of BM25 mode, w_iThe representative model of (a) is as follows:

wherein the content of the first and second substances,

(k₁and b takes a default value, k₁1.2, b 0.95), dl is the length of the document, and avg _ dl is the average length of the entire document.

Further, the step S4 specifically includes the following steps:

firstly, dividing punctuations in the text sample processed in the step S3 into text segments to obtain a preprocessed text set, and establishing a subject word dictionary comprising an initial subject part of speech and a non-subject part of speech;

classifying and calculating all the characteristics of the preprocessed text set to obtain a determined classification set and an uncertain classification set, specifically:

(1) calculating the feature score of all the features of the preprocessed text set by adopting a formula (4), and if the score is regular, marking the score as positive, and if the score is negative, marking the score as negative;

(2) calculating Cmin (Cnegtive), namely calculating the number of the texts marked as positive and the number of the texts marked as negative, wherein the number of the texts is used as the number of the texts to be taken in the determined classification set; wherein Cpositive represents the number of texts marked as positive and Cnegative represents the number of texts marked as negative;

(3) meanwhile, sorting all the features of the preprocessed text set from big to small according to the feature scores obtained by calculation in the step (1);

(4) characteristic polarity labeling: according to the feature score sorting result of the step (3), according to the text quantity Cmin obtained by calculation in the step (2), Cmin texts with the highest scores and Cmin texts with the lowest scores are obtained from the sorted preprocessed text set, the Cmin texts with the highest scores are marked as positive, the Cmin texts with the lowest scores are marked as negative, a determined classification set is formed, and the rest texts are marked as uncertain, so that an uncertain classification set is formed;

and thirdly, expanding all the feature words with the absolute word frequency larger than 2 in the text of the determined classification set into the subject word dictionary as candidate feature words, and updating the subject word dictionary, wherein the calculation formula of the absolute word frequency is

Wherein F_pIs the number of documents in which the feature word appears in the subject word class, F_nThe number of documents of which the feature words appear in the non-subject word class;

and fourthly, performing next classification calculation on the text of the uncertain classification set, entering an iteration process, and finishing the iteration process to obtain a final classification set and a suggested proposal classification result under the condition that the subject word dictionary is not expanded and the classification result is not changed.

The invention has the beneficial effects that: the classification method of the invention not only has higher classification accuracy rate for some texts with definite characteristic tendency, but also effectively classifies some ambiguous texts, namely the texts with positive characteristic words and negative characteristic words, by taking part with high classification accuracy rate as a training set, and effectively classifies the rest texts with intermediate scores and uncertain characteristic polarity. Candidate feature words are obtained by screening from the determined classification set obtained after classification and are updated to the subject word dictionary, the expanded subject word dictionary can help to classify more texts, and in the iteration process, the subject word dictionary and the classification set are updated again and again, so that the classification accuracy rate can be obviously improved.

According to the method, a text representation model, a text feature extraction model and a classifier model are constructed, a large amount of useful information is automatically obtained from massive unstructured texts, the classification efficiency and the accuracy of the proposal are improved, and technical support is provided for the follow-up work of managing, counting, inquiring, analyzing and the like of the proposal.

Drawings

The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the invention and together with the description serve to explain the invention without limiting the invention.

FIG. 1 is a flow chart of the present invention;

FIG. 2 is a schematic diagram of the data preprocessing of step S1;

FIG. 3 is a classifier flowchart of step S4;

Detailed Description

The details and embodiments of the present invention are further described with reference to the accompanying drawings and the following embodiments.

Examples

Referring to fig. 1, the present embodiment provides a proposed precise classification method, which specifically includes the following steps:

s1, obtaining a proposal text sample;

s2, establishing a text representation model;

The step S1 specifically includes:

the method comprises the steps of obtaining a text sample of a proposal, wherein collected data may have various situations such as no pre-classification, no topic title or messy codes, and the like, which have negative influence on classification effect, so that the data needs to be cleaned and filtered, and as shown in fig. 2, the text sample is cleaned and noise data is removed to obtain a clean text sample.

The step S2 specifically includes:

d＝((t1:w1),(t2:w2),…,(tk:wk),…,(tn:wn)) (1)

the step S3 specifically includes:

extracting text characteristics of the text representation model formed in the step 2, converting the text into a mathematical model, wherein the text D is often a high-dimensional space, and features need to be selected to select more representative features so as to achieve the purpose of reducing dimensions; in addition, each feature in the text space has a different importance level in each text vector, and the text features also need to be weighted.

wherein N is₁And N₂Total number of documents, df, in each of the two training set classes_i,1And df_i,2Respectively indicating the total number of documents containing the characteristic i in the two training sets; tf is_iRepresenting the number of times the feature i appears in the text set D;

introduction of BM25 mode, w_iThe representative model of (a) is as follows:

wherein the content of the first and second substances,

As shown in fig. 3, the step S4 specifically includes the following steps:

The classification method of the invention not only has higher classification accuracy rate for some texts with definite characteristic tendency, but also effectively classifies some ambiguous texts, namely the texts with positive characteristic words and negative characteristic words, by taking part with high classification accuracy rate as a training set, and effectively classifies the rest texts with intermediate scores and uncertain characteristic polarity. Candidate feature words are obtained by screening from the determined classification set obtained after classification and are updated to the subject word dictionary, the expanded subject word dictionary can help to classify more texts, and in the iteration process, the subject word dictionary and the classification set are updated again and again, so that the classification accuracy rate can be obviously improved.

The above description is only a preferred example of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, improvement and the like of the present invention shall be included in the protection scope of the present invention.

Claims

1. A proposed precise classification method specifically comprises the following steps:

s1, obtaining a proposal text sample;

s2, establishing a text representation model;

2. The method for accurately classifying proposed solutions according to claim 1, wherein the step S1 is specifically as follows:

3. The method for accurately classifying proposed solutions according to claim 1, wherein the step S2 specifically comprises:

d＝((t1:w1),(t2:w2),…,(tk:wk),…,(tn:wn)) (1)

Where tk (k ═ 1, …, n) is a feature of the document space, wk is the weight of tk, and the text set D can be regarded as a vector space composed of a set of orthogonal words, constituting a text representation model.

4. The method for accurately classifying proposed solutions according to claim 1, wherein the step S3 specifically comprises:

suppose, a text in the text set DThe segment s is composed of n ordered words and is marked as a word sequence w₁,w₂,…,w_nIn the text feature representation method, the word bag method is assumed to be mutually independent one-dimensional features among words, so that the feature set of a text segment s can be represented as { w₁,w₂,…w_nExpressing the weight formula as the weight w of the characteristic i in the text set D_iThe formula is as follows:

introduction of BM25 mode, w_iThe representative model of (a) is as follows:

wherein the content of the first and second substances,

k₁and b takes a default value, k₁1.2, b 0.95, dl is the length of the document, and avg _ dl is the average length of all documents.

5. The method for accurately classifying proposed solutions according to claim 1, wherein the step S4 specifically comprises the steps of: