CN111309904A

CN111309904A - Public data classification method based on generalized characteristic word stock

Info

Publication number: CN111309904A
Application number: CN202010066137.2A
Authority: CN
Inventors: 陈磊; 刘迎风; 储昭武; 管红; 潘佳; 唐若培; 徐洁
Original assignee: Shanghai Big Data Center
Current assignee: Shanghai Big Data Center
Priority date: 2020-01-20
Filing date: 2020-01-20
Publication date: 2020-06-19

Abstract

The invention relates to the technical field of natural language processing, and in particular to a public data grading method based on a generalized characteristic word stock, comprising the following steps: step S1, establishing a generalized characteristic word stock; step S2, preprocessing a text to be classified, calling the generalized characteristic word stock to perform regular-expression matching on the preprocessed text, generating a new text to be classified, and obtaining a multi-level label of the new text to be classified; step S3, segmenting the new text to be classified with a word segmentation tool to obtain a text word set, splitting the text word set to form characters, and converting the characters into a feature vector matrix through the TF-IDF algorithm; and step S4, inputting the feature vector matrix into a text classifier, generating a text grading model, and outputting a grading result for the text to be graded. The invention can greatly improve the efficiency, speed and accuracy of public data grading.

Description

Public data classification method based on generalized characteristic word stock

Technical Field

The invention relates to the technical field of natural language processing, in particular to a public data grading method based on a generalized characteristic word stock.

Background

With the advance of urban digital transformation and the centralized, unified management of public data, the classification and grading of public data urgently needs to be solved. In particular, the security grading of a public data directory determines which data can be shared and opened unconditionally, and which data are subject to conditional sharing and opening, or are neither opened nor shared, on account of personal privacy, core business confidentiality, or relevant laws and regulations. Data authorization, sharing and opening can then be carried out according to different application scenarios, so that data empowers urban management and a data operation ecosystem takes shape. At present, public data is graded mainly by hand, relying on the background knowledge of professionals and the regulations they consult; this manual approach depends on the ability of the staff, involves a huge workload, and is inefficient.

Text classification technology from the natural language processing field of artificial intelligence is therefore introduced, which can greatly improve the efficiency and speed of public data grading while also improving its accuracy.

At present, text classification techniques fall mainly into statistical learning methods and deep learning methods. The former center on feature selection: word-level and sentence-level features of a text are selected through indexes such as TF-IDF (term frequency-inverse document frequency, a weighting technique commonly used in information retrieval and data mining), PMI (pointwise mutual information), and the chi-square value to obtain a feature vector representing the text, and a machine learning method then yields the probability of each label for that feature vector, which serves as the final classification criterion. The latter center on model construction: discrete information of the text is taken as input, and the network weights are updated through the serial and parallel structures of a multilayer neural network and the back-propagation algorithm, so that the probability of the text on each label is obtained directly.

However, in public data grading it is necessary to extract not only the feature words in the description of the data being graded but also the feature words in the relevant legal provisions, and to increase the weight of these feature words appropriately. A text classification method dedicated to grading public data is therefore proposed herein.

Disclosure of Invention

In order to solve the technical problems, the invention provides a public data classification method based on a generalized characteristic word bank.

The technical problem solved by the invention can be realized by adopting the following technical scheme:

a public data classification method based on a generalized characteristic word stock is characterized by comprising the following steps:

step S1, establishing a generalized characteristic word stock;

step S2, preprocessing the text to be classified, calling the generalized characteristic word stock to perform regular-expression matching on the preprocessed text to be classified, generating a new text to be classified, and obtaining a multi-level label of the new text to be classified;

step S3, performing word segmentation on the new text to be classified by using a word segmentation tool to obtain a text word set; performing word segmentation processing on the text word set to form characters, and converting the characters into a characteristic vector matrix through a TF-IDF algorithm;

and step S4, inputting the characteristic vector matrix into a text classifier, generating a text grading model, and outputting a grading result of the text to be graded.

Preferably, the step S1 includes:

step S10, obtaining a corpus, and eliminating homogeneous data in the corpus;

step S11, classifying the corpus after the homogeneous data is eliminated;

step S12, sorting the classified data in the corpus;

step S13, extracting the first N data in the sorted linguistic data, and storing the first N data in a document, wherein N is larger than 1;

and step S14, processing the document to establish the generalized characteristic word stock.

Preferably, in step S10, a large amount of text data is obtained from public data as the corpus, and a value N is preset to remove the homogeneous data in the corpus.

Preferably, the preprocessing of the text to be classified in step S2 includes removing sensitive words, garbled characters and punctuation marks from the text to be classified, so as to strip its redundant parts.

Preferably, in step S3, the ending word segmentation method is used to segment the new text to be classified.

Preferably, the ending word segmentation method performs full segmentation on the sentences in the new text to be classified to generate the text word set.

Preferably, in step S3, the text word set is split one by one with spaces as delimiters to form the characters, the characters of each line in the new text to be classified are read, the frequency of occurrence of each character is calculated through the TF-IDF algorithm, and the feature vector matrix is established.

Preferably, in step S4, the feature vector matrix is converted into one input vector of the text classifier, the multi-classification label is converted into another input vector of the text classifier, the text classification model is generated by invoking the text classifier training algorithm, and the classification result of the text to be classified is output.

Preferably, the grading result is the classification accuracy for the text to be classified and a TopN ranking.

Preferably, the text classifier is a support vector machine classifier.

The beneficial effects are that:

the public data classification method based on the generalized characteristic lexicon can greatly improve the efficiency, speed and accuracy of public data classification.

Drawings

FIG. 1 is a diagram of steps of a public data classification method based on a generalized characteristic lexicon according to the present invention;

FIG. 2 is a flowchart illustrating an embodiment of step S1 in FIG. 1.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

It should be noted that the embodiments and features of the embodiments may be combined with each other without conflict.

The invention is further described with reference to the following drawings and specific examples, which are not intended to be limiting.

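Referring to FIG. 1, the public data classification method based on a generalized characteristic word stock provided by the present invention includes:

step S1, establishing a generalized characteristic word stock;

step S2, preprocessing the text to be classified, calling the generalized characteristic word stock to perform regular-expression matching on the preprocessed text to be classified, generating a new text to be classified, and obtaining a multi-level label of the new text to be classified;

step S3, performing word segmentation on the new text to be classified by using a word segmentation tool to obtain a text word set; splitting the text word set to form characters, and converting the characters into a feature vector matrix through the TF-IDF algorithm;

and step S4, inputting the feature vector matrix into a text classifier, generating a text grading model, and outputting a grading result of the text to be graded.

Referring to FIG. 2, an embodiment of step S1 in FIG. 1 includes:

step S10, obtaining a corpus, and eliminating homogeneous data in the corpus;

step S11, classifying the corpus after the homogeneous data is eliminated;

step S12, sorting the classified data in the corpus;

step S13, extracting the first N data in the sorted corpus, and storing the first N data in a document, wherein N is larger than 1;

and step S14, processing the document to establish the generalized characteristic word stock.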

Further, in step S10, a large amount of text data is obtained from the public data as a corpus, and the predetermined N value is used to remove the homogeneous data in the corpus.

Specifically, a value N is preset, the data in the corpus are sorted by a TopN algorithm, and the first N data in the sorted corpus are extracted and stored in a document; the document is then compiled and packaged to generate the generalized characteristic word stock. The generalized characteristic word stock provided by the invention is not restricted to one kind of text: it can process various text data and filter out conventional non-specific words, and processing the text to be classified with it speeds up classification.
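The TopN construction described above can be sketched as follows. This is only an illustration under stated assumptions: the description does not say what quantity the corpus is sorted by, so term frequency is assumed here, "homogeneous data" is read as verbatim duplicate records, and whitespace-separated English stands in for segmented Chinese text. The function name `build_lexicon` and the sample corpus are hypothetical.

```python
from collections import Counter

def build_lexicon(corpus, n):
    """Return the n most frequent terms of a de-duplicated corpus."""
    # Step S10 (assumed reading): drop verbatim duplicate records, keeping order.
    unique_docs = list(dict.fromkeys(corpus))
    # Steps S11-S12 (assumed): count terms, then sort by descending frequency,
    # breaking ties alphabetically so the result is deterministic.
    counts = Counter(term for doc in unique_docs for term in doc.split())
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    # Step S13: keep the first N entries.
    return [term for term, _ in ranked[:n]]

corpus = [
    "resident id number",
    "resident id number",           # homogeneous duplicate, removed in S10
    "resident home address",
    "company registration number",
]
lexicon = build_lexicon(corpus, 3)
print(lexicon)
```

In a real pipeline the returned list would be written to the document mentioned in step S13 before being packaged as the word stock.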

Further, the preprocessing of the text to be classified in step S2 includes removing sensitive words, garbled characters and punctuation marks from the text to be classified, so as to strip its redundant parts.

Specifically, the text to be classified is first preprocessed, that is, sensitive words, garbled characters, punctuation marks and the like are removed so as to strip its redundant parts; the preprocessed text is then matched against the generalized characteristic word stock by regular expressions, which filters the text further and yields the new text to be classified together with its multi-classification label, providing a sound basis for building the text classification model.
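A minimal sketch of the preprocessing and regular-expression matching of step S2 follows. Several points are assumptions, not statements of the patented method: the word stock is modeled as a mapping from term to label, matching is taken to keep the terms that hit the word stock (the description does not say whether hits are kept or discarded), and the names `SENSITIVE_WORDS`, `LEXICON`, `preprocess` and `match_lexicon` are hypothetical.

```python
import re

SENSITIVE_WORDS = ["CONFIDENTIAL"]           # hypothetical sensitive-word list
LEXICON = {                                   # hypothetical word stock: term -> label
    "id number": "personal-privacy",
    "home address": "personal-privacy",
    "registration number": "business",
}

def preprocess(text):
    """Remove sensitive words, garbled characters and punctuation."""
    for word in SENSITIVE_WORDS:
        text = text.replace(word, " ")
    text = text.replace("\ufffd", " ")        # U+FFFD marks undecodable bytes
    text = re.sub(r"[^\w\s]", " ", text)      # punctuation -> space
    return re.sub(r"\s+", " ", text).strip().lower()

def match_lexicon(text):
    """Keep the word-stock terms found in the text and collect their labels."""
    # Longer terms first so that they win over shorter overlapping ones.
    terms = sorted(LEXICON, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(t) for t in terms))
    hits = pattern.findall(text)
    labels = sorted({LEXICON[h] for h in hits})
    return " ".join(hits), labels

raw = "CONFIDENTIAL: resident ID number, home address\ufffd; company registration number!"
new_text, labels = match_lexicon(preprocess(raw))
print(new_text)
print(labels)
```

Escaping each term with `re.escape` keeps word-stock entries that contain regex metacharacters from being interpreted as patterns.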

Further, in step S3, the ending word segmentation method is used to segment the new text to be classified.

Further, the ending word segmentation method performs full segmentation on the sentences in the new text to be classified to generate a text word set.
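Full segmentation enumerates every dictionary word that starts at each position of a sentence. The sketch below stores the result as a graph keyed by start index, which is one plausible reading of the step and not a reproduction of any particular segmentation tool. The toy dictionary and the function names `full_segmentation` and `all_words` are hypothetical.

```python
def full_segmentation(sentence, dictionary):
    """Build a word graph: start index -> list of end indices of dictionary words."""
    graph = {}
    for start in range(len(sentence)):
        ends = [end for end in range(start + 1, len(sentence) + 1)
                if sentence[start:end] in dictionary]
        # Fall back to a single character so every position stays connected.
        graph[start] = ends or [start + 1]
    return graph

def all_words(sentence, graph):
    """List every candidate word encoded by the graph."""
    return [sentence[s:e] for s in sorted(graph) for e in graph[s]]

dictionary = {"公共", "数据", "公共数据", "分级"}   # hypothetical toy dictionary
sentence = "公共数据分级"
graph = full_segmentation(sentence, dictionary)
words = all_words(sentence, graph)
print(graph)
print(" ".join(words))
```

Joining the candidate words with spaces, as in the last line, gives the space-delimited form that the later TF-IDF step reads.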

Specifically, the ending word segmentation method can be used to segment the new text to be classified: the sentences in the new text to be classified are fully segmented to generate a word graph represented by an adjacency list, namely the text word set; the text word set is then split word by word with a space as the delimiter, that is, a space is inserted between adjacent words; a TF-IDF structure is established, the characters in each line of the new text to be classified are read, the frequency of each character is calculated, and the feature vector matrix is built.
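The feature vector matrix can be illustrated with a textbook TF-IDF variant, where the term frequency is the count divided by the line length and the inverse document frequency is log(N / df). The description does not state which TF-IDF weighting is used, so this choice is an assumption, as are the function name `tfidf_matrix` and the sample lines. Each input line is assumed to be non-empty and already space-delimited.

```python
import math
from collections import Counter

def tfidf_matrix(docs):
    """Return (vocabulary, matrix) with one row per document and one column per term."""
    tokenized = [doc.split() for doc in docs]           # tokens are space-separated
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term.
    df = Counter(tok for toks in tokenized for tok in set(toks))
    matrix = []
    for toks in tokenized:
        counts = Counter(toks)
        row = [(counts[t] / len(toks)) * math.log(n_docs / df[t]) for t in vocab]
        matrix.append(row)
    return vocab, matrix

docs = ["id number id", "home address", "id address"]
vocab, matrix = tfidf_matrix(docs)
print(vocab)
for row in matrix:
    print([round(v, 3) for v in row])
```

A term that appears in every line gets weight zero under this variant, which is why common non-specific words contribute little to the matrix.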

Further, in step S4, the feature vector matrix is converted into one input vector of the text classifier, the multi-class label is converted into another input vector of the text classifier, a text classification model is generated by invoking a text classifier training algorithm, and a classification result of the text to be classified is output.

Further, the grading result is the classification accuracy for the text to be classified and a TopN ranking.

Further, the text classifier is a support vector machine classifier.

Specifically, the feature vector matrix is converted into a vector x and the multi-classification label into a vector y; by calling a pattern recognition and regression software package and the svm_train(y, x) training algorithm of the support vector machine classifier, the vectors x and y are fed to the support vector machine classifier as input vectors, a text classification model is generated, and the classification accuracy for the text to be classified and a TopN ranking list are obtained. With the support vector machine classifier, no numerical optimization algorithm is needed and no matrix has to be stored, which improves text classification efficiency.
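The training and ranking step can be sketched without any external package. Note the substitution: instead of the svm_train(y, x) call of the software package named above, this example fits a small bias-free linear SVM per grade (one-vs-rest) by sub-gradient descent on the hinge loss. The reading of "TopN ranking" as the grades ordered by decision score is also an assumption, and the feature rows, grade names and function names are hypothetical.

```python
def train_binary_svm(x, y, epochs=50, eta=0.1, lam=0.01):
    """Bias-free linear SVM (labels +1/-1) fitted by sub-gradient descent on hinge loss."""
    w = [0.0] * len(x[0])
    for _ in range(epochs):
        for xi, yi in zip(x, y):
            margin = yi * sum(wj * xj for wj, xj in zip(w, xi))
            w = [wj * (1 - eta * lam) for wj in w]           # L2 regularization step
            if margin < 1:                                    # hinge loss is active
                w = [wj + eta * yi * xj for wj, xj in zip(w, xi)]
    return w

def train_one_vs_rest(x, labels):
    """One binary SVM per grade; returns {grade: weight vector}."""
    return {g: train_binary_svm(x, [1 if lab == g else -1 for lab in labels])
            for g in sorted(set(labels))}

def top_n(model, xi, n):
    """Rank grades by decision score, highest first, and keep the first n."""
    scores = {g: sum(wj * xj for wj, xj in zip(w, xi)) for g, w in model.items()}
    return sorted(scores, key=lambda g: (-scores[g], g))[:n]

# Hypothetical feature rows (x) and grade labels (y).
x = [[1, 0, 0], [1, 0, 0], [0, 1, 0], [0, 1, 0], [0, 0, 1], [0, 0, 1]]
y = ["open", "open", "conditional", "conditional", "closed", "closed"]
model = train_one_vs_rest(x, y)
predictions = [top_n(model, xi, 1)[0] for xi in x]
accuracy = sum(p == t for p, t in zip(predictions, y)) / len(y)
print(predictions, accuracy)
```

The accuracy here is measured on the training rows only, which is enough to show the mechanics but says nothing about performance on unseen text.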

While the invention has been described with reference to a preferred embodiment, it will be understood by those skilled in the art that various changes in form and detail may be made therein without departing from the spirit and scope of the invention.

Claims

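1. A public data classification method based on a generalized characteristic word stock is characterized by comprising the following steps:

step S1, establishing a generalized characteristic word stock;

step S2, preprocessing the text to be classified, calling the generalized characteristic word stock to perform regular-expression matching on the preprocessed text to be classified, generating a new text to be classified, and obtaining a multi-level label of the new text to be classified;

step S3, performing word segmentation on the new text to be classified by using a word segmentation tool to obtain a text word set; splitting the text word set to form characters, and converting the characters into a feature vector matrix through the TF-IDF algorithm;

and step S4, inputting the feature vector matrix into a text classifier, generating a text grading model, and outputting a grading result of the text to be graded.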

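2. The method for classifying public data based on the generalized characteristic word stock as claimed in claim 1, wherein said step S1 comprises:

step S10, obtaining a corpus, and eliminating homogeneous data in the corpus;

step S11, classifying the corpus after the homogeneous data is eliminated;

step S12, sorting the classified data in the corpus;

step S13, extracting the first N data in the sorted corpus, and storing the first N data in a document, wherein N is larger than 1;

and step S14, processing the document to establish the generalized characteristic word stock.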

3. The method as claimed in claim 1, wherein in step S10, a large amount of text data is obtained from public data as the corpus, and N is preset to remove the homogeneous data in the corpus.

4. The method as claimed in claim 1, wherein the preprocessing of the text to be classified in step S2 includes removing sensitive words, garbled characters and punctuation marks from the text to be classified, so as to strip its redundant parts.

5. The method for classifying public data based on the generalized characteristic word stock as claimed in claim 1, wherein in step S3 the new text to be classified is segmented by using the ending word segmentation method.

6. The method as claimed in claim 5, wherein the ending word segmentation method performs full segmentation on the sentences in the new text to be classified to generate the text word set.

7. The method of claim 1, wherein in step S3, the text word set is split one by one, using a space as the delimiter, to form the characters, the characters in each line of the new text to be classified are read, the frequency of occurrence of each character is calculated by the TF-IDF algorithm, and the feature vector matrix is established.

8. The method as claimed in claim 1, wherein in step S4, the feature vector matrix is converted into one input vector of the text classifier, the multi-class label is converted into another input vector of the text classifier, the text classification model is generated by invoking the training algorithm of the text classifier, and the classification result of the text to be classified is outputted.

9. The method as claimed in claim 1, wherein the grading result is the classification accuracy for the text to be classified and a TopN ranking.

10. The method of claim 1, wherein the text classifier is a support vector machine classifier.