CN111309904A - Public data classification method based on generalized characteristic word stock

Publication number
CN111309904A
Authority
CN
China
Prior art keywords
text
classified
data
word
generalized
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010066137.2A
Other languages
Chinese (zh)
Inventor
陈磊
刘迎风
储昭武
管红
潘佳
唐若培
徐洁
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Big Data Center
Original Assignee
Shanghai Big Data Center
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Big Data Center
Priority to CN202010066137.2A
Publication of CN111309904A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2411 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to the technical field of natural language processing, in particular to a public data grading method based on a generalized characteristic word stock, which comprises the following steps: step S1, establishing a generalized characteristic word stock; step S2, preprocessing a text to be classified, calling the generalized characteristic word stock to perform regular-expression matching on the preprocessed text, generating a new text to be classified, and obtaining a multi-level label of the new text; step S3, segmenting the new text to be classified with a word segmentation tool to obtain a text word set, splitting the text word set to form characters, and converting the characters into a feature vector matrix through the TF-IDF algorithm; and step S4, inputting the feature vector matrix into a text classifier, generating a text grading model, and outputting a grading result for the text to be classified. The invention can greatly improve the efficiency, speed and accuracy of public data grading.

Description

Public data classification method based on generalized characteristic word stock
Technical Field
The invention relates to the technical field of natural language processing, in particular to a public data grading method based on a generalized characteristic word stock.
Background
With the advance of urban digital transformation and the centralized, unified management of public data, the classification and grading of public data urgently needs to be solved. In particular, security grading of a public data directory determines which data can be shared and opened unconditionally, and which data are subject to conditional sharing and opening, or are neither opened nor shared, on grounds of personal privacy, core business confidentiality, or relevant laws and regulations. Data authorization, sharing and opening can then be carried out according to the application scenario, so that data empowers urban management and a data operation ecosystem is formed. At present, public data is graded mainly by hand, relying on the background knowledge of professionals and on the relevant regulations; this manual approach depends on the ability of the staff, involves a huge workload, and is inefficient.
Text classification technology from the natural language processing branch of artificial intelligence is therefore applied, which can greatly improve the efficiency and speed of public data grading while also improving grading accuracy.
At present, text classification methods fall mainly into statistical learning methods and deep learning methods. The former centres on feature selection: word-level and sentence-level features of a text are selected through indicators such as TF-IDF (term frequency-inverse document frequency, a weighting technique commonly used in information retrieval and data mining), PMI (pointwise mutual information) and the chi-square value to obtain a feature vector representing the text, and a machine learning method is then used to obtain the probability of each label for the feature vector, which serves as the final classification criterion. The latter centres on model construction: discrete information of the text is taken as input, and the network weights are updated through the serial and parallel structures of a multilayer neural network and the back-propagation algorithm, so that the probability of the text on each label is obtained directly.
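For reference, two of the indicators named above are commonly written as follows. The notation is introduced here for illustration only and does not appear in the original text: t is a term, d a document, c a label, N the number of documents, tf(t, d) the count of t in d, and df(t) the number of documents containing t. Smoothed IDF variants are also common, and the text does not say which form is intended.

```latex
\[
\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d)\cdot\log\frac{N}{\mathrm{df}(t)},
\qquad
\mathrm{PMI}(t, c) = \log\frac{P(t, c)}{P(t)\,P(c)}
\]
```

A term that occurs in every document has df(t) = N and therefore a TF-IDF weight of zero, which is why such weighting suppresses generic words.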
However, in public data grading, it is necessary to extract not only the feature words in the description of the data being graded but also the feature words in the relevant legal provisions, and to increase the weight of these feature words appropriately. A text classification method specifically for grading public data is therefore proposed herein.
Disclosure of Invention
In order to solve the technical problems, the invention provides a public data classification method based on a generalized characteristic word bank.
The technical problem solved by the invention can be realized by adopting the following technical scheme:
a public data classification method based on a generalized characteristic word stock is characterized by comprising the following steps:
step S1, establishing a generalized characteristic word stock;
step S2, preprocessing the text to be classified through the generalized characteristic word stock, calling the generalized characteristic word stock to perform regular-expression matching on the preprocessed text to be classified, generating a new text to be classified, and obtaining a multi-level label of the new text to be classified;
step S3, segmenting the new text to be classified with a word segmentation tool to obtain a text word set; splitting the text word set to form characters, and converting the characters into a feature vector matrix through the TF-IDF algorithm;
and step S4, inputting the characteristic vector matrix into a text classifier, generating a text grading model, and outputting a grading result of the text to be graded.
Preferably, the step S1 includes:
step S10, obtaining a corpus, and removing homogeneous data from the corpus;
step S11, classifying the corpus after the homogeneous data is removed;
step S12, sorting the classified data in the corpus;
step S13, extracting the first N data items from the sorted corpus, and storing them in a document, wherein N is larger than 1;
and step S14, processing the document to establish the generalized characteristic word stock.
Preferably, in step S10, a large amount of text data is obtained from public data as the corpus, and a value N is preset to remove the homogeneous data in the corpus.
Preferably, the preprocessing of the text to be classified in step S2 includes: removing sensitive words, garbled characters and punctuation marks from the text to be classified, so as to remove its redundant parts.
Preferably, in step S3, the jieba ("结巴") word segmentation method is used to segment the new text to be classified.
Preferably, the jieba word segmentation method performs full segmentation on the sentences in the new text to be classified to generate the text word set.
Preferably, in step S3, the text word set is split word by word with a space as the delimiter to form the characters, the characters in each line of the new text to be classified are read, the frequency of occurrence of each character is calculated through the TF-IDF algorithm, and the feature vector matrix is established.
Preferably, in step S4, the feature vector matrix is converted into one input vector of the text classifier, the multi-classification label is converted into another input vector of the text classifier, the text classification model is generated by invoking the text classifier training algorithm, and the classification result of the text to be classified is output.
Preferably, the grading result comprises the classification accuracy for the text to be classified and a TopN ranking.
Preferably, the text classifier is a support vector machine classifier.
The beneficial effects are that:
the public data classification method based on the generalized characteristic lexicon can greatly improve the efficiency, speed and accuracy of public data classification.
Drawings
FIG. 1 is a step diagram of the public data classification method based on a generalized characteristic word stock according to the present invention;
FIG. 2 is a flowchart of an embodiment of step S1 in FIG. 1.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be noted that the embodiments and features of the embodiments may be combined with each other without conflict.
The invention is further described with reference to the following drawings and specific examples, which are not intended to be limiting.
Referring to FIG. 1, the public data classification method based on a generalized characteristic word stock provided by the present invention includes:
step S1, establishing a generalized characteristic word stock;
step S2, preprocessing the text to be classified through the generalized characteristic word stock, calling the generalized characteristic word stock to perform regular-expression matching on the preprocessed text to be classified, generating a new text to be classified, and obtaining a multi-level label of the new text to be classified;
step S3, segmenting the new text to be classified with a word segmentation tool to obtain a text word set; splitting the text word set to form characters, and converting the characters into a feature vector matrix through the TF-IDF algorithm;
and step S4, inputting the characteristic vector matrix into a text classifier, generating a text grading model, and outputting a grading result of the text to be graded.
Referring to FIG. 2, an embodiment of step S1 in FIG. 1 includes:
step S10, obtaining a corpus, and removing homogeneous data from the corpus;
step S11, classifying the corpus after the homogeneous data is removed;
step S12, sorting the classified data in the corpus;
step S13, extracting the first N data items from the sorted corpus, and storing them in a document, wherein N is larger than 1;
and step S14, processing the document to establish the generalized characteristic word stock.
Further, in step S10, a large amount of text data is obtained from the public data as the corpus, and a value N is preset to remove the homogeneous data in the corpus.
Specifically, a value N is preset, the data in the corpus are sorted by a TopN algorithm, the first N data items in the sorted corpus are extracted and stored in a document, and the document is then processed and packaged in code to generate the generalized characteristic word stock. The generalized characteristic word stock provided by the invention is not limited to a particular kind of text data: it can process various text data and filter out conventional, non-specific words. Processing the text to be classified with the generalized characteristic word stock speeds up its classification.
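Steps S10 to S13 can be sketched as follows under one possible reading, which the text does not confirm: "homogeneous data" is taken to mean verbatim duplicate entries, and the sort key is taken to be term frequency. The function name `build_lexicon` and the sample corpus are hypothetical, and the entries are assumed to be already space-delimited.

```python
from collections import Counter

def build_lexicon(corpus, n):
    """Return the n most frequent terms after dropping duplicate entries."""
    if n <= 1:
        raise ValueError("the method requires N > 1")
    # Step S10 (as read here): drop verbatim duplicates, keeping first-seen order.
    unique_entries = list(dict.fromkeys(corpus))
    # Steps S11-S12 (as read here): count terms and order them by frequency.
    counts = Counter(term for entry in unique_entries for term in entry.split())
    # Step S13: keep the first N terms of the sorted result.
    return [term for term, _ in counts.most_common(n)]

corpus = [
    "resident id number",
    "resident id number",      # duplicate entry, removed in step S10
    "resident phone number",
    "company license number",
]
lexicon = build_lexicon(corpus, 2)
print(lexicon)
```

Whether the resulting terms are later kept as features or filtered out as generic words is a separate design choice that this sketch leaves open.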
Further, the preprocessing of the text to be classified in step S2 includes: removing sensitive words, garbled characters and punctuation marks from the text to be classified, so as to remove its redundant parts.
Specifically, the text to be classified is first preprocessed, that is, sensitive words, garbled characters, punctuation marks and the like are removed so as to remove its redundant parts. Regular-expression matching is then performed on the text through the generalized characteristic word stock, which filters the text further and yields the new text to be classified together with its multi-classification label, providing a sound basis for building the text classification model.
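The preprocessing and matching in step S2 can be sketched as follows. This is an illustration under stated assumptions rather than the patented implementation: garbled characters are approximated by the Unicode replacement character U+FFFD, the matching step is assumed to keep the word-stock terms it finds (the opposite reading, dropping them, is equally possible), and the names `preprocess` and `match_lexicon` are hypothetical.

```python
import re
import string

def preprocess(text, sensitive_words):
    """Remove sensitive words, undecodable characters and punctuation."""
    for word in sensitive_words:
        text = text.replace(word, " ")
    # U+FFFD is what decoders emit for bytes they cannot decode.
    text = text.replace("\ufffd", " ")
    # Replace ASCII punctuation and a few full-width marks with spaces.
    punctuation = string.punctuation + "，。；：！？、（）"
    text = text.translate(str.maketrans(punctuation, " " * len(punctuation)))
    # Collapse runs of whitespace left behind by the removals.
    return " ".join(text.split())

def match_lexicon(text, lexicon):
    """Return the lexicon terms found in text, in order of appearance."""
    # Longest terms first, so overlapping entries prefer the longer match.
    ordered = sorted(lexicon, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(term) for term in ordered))
    return pattern.findall(text)

raw = "Resident ID number, (secret) home address; phone\ufffd number!"
clean = preprocess(raw, sensitive_words=["secret"])
new_text = " ".join(match_lexicon(clean, ["ID number", "number", "home address"]))
print(clean)
print(new_text)
```

Escaping each term with `re.escape` keeps word-stock entries that contain regex metacharacters from being interpreted as patterns.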
Further, in step S3, the jieba ("结巴") word segmentation method is used to segment the new text to be classified.
Further, the jieba word segmentation method performs full segmentation on the sentences in the new text to be classified to generate a text word set.
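Full segmentation of a sentence into a word graph can be sketched as follows. This is a hand-written illustration, not a call into any segmentation library: the graph maps each start index to the end indices of the dictionary words that begin there, and the four-word dictionary is a hypothetical stand-in for the large dictionary a real segmenter would use.

```python
def build_word_graph(sentence, dictionary):
    """Map each start index to the end indices of dictionary words starting there."""
    graph = {}
    for start in range(len(sentence)):
        ends = [
            end
            for end in range(start, len(sentence))
            if sentence[start:end + 1] in dictionary
        ]
        # A character that starts no dictionary word stands as a one-character word.
        graph[start] = ends or [start]
    return graph

def all_words(sentence, graph):
    """List every candidate word encoded in the graph."""
    return [sentence[s:e + 1] for s, ends in graph.items() for e in ends]

dictionary = {"公共", "数据", "公共数据", "分级"}  # hypothetical mini-dictionary
sentence = "公共数据分级"
graph = build_word_graph(sentence, dictionary)
print(graph)
print(all_words(sentence, graph))
```

The graph keeps every candidate word, including overlapping ones such as "公共" and "公共数据"; choosing a single best path through it is a later step that this sketch does not cover.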
Specifically, the jieba word segmentation method can be used to segment the new text to be classified: full segmentation is performed on the sentences in the new text to generate a word graph represented by an adjacency list, namely the text word set. The text word set is then split word by word with a space as the delimiter, that is, a space is inserted between the words of the text word set. A TF-IDF structure is established, the characters in each line of the new text to be classified are read, the frequency of occurrence of each character is calculated, and the feature vector matrix is established.
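Building the feature vector matrix from space-delimited lines can be sketched as follows. The weighting used here, relative term frequency times the unsmoothed log(N/df), is an assumption, since the text names TF-IDF without fixing a variant; the function name `tfidf_matrix` and the sample lines are hypothetical.

```python
import math
from collections import Counter

def tfidf_matrix(lines):
    """Return (vocabulary, matrix) for space-delimited lines of text."""
    docs = [line.split() for line in lines]
    vocabulary = sorted({token for doc in docs for token in doc})
    n_docs = len(docs)
    # Document frequency: the number of lines in which each token occurs.
    df = Counter(token for doc in docs for token in set(doc))
    matrix = []
    for doc in docs:
        counts = Counter(doc)
        # One row per line, one column per vocabulary token; empty lines give zeros.
        row = [
            (counts[token] / len(doc)) * math.log(n_docs / df[token]) if doc else 0.0
            for token in vocabulary
        ]
        matrix.append(row)
    return vocabulary, matrix

lines = ["public data grade", "public data share", "personal privacy data"]
vocabulary, matrix = tfidf_matrix(lines)
for row in matrix:
    print([round(value, 3) for value in row])
```

In this sample the token "data" occurs in every line, so its column is zero throughout, while tokens unique to one line receive the largest weights.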
Further, in step S4, the feature vector matrix is converted into one input vector of the text classifier, the multi-class label is converted into another input vector of the text classifier, a text classification model is generated by invoking a text classifier training algorithm, and a classification result of the text to be classified is output.
Further, the grading result comprises the classification accuracy for the text to be classified and a TopN ranking.
Further, the text classifier is a support vector machine classifier.
Specifically, the feature vector matrix is converted into a vector x and the multi-classification label is converted into a vector y. By calling the svm_train(y, x) training algorithm of a pattern recognition and regression software package for the support vector machine classifier, the vectors x and y are input into the support vector machine classifier, a text classification model is generated, and the classification accuracy for the text to be classified and a TopN ranking list are obtained. With the support vector machine classifier, no separate numerical optimization algorithm is needed and no matrix needs to be stored, which improves text classification efficiency.
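The training and ranking in step S4 can be illustrated with the sketch below. It does not use the software package or the svm_train call referred to above; as a stand-in it trains one-vs-rest linear support vector machines in plain Python by stochastic subgradient descent on the regularised hinge loss. The function names, hyperparameters, feature rows and the three grade labels are all hypothetical.

```python
def train_linear_svm(x, y, epochs=200, lr=0.1, lam=0.001):
    """Train one-vs-rest linear SVMs by subgradient descent on the hinge loss."""
    model = {}
    for label in sorted(set(y)):
        w, b = [0.0] * len(x[0]), 0.0
        for _ in range(epochs):
            for row, target in zip(x, y):
                sign = 1.0 if target == label else -1.0
                score = sum(wi * xi for wi, xi in zip(w, row)) + b
                # L2 shrinkage on the weights (the bias is left unregularised).
                w = [wi * (1.0 - lr * lam) for wi in w]
                if sign * score < 1.0:  # inside the margin: hinge subgradient step
                    w = [wi + lr * sign * xi for wi, xi in zip(w, row)]
                    b += lr * sign
        model[label] = (w, b)
    return model

def rank_labels(model, row, top_n):
    """Return the top_n labels ordered by decreasing decision score."""
    scores = {
        label: sum(wi * xi for wi, xi in zip(w, row)) + b
        for label, (w, b) in model.items()
    }
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

def accuracy(model, x, y):
    """Fraction of rows whose top-ranked label equals the true label."""
    hits = sum(rank_labels(model, row, 1)[0] == target for row, target in zip(x, y))
    return hits / len(y)

# Hypothetical feature rows (x) and grade labels (y).
x = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 1.0, 0.0],
     [0.1, 0.9, 0.0], [0.0, 0.0, 1.0], [0.0, 0.1, 0.9]]
y = ["unconditional", "unconditional", "conditional",
     "conditional", "closed", "closed"]
model = train_linear_svm(x, y)
print(rank_labels(model, x[0], 2))
print(accuracy(model, x, y))
```

Ranking labels by decision score gives a TopN list per text, and the accuracy here is measured on the training rows only; a held-out set would be needed for a meaningful estimate.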
While the invention has been described with reference to a preferred embodiment, it will be understood by those skilled in the art that various changes in form and detail may be made therein without departing from the spirit and scope of the invention.

Claims (10)

1. A public data classification method based on a generalized characteristic word stock is characterized by comprising the following steps:
step S1, establishing a generalized characteristic word stock;
step S2, preprocessing the text to be classified through the generalized characteristic word stock, calling the generalized characteristic word stock to perform regular-expression matching on the preprocessed text to be classified, generating a new text to be classified, and obtaining a multi-level label of the new text to be classified;
step S3, segmenting the new text to be classified with a word segmentation tool to obtain a text word set; splitting the text word set to form characters, and converting the characters into a feature vector matrix through the TF-IDF algorithm;
and step S4, inputting the characteristic vector matrix into a text classifier, generating a text grading model, and outputting a grading result of the text to be graded.
2. The method for classifying public data based on the lexicon of generalized features as claimed in claim 1, wherein said step S1 comprises:
step S10, obtaining a corpus, and removing homogeneous data from the corpus;
step S11, classifying the corpus after the homogeneous data is removed;
step S12, sorting the classified data in the corpus;
step S13, extracting the first N data items from the sorted corpus, and storing them in a document, wherein N is larger than 1;
and step S14, processing the document to establish the generalized characteristic word stock.
3. The method as claimed in claim 1, wherein in step S10, a large amount of text data is obtained from public data as the corpus, and N is preset to remove the homogeneous data in the corpus.
4. The method as claimed in claim 1, wherein the preprocessing of the text to be classified in step S2 includes: removing sensitive words, garbled characters and punctuation marks from the text to be classified, so as to remove redundant parts of the text to be classified.
5. The method for classifying public data based on the lexicon of generalized features as claimed in claim 1, wherein in step S3 the new text to be classified is segmented by using the jieba ("结巴") word segmentation method.
6. The method as claimed in claim 5, wherein the jieba word segmentation method performs full segmentation on the sentences in the new text to be classified to generate the text word set.
7. The method of claim 1, wherein in step S3, the text word set is split word by word with a space as the delimiter to form the characters, the characters in each line of the new text to be classified are read, the frequency of occurrence of each character is calculated by the TF-IDF algorithm, and the feature vector matrix is established.
8. The method as claimed in claim 1, wherein in step S4, the feature vector matrix is converted into one input vector of the text classifier, the multi-class label is converted into another input vector of the text classifier, the text classification model is generated by invoking the training algorithm of the text classifier, and the classification result of the text to be classified is outputted.
9. The method as claimed in claim 1, wherein the grading result comprises the classification accuracy for the text to be classified and a TopN ranking.
10. The method of claim 1, wherein the text classifier is a support vector machine classifier.
CN202010066137.2A 2020-01-20 2020-01-20 Public data classification method based on generalized characteristic word stock Pending CN111309904A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010066137.2A CN111309904A (en) 2020-01-20 2020-01-20 Public data classification method based on generalized characteristic word stock

Publications (1)

Publication Number Publication Date
CN111309904A true CN111309904A (en) 2020-06-19

Family

ID=71156399

Country Status (1)

Country Link
CN (1) CN111309904A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112257425A (en) * 2020-09-29 2021-01-22 国网天津市电力公司 Power data analysis method and system based on data classification model

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20200619