CN111309904A - Public data classification method based on generalized characteristic word stock - Google Patents
- Publication number
- CN111309904A
- Authority
- CN
- China
- Prior art keywords
- text
- classified
- data
- word
- generalized
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/2411—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
Abstract
The invention relates to the technical field of natural language processing, and in particular to a public data classification method based on a generalized characteristic word stock, comprising the following steps: step S1, establishing a generalized characteristic word stock; step S2, preprocessing a text to be classified, calling the generalized characteristic word stock to perform regular-expression matching on the preprocessed text, generating a new text to be classified, and obtaining a multi-level label of the new text to be classified; step S3, segmenting the new text to be classified with a word segmentation tool to obtain a text word set, splitting the text word set to form characters, and converting the characters into a feature vector matrix through the TF-IDF algorithm; and step S4, inputting the feature vector matrix into a text classifier, generating a text classification model, and outputting a classification result of the text to be classified. The invention can greatly improve the efficiency, speed and accuracy of public data classification.
Description
Technical Field
The invention relates to the technical field of natural language processing, and in particular to a public data classification method based on a generalized characteristic word stock.
Background
With the advance of urban digital transformation and the centralized, unified management of public data, the classification and grading of public data urgently needs to be solved. In particular, the security grading of a public data directory determines which data can be shared and opened unconditionally, and which data are subject to conditional sharing and opening, or are neither opened nor shared, because of personal privacy, core business confidentiality or relevant laws and regulations. Data authorization, sharing and opening can then be carried out according to different application scenarios, so that data empowers urban management and a data operation ecosystem is formed. At present, public data is mainly classified manually, relying on the background knowledge of professionals and the regulations they consult; this manual approach depends on the ability of the staff, involves a huge workload and is inefficient.
Therefore, a text classification technology from the natural language processing field of artificial intelligence is proposed, which can greatly improve the efficiency and speed of public data classification while also improving classification accuracy.
At present, text classification is mainly implemented with statistical learning methods or deep learning methods. The former centers on feature selection: word-level and sentence-level features of a text are selected through indexes such as TF-IDF (term frequency-inverse document frequency, a weighting technique commonly used in information retrieval and data mining), PMI (pointwise mutual information) and the chi-square value to obtain a feature vector representing the text, and a machine learning method then yields the probability of each label for the feature vector, which serves as the final classification criterion. The latter centers on model construction: discrete information of the text is taken as input, and the network weights are updated through the serial and parallel structures of a multilayer neural network and the back-propagation algorithm, so that the probability of the text on each label is obtained directly.
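For reference, the two indexes named above are commonly written as follows. This is a sketch of the textbook, unsmoothed definitions, not a formula taken from the patent. The symbols are assumptions introduced here: tf(t, d) is the relative frequency of term t in document d, N is the number of documents, df(t) is the number of documents containing t, and P(t, c) is the joint probability of term t and label c.

```latex
\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d) \cdot \log \frac{N}{\mathrm{df}(t)},
\qquad
\mathrm{PMI}(t, c) = \log \frac{P(t, c)}{P(t)\, P(c)}
```

Under these definitions, a term that occurs in every document receives a TF-IDF weight of zero, and a term that is statistically independent of a label has a PMI of zero.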
However, public data classification requires extracting not only the feature words in the description of the data being classified but also the feature words in the relevant legal provisions, and appropriately increasing the weight of these feature words. A text classification method dedicated to grading public data is therefore proposed herein.
Disclosure of Invention
To solve the above technical problems, the invention provides a public data classification method based on a generalized characteristic word stock.
The invention solves these technical problems with the following technical solution:
A public data classification method based on a generalized characteristic word stock, characterized by comprising the following steps:
step S1, establishing a generalized characteristic word stock;
step S2, preprocessing the text to be classified, calling the generalized characteristic word stock to perform regular-expression matching on the preprocessed text, generating a new text to be classified, and obtaining a multi-level label of the new text to be classified;
step S3, segmenting the new text to be classified with a word segmentation tool to obtain a text word set; splitting the text word set to form characters, and converting the characters into a feature vector matrix through the TF-IDF algorithm;
and step S4, inputting the feature vector matrix into a text classifier, generating a text classification model, and outputting a classification result of the text to be classified.
Preferably, step S1 includes:
step S10, obtaining a corpus, and eliminating homogeneous data in the corpus;
step S11, classifying the corpus after the homogeneous data has been eliminated;
step S12, sorting the classified data in the corpus;
step S13, extracting the first N items of the sorted corpus and storing them in a document, wherein N is larger than 1;
and step S14, processing the document to establish the generalized characteristic word stock.
Preferably, in step S10, a large amount of text data is obtained from public data as the corpus, and a value of N is preset to remove the homogeneous data in the corpus.
Preferably, preprocessing the text to be classified in step S2 includes: removing sensitive words, garbled characters and punctuation marks from the text to be classified so as to remove its redundant parts.
Preferably, in step S3, the jieba word segmentation method is used to segment the new text to be classified.
Preferably, the jieba word segmentation method performs full segmentation on the sentences in the new text to be classified to generate the text word set.
Preferably, in step S3, the text word set is split item by item with a space as the delimiter to form the characters, the characters in each line of the new text to be classified are read, the frequency of occurrence of each character is calculated through the TF-IDF algorithm, and the feature vector matrix is established.
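The conversion of space-delimited lines into a feature vector matrix can be sketched as below. This is a minimal illustration that assumes the plain, unsmoothed TF-IDF weighting; the patent does not state which variant it uses. The function name `tfidf_matrix` and the sample lines are hypothetical.

```python
import math
from collections import Counter

def tfidf_matrix(lines):
    """Turn space-delimited lines into (vocabulary, TF-IDF matrix).

    Assumption: unsmoothed weighting, tf = count / line length and
    idf = log(N / df); the patent does not name a specific variant.
    """
    docs = [line.split() for line in lines]           # one token list per line
    vocab = sorted({tok for doc in docs for tok in doc})
    n_docs = len(docs)
    # Document frequency: number of lines in which each token occurs.
    df = Counter(tok for doc in docs for tok in set(doc))
    idf = {tok: math.log(n_docs / df[tok]) for tok in vocab}
    matrix = []
    for doc in docs:
        tf = Counter(doc)
        total = len(doc) or 1                         # guard against empty lines
        matrix.append([tf[tok] / total * idf[tok] for tok in vocab])
    return vocab, matrix

# Hypothetical sample: two already segmented lines.
vocab, matrix = tfidf_matrix(["identity number address", "enterprise number"])
print(vocab)
```

Each row of `matrix` corresponds to one line of the new text and each column to one token of `vocab`. In this sample the token "number" occurs in every line, so its weight is zero in both rows.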
Preferably, in step S4, the feature vector matrix is converted into one input vector of the text classifier, the multi-class label is converted into another input vector of the text classifier, the text classification model is generated by invoking the training algorithm of the text classifier, and the classification result of the text to be classified is output.
Preferably, the classification result is the accuracy of classifying the text to be classified and a TopN ranking.
Preferably, the text classifier is a support vector machine classifier.
The beneficial effects are as follows:
The public data classification method based on the generalized characteristic word stock can greatly improve the efficiency, speed and accuracy of public data classification.
Drawings
FIG. 1 is a step diagram of the public data classification method based on a generalized characteristic word stock according to the present invention;
FIG. 2 is a flowchart of an embodiment of step S1 in FIG. 1.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be noted that the embodiments and features of the embodiments may be combined with each other without conflict.
The invention is further described with reference to the following drawings and specific examples, which are not intended to be limiting.
Referring to FIG. 1, the public data classification method based on a generalized characteristic word stock provided by the present invention includes:
step S1, establishing a generalized characteristic word stock;
step S2, preprocessing the text to be classified, calling the generalized characteristic word stock to perform regular-expression matching on the preprocessed text, generating a new text to be classified, and obtaining a multi-level label of the new text to be classified;
step S3, segmenting the new text to be classified with a word segmentation tool to obtain a text word set; splitting the text word set to form characters, and converting the characters into a feature vector matrix through the TF-IDF algorithm;
and step S4, inputting the feature vector matrix into a text classifier, generating a text classification model, and outputting a classification result of the text to be classified.
Referring to FIG. 2, in one embodiment step S1 of FIG. 1 includes:
step S10, obtaining a corpus, and eliminating homogeneous data in the corpus;
step S11, classifying the corpus after the homogeneous data has been eliminated;
step S12, sorting the classified data in the corpus;
step S13, extracting the first N items of the sorted corpus and storing them in a document, wherein N is larger than 1;
and step S14, processing the document to establish the generalized characteristic word stock.
Further, in step S10, a large amount of text data is obtained from public data as the corpus, and a preset value of N is used to remove the homogeneous data in the corpus.
Specifically, a value of N is preset, the data in the corpus are sorted by a TopN algorithm, the first N items of the sorted corpus are extracted and stored in a document, and the document is packaged through compiled code to generate the generalized characteristic word stock. The generalized characteristic word stock provided by the invention is not limited to a particular kind of text data: it can process various text data and filters out conventional, non-specific words, and processing the text to be classified with the generalized characteristic word stock speeds up its classification.
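Steps S10 to S14 can be sketched as follows. This is only an illustration under stated assumptions: the patent does not say how homogeneous data is detected or what the sort key is, so here homogeneous records are taken to be exact duplicates and terms are ranked by frequency. The function name `build_word_stock` and the English sample corpus are hypothetical.

```python
from collections import Counter

def build_word_stock(corpus, n):
    """Return the n most frequent terms of a de-duplicated corpus.

    Assumptions (not stated in the patent): homogeneous data means exact
    duplicate records, and the TopN sort key is term frequency.
    """
    if n <= 1:
        raise ValueError("N must be larger than 1")
    # S10: drop duplicate records while keeping the original order.
    unique_docs = list(dict.fromkeys(doc.strip() for doc in corpus))
    # S11/S12: count whitespace-separated terms and sort them by
    # descending frequency, breaking ties alphabetically.
    counts = Counter(term for doc in unique_docs for term in doc.split())
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    # S13: keep the first N terms; S14 would persist them to a document.
    return [term for term, _ in ranked[:n]]

corpus = [
    "resident identity number",
    "resident identity number",       # duplicate record, removed in S10
    "resident home address",
    "enterprise registration number",
]
word_stock = build_word_stock(corpus, 3)
print(word_stock)
```

Writing `word_stock` to a file, one term per line, would correspond to storing the first N items in a document before the word stock is packaged.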
Further, preprocessing the text to be classified in step S2 includes: removing sensitive words, garbled characters and punctuation marks from the text to be classified so as to remove its redundant parts.
Specifically, the text to be classified is first preprocessed, that is, sensitive words, garbled characters, punctuation marks and the like are removed so as to eliminate its redundant parts. The text is then matched against the generalized characteristic word stock with regular expressions, which filters it further and yields the new text to be classified together with its multi-class label, providing a sound basis for building the text classification model.
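The preprocessing and regular-expression matching of step S2 might look like the sketch below. Several points are assumptions: the sensitive-word list is hypothetical, garbled characters are approximated by control characters and the Unicode replacement character, and the new text is taken to consist of the word-stock terms found in the input. How the label is derived is not described in the patent and is not modeled here.

```python
import re

SENSITIVE_WORDS = ["password"]   # hypothetical list, not from the patent

def preprocess(text, sensitive_words=SENSITIVE_WORDS):
    """Remove sensitive words, garbled characters and punctuation."""
    for word in sensitive_words:
        text = text.replace(word, "")
    # Garbled characters: control characters and U+FFFD (an assumption).
    text = re.sub(r"[\ufffd\x00-\x08\x0e-\x1f]", "", text)
    # Punctuation: anything that is neither a word character nor whitespace.
    text = re.sub(r"[^\w\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip()

def match_word_stock(text, word_stock):
    """Keep only the word-stock terms that occur in the text."""
    if not word_stock:
        return "", []
    # Longer terms come first so that they win over their own substrings.
    terms = sorted(word_stock, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(term) for term in terms))
    hits = pattern.findall(text)
    return " ".join(hits), hits

raw = "resident identity number: 1234\ufffd; password, home address!"
clean = preprocess(raw)
new_text, hits = match_word_stock(clean, ["identity number", "address", "number"])
print(clean)
print(new_text)
```

Sorting the terms by length before building the alternation keeps "identity number" from being split into a shorter match on "number" alone.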
Further, in step S3, the jieba word segmentation method is used to segment the new text to be classified.
Further, the jieba word segmentation method performs full segmentation on the sentences in the new text to be classified to generate a text word set.
Specifically, the jieba word segmentation method can be used to segment the new text to be classified. The sentences in the new text to be classified are fully segmented to generate a word graph represented by an adjacency list, namely the text word set. The text word set is then split item by item with a space as the delimiter, that is, a space is inserted between the words of the text word set. A TF-IDF structure is then established, the characters in each line of the new text to be classified are read, the frequency of each character is calculated, and the feature vector matrix is built.
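The word graph produced by full segmentation can be illustrated with a small dictionary-based sketch. It is not the segmentation tool's actual implementation: the tiny dictionary is hypothetical, and the path through the graph is chosen by greedy longest match, whereas a real segmenter typically picks the most probable path.

```python
def build_word_graph(sentence, dictionary):
    """Full segmentation: map each start index to every end index such
    that sentence[start:end] is a dictionary word (an adjacency list)."""
    graph = {}
    n = len(sentence)
    for start in range(n):
        ends = [end for end in range(start + 1, n + 1)
                if sentence[start:end] in dictionary]
        # Fall back to a single character so every position has an edge.
        graph[start] = ends or [start + 1]
    return graph

def greedy_tokens(sentence, graph):
    """Walk the graph, taking the longest word at each position
    (a simplification of probability-based path selection)."""
    tokens, i = [], 0
    while i < len(sentence):
        end = max(graph[i])
        tokens.append(sentence[i:end])
        i = end
    return tokens

dictionary = {"公共", "数据", "公共数据", "分级"}   # hypothetical mini-dictionary
sentence = "公共数据分级"
graph = build_word_graph(sentence, dictionary)
spaced = " ".join(greedy_tokens(sentence, graph))  # space as the delimiter
print(graph)
print(spaced)
```

Joining the tokens with spaces produces the space-delimited line that the TF-IDF stage reads.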
Further, in step S4, the feature vector matrix is converted into one input vector of the text classifier, the multi-class label is converted into another input vector of the text classifier, a text classification model is generated by invoking the training algorithm of the text classifier, and a classification result of the text to be classified is output.
Further, the classification result is the accuracy of classifying the text to be classified and a TopN ranking.
Further, the text classifier is a support vector machine classifier.
Specifically, the feature vector matrix is converted into a vector x and the multi-class label into a vector y. By calling a pattern recognition and regression software package and the svm_train(y, x) training algorithm of the support vector machine classifier, the vectors x and y are input into the support vector machine classifier, a text classification model is generated, and the accuracy of classifying the text to be classified and a TopN ranking list are obtained. With the support vector machine classifier, no numerical optimization algorithm is needed and the matrix need not be stored, which improves text classification efficiency.
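The training and ranking step can be sketched without any external package. The code below is a stand-in, not the svm_train routine the text refers to: it trains one linear SVM per label by sub-gradient descent on the hinge loss (one-vs-rest) and ranks labels by decision score. All function names, hyperparameters and the toy data are assumptions made for illustration.

```python
def train_linear_svm(x, y, epochs=200, lr=0.1, reg=0.01):
    """Binary linear SVM (labels +1/-1) trained by sub-gradient descent
    on the L2-regularized hinge loss; a stand-in for a library trainer."""
    w, b = [0.0] * len(x[0]), 0.0
    for _ in range(epochs):
        for xi, yi in zip(x, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            w = [wj * (1 - lr * reg) for wj in w]      # L2 shrinkage
            if margin < 1:                             # inside the margin
                w = [wj + lr * yi * xj for wj, xj in zip(w, xi)]
                b += lr * yi
    return w, b

def train_one_vs_rest(x, labels):
    """Train one binary model per label."""
    return {c: train_linear_svm(x, [1 if l == c else -1 for l in labels])
            for c in sorted(set(labels))}

def rank_labels(models, xi, top_n):
    """Return the top_n labels ordered by decision score (TopN ranking)."""
    scores = {c: sum(wj * xj for wj, xj in zip(w, xi)) + b
              for c, (w, b) in models.items()}
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

# Toy feature vectors x and labels y (hypothetical data).
x = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
y = ["restricted", "restricted", "open", "open"]
models = train_one_vs_rest(x, y)
predictions = [rank_labels(models, xi, 1)[0] for xi in x]
accuracy = sum(p == l for p, l in zip(predictions, y)) / len(y)
print(accuracy, rank_labels(models, [1.0, 0.0], 2))
```

In practice the rows of the TF-IDF matrix would take the place of `x`, and the accuracy would be measured on held-out texts rather than on the training data used here.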
While the invention has been described with reference to a preferred embodiment, it will be understood by those skilled in the art that various changes in form and detail may be made therein without departing from the spirit and scope of the invention.
Claims (10)
1. A public data classification method based on a generalized characteristic word stock, characterized by comprising the following steps:
step S1, establishing a generalized characteristic word stock;
step S2, preprocessing the text to be classified, calling the generalized characteristic word stock to perform regular-expression matching on the preprocessed text, generating a new text to be classified, and obtaining a multi-level label of the new text to be classified;
step S3, segmenting the new text to be classified with a word segmentation tool to obtain a text word set; splitting the text word set to form characters, and converting the characters into a feature vector matrix through the TF-IDF algorithm;
and step S4, inputting the feature vector matrix into a text classifier, generating a text classification model, and outputting a classification result of the text to be classified.
2. The public data classification method based on a generalized characteristic word stock as claimed in claim 1, wherein step S1 comprises:
step S10, obtaining a corpus, and eliminating homogeneous data in the corpus;
step S11, classifying the corpus after the homogeneous data has been eliminated;
step S12, sorting the classified data in the corpus;
step S13, extracting the first N items of the sorted corpus and storing them in a document, wherein N is larger than 1;
and step S14, processing the document to establish the generalized characteristic word stock.
3. The method as claimed in claim 1, wherein in step S10, a large amount of text data is obtained from public data as the corpus, and a value of N is preset to remove the homogeneous data in the corpus.
4. The method as claimed in claim 1, wherein preprocessing the text to be classified in step S2 comprises: removing sensitive words, garbled characters and punctuation marks from the text to be classified so as to remove its redundant parts.
5. The public data classification method based on a generalized characteristic word stock as claimed in claim 1, wherein in step S3 the new text to be classified is segmented using the jieba word segmentation method.
6. The method as claimed in claim 5, wherein the jieba word segmentation method performs full segmentation on the sentences in the new text to be classified to generate the text word set.
7. The method as claimed in claim 1, wherein in step S3, the text word set is split item by item with a space as the delimiter to form the characters, the characters in each line of the new text to be classified are read, the frequency of occurrence of each character is calculated through the TF-IDF algorithm, and the feature vector matrix is established.
8. The method as claimed in claim 1, wherein in step S4, the feature vector matrix is converted into one input vector of the text classifier, the multi-class label is converted into another input vector of the text classifier, the text classification model is generated by invoking the training algorithm of the text classifier, and the classification result of the text to be classified is output.
9. The method as claimed in claim 1, wherein the classification result is the accuracy of classifying the text to be classified and a TopN ranking.
10. The method of claim 1, wherein the text classifier is a support vector machine classifier.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010066137.2A CN111309904A (en) | 2020-01-20 | 2020-01-20 | Public data classification method based on generalized characteristic word stock |
Publications (1)
Publication Number | Publication Date |
---|---|
CN111309904A (en) | 2020-06-19 |
Family
ID=71156399
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010066137.2A Pending CN111309904A (en) | 2020-01-20 | 2020-01-20 | Public data classification method based on generalized characteristic word stock |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111309904A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112257425A (en) * | 2020-09-29 | 2021-01-22 | 国网天津市电力公司 | Power data analysis method and system based on data classification model |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104750844A (en) * | 2015-04-09 | 2015-07-01 | 中南大学 | Method and device for generating text characteristic vectors based on TF-IGM, method and device for classifying texts |
CN108520030A (en) * | 2018-03-27 | 2018-09-11 | 深圳中兴网信科技有限公司 | File classification method, Text Classification System and computer installation |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20200619 |