CN113051462A - Multi-classification model training method, system and device - Google Patents

Multi-classification model training method, system and device

Info

Publication number
CN113051462A
CN113051462A (application CN201911363343.3A)
Authority
CN
China
Prior art keywords
text data
data
text
training
piece
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201911363343.3A
Other languages
Chinese (zh)
Inventor
张剑
骆起峰
程刚
王昕
刘轶
黄石磊
杨大明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Raisound Technology Co ltd
Original Assignee
Shenzhen Raisound Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Raisound Technology Co ltd filed Critical Shenzhen Raisound Technology Co ltd
Priority to CN201911363343.3A priority Critical patent/CN113051462A/en
Publication of CN113051462A publication Critical patent/CN113051462A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/90 Details of database functions independent of the retrieved data types
    • G06F 16/95 Retrieval from the web
    • G06F 16/953 Querying, e.g. by the use of web search engines
    • G06F 16/9535 Search customisation based on user profiles and personalisation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval of unstructured textual data
    • G06F 16/35 Clustering; Classification

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a multi-classification model training method, system and device. The method comprises a data preprocessing step, a classification step and a clustering step. The data preprocessing step performs preliminary processing and balancing on collected news data, divides the data into a training set and a test set, segments the text into words, and constructs each piece of text data into a format with a category label. The classification step builds a text classification model and trains it on the text data in the training set to obtain a classifier; the classifier is then tested on each test subset, and the companies whose test accuracy falls below a threshold are screened out to build a list. The clustering step retrieves the text data of each company in the list, converts each piece into a vector, clusters the vectors, and trains a binary classification model. Compared with traditional machine learning methods, the scheme allows the classification performance to keep improving as data accumulates; compared with deep-model methods, it depends less on large amounts of training data.

Description

Multi-classification model training method, system and device
Technical Field
The invention relates to the technical field of data processing, in particular to a multi-classification model training method, a multi-classification model training system and a multi-classification model training device.
Background
With the rapid development of the internet, society has entered the information age, and the network has become an indispensable channel for people to learn about the outside world. At present, people mainly learn about a company's operating condition through related news on the internet, but the internet is flooded with massive amounts of news, and finding the news related to the companies one cares about within this mass of news is a major problem.
Multi-classification of company news and public opinion relevance classifies massive news and screens out the news about the companies people are interested in, effectively improving productivity. It is essentially a text classification problem, a core sub-problem of natural language processing. Text classification mainly involves text preprocessing, feature extraction, text representation and the classifier. It is widely applied in many fields, such as web page classification, microblog sentiment analysis, user comment mining, information retrieval and automatic classification of Web documents, but few methods apply text classification to company news and public opinion relevance.
Current research on text classification mainly follows two approaches. The first is based on traditional machine learning and can be summarized as feature engineering plus a shallow classification model. The feature engineering stage converts the data provided to the machine into information; because different tasks impose different requirements on features, it must be analyzed case by case. The classifier is mainly based on statistical classification, and essentially most classifier learning methods have been applied to text classification, such as Support Vector Machines (SVM) and the naive Bayes classification algorithm. The second is text classification based on deep learning, which uses network structures such as Convolutional Neural Networks (CNN) and Recurrent Neural Networks (RNN) to obtain feature representations automatically and then classify them, solving the problem end to end. The main evaluation metrics for text classification are accuracy and recall. Accuracy is the proportion of the classifier's predictions that are correct out of all predictions; recall is the proportion of all truly positive samples that the classifier classifies correctly. Some works also evaluate with the F1 score, the harmonic mean of precision and recall.
In practice, both the machine-learning-based method and the deep-model-based method described above have drawbacks.
The traditional machine learning method does not require large amounts of training data, but because different tasks have different feature requirements, relevant features must be designed separately for each text classification task. Its operation is therefore relatively complex, requires considerable skill, and its performance varies from person to person.
The accuracy of the deep-model-based method is directly affected by the quantity and quality of the training data. Deep learning is a data-driven method: a large amount of high-quality training data can effectively improve classification accuracy, but acquiring such data requires considerable manpower and material resources, and constructing it is costly.
Disclosure of Invention
The embodiments of the invention aim to provide a multi-classification model training method for training a multi-classification model that can face the ever-growing volume of news data and achieve high-quality, high-efficiency news classification. The embodiments of the invention also aim to provide a corresponding system and device.
In order to achieve the purpose, the invention adopts the following technical scheme:
in a first aspect, a multi-classification model training method is provided, including:
a data preprocessing step: performing preliminary processing on collected news data about a plurality of companies, balancing the text data obtained from the preliminary processing, dividing the balanced text data into two subsets used respectively as a training set and a test set, performing word segmentation on each piece of text data in the two subsets, and constructing each segmented piece of text data into a format with a category label, wherein different category labels represent different companies;
a classification step: constructing a text classification model and training it on the text data in the training set to obtain a final classifier; dividing the text data in the test set into a plurality of test subsets according to the category labels, testing each test subset with the final classifier, and screening out the companies whose test accuracy is below a threshold to construct a list;
a clustering step: finding each piece of text data of each company in the list from the preliminarily processed text data of the preprocessing step, converting each piece of text data into a vector, clustering the resulting vectors, and training a binary classification model.
With reference to the first aspect, in a possible implementation manner, the data preprocessing step specifically includes:
collecting news data relating to different companies;
extracting keywords and related news contents from each piece of collected news data, storing the keywords and the related news contents as the text data, and removing non-text parts in the text data, wherein the keywords comprise company identification information;
carrying out equalization processing on the obtained text data based on the company identification information, and dividing the text data obtained after the equalization processing into two subsets without intersection to be respectively used as a training set and a test set;
constructing a stop word list and a special dictionary, combining a jieba Chinese word segmentation algorithm, segmenting each piece of text data in a training set and a test set, and intercepting the text data with the length exceeding a preset value;
and constructing each piece of data after word segmentation into a format with a category label, wherein different category labels represent different companies so as to be suitable for a fast text classification algorithm fasttext.
With reference to the first aspect, in a possible implementation manner, the classifying step specifically includes:
constructing a text classification model by using fasttext, inputting text data in a training set into the text classification model, and training to obtain an initial classifier; classifying and testing each piece of text data in the test set by using the initial classifier, and judging the test accuracy; training to obtain a final classifier by repeatedly adjusting parameters of the text classification model;
dividing the text data in the test set into a plurality of test subsets according to the category labels, testing the category of each test subset by using a final classifier, and screening out companies with the test accuracy smaller than a threshold value to construct a list.
With reference to the first aspect, in a possible implementation manner, the clustering specifically includes:
finding out each piece of text data of each company in the list from the text data subjected to preliminary processing in the preprocessing step, performing word segmentation, performing word frequency statistics, and constructing a keyword dictionary of each company by combining manual intervention;
converting each piece of text data after word segmentation in the last step into a multi-dimensional vector, clustering the multi-dimensional vectors related to each company, carrying out word frequency statistics according to keywords, dividing a plurality of clusters obtained by clustering into two samples, wherein the two samples are a positive sample and a negative sample, and training to obtain a binary classification model.
In a second aspect, a multi-classification model training system is provided, including:
a data pre-processing module, configured to: perform preliminary processing on collected news data about a plurality of companies, balance the text data obtained from the preliminary processing, divide the balanced text data into two subsets used respectively as a training set and a test set, perform word segmentation on each piece of text data in the two subsets, and construct each segmented piece of data into a format with a category label, wherein different category labels represent different companies;
a classification module, configured to: construct a text classification model and train it on the text data in the training set to obtain a final classifier; divide the text data in the test set into a plurality of test subsets according to the category labels, test each test subset with the final classifier, and screen out the companies whose test accuracy is below a threshold to construct a list;
a clustering module, configured to: find each piece of text data of each company in the list from the preliminarily processed text data of the preprocessing step, convert each piece of text data into a vector, cluster the resulting vectors, and train a binary classification model.
With reference to the second aspect, in a possible implementation manner, the data preprocessing module is specifically configured to:
collecting news data relating to different companies;
extracting keywords and related news contents from each piece of collected news data, storing the keywords and the related news contents as the text data, and removing non-text parts in the text data, wherein the keywords comprise company identification information;
carrying out equalization processing on the obtained text data based on the company identification information, and dividing the text data obtained after the equalization processing into two subsets without intersection to be respectively used as a training set and a test set;
constructing a stop word list and a special dictionary, combining a jieba Chinese word segmentation algorithm, segmenting each piece of text data in a training set and a test set, and intercepting the text data with the length exceeding a preset value;
and constructing each piece of data after word segmentation into a format with a category label, wherein different category labels represent different companies so as to be suitable for a fast text classification algorithm fasttext.
With reference to the second aspect, in a possible implementation manner, the classification module is specifically configured to:
constructing a text classification model by using fasttext, inputting text data in a training set into the text classification model, and training to obtain an initial classifier; classifying and testing each piece of text data in the test set by using the initial classifier, and judging the test accuracy; training to obtain a final classifier by repeatedly adjusting parameters of the text classification model;
dividing the text data in the test set into a plurality of test subsets according to the category labels, testing the category of each test subset by using a final classifier, and screening out companies with the test accuracy smaller than a threshold value to construct a list.
With reference to the second aspect, in a possible implementation manner, the clustering module is specifically configured to:
finding out each piece of text data of each company in the list from the text data subjected to preliminary processing in the preprocessing step, performing word segmentation, performing word frequency statistics, and constructing a keyword dictionary of each company by combining manual intervention;
converting each piece of text data after word segmentation in the last step into a multi-dimensional vector, clustering the multi-dimensional vectors related to each company, carrying out word frequency statistics according to keywords, dividing a plurality of clusters obtained by clustering into two samples, wherein the two samples are a positive sample and a negative sample, and training to obtain a binary classification model.
In a third aspect, a computer device is provided, which includes a processor and a memory, wherein the memory stores a program, and the program includes computer executable instructions, and when the computer device runs, the processor executes the computer executable instructions stored in the memory, so as to cause the computer device to execute the multi-classification model training method according to the first aspect.
In a fourth aspect, there is provided a computer readable storage medium storing one or more programs, the one or more programs comprising computer executable instructions, which when executed by a computer device, cause the computer device to perform the multi-classification model training method of the first aspect.
According to the technical scheme, the embodiment of the invention has the following advantages:
the multi-classification model obtained by the method training is suitable for the increasing news data, and high-quality and high-efficiency news classification can be realized. Compared with the traditional machine learning method, the scheme of the invention can make the classification effect better and better; compared with the depth model method, the method has lower dependence on data than the prior art.
Drawings
In order to more clearly illustrate the technical solution of the embodiment of the present invention, the drawings used in the description of the embodiment will be briefly introduced below.
FIG. 1 is a flow chart of a multi-classification model training method according to an embodiment of the present invention;
FIG. 2 is a general framework schematic diagram of a multi-classification model provided by one embodiment of the invention;
FIG. 3 is a flow chart illustrating the steps of data preprocessing in one embodiment of the present invention;
FIG. 4 is a flow chart illustrating the classification step in one embodiment of the present invention;
FIG. 5 is a schematic flow chart of the clustering step in one embodiment of the present invention;
FIG. 6 is a schematic structural diagram of a multi-classification model training system according to an embodiment of the present invention;
fig. 7 is a schematic structural diagram of a computer device according to an embodiment of the present invention.
Detailed Description
In order to make the technical solutions of the present invention better understood by those skilled in the art, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be obtained by a person skilled in the art without any inventive step based on the embodiments of the present invention, shall fall within the scope of protection of the present invention.
The terms "comprising" and "having" and any variations thereof in the description and claims of the invention and the above-described drawings are intended to cover non-exclusive inclusions. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not limited to only those steps or elements listed, but may alternatively include other steps or elements not listed, or inherent to such process, method, article, or apparatus.
The following will explain details by way of specific examples.
The embodiment of the invention provides a training method for a multi-classification model of the relevance between news/public opinion and different subjects, used to train such a multi-classification model (hereinafter simply called the multi-classification model). The subject may be any of various organizations or units, such as companies and schools; a company is used as the example herein. The multi-classification model mainly consists of three parts: a data preprocessing part, a classification part and a clustering part. The data preprocessing part collects and preprocesses news data, including word segmentation. The classification part classifies the news data and assigns each item a category label, where each category label indicates which company the news relates to. The clustering part clusters and groups the classified news data according to the classification result.
Referring to fig. 1 and 2, an embodiment of the present invention provides a method for training a multi-classification model of company news and public opinion relevance, the method comprising:
data preprocessing step S1: performing preliminary processing on collected news data about a plurality of companies, balancing the text data obtained from the preliminary processing, dividing the balanced text data into two subsets used respectively as a training set and a test set, performing word segmentation on each piece of text data in the two subsets, and constructing each segmented piece of text data into a format with a category label, wherein different category labels represent different companies;
classification step S2: constructing a text classification model and training it on the text data in the training set to obtain a final classifier; dividing the text data in the test set into a plurality of test subsets according to the category labels, testing each test subset with the final classifier, and screening out the companies whose test accuracy is below a threshold to construct a list;
clustering step S3: finding each piece of text data of each company in the list from the preliminarily processed text data of the preprocessing step, converting each piece of text data into a vector, clustering the resulting vectors, and training a binary classification model.
In some embodiments, the multi-classification model trained by the method may include the following modules: the data acquisition module is used for acquiring news data; a pre-processing module for pre-processing the data, such as preliminary processing and equalization processing as described above; the word segmentation module is used for performing Chinese word segmentation on the text data obtained after preprocessing; a classification module for classifying the text data; and the clustering module is used for clustering.
The three steps are described in further detail below with reference to fig. 3 to 5.
(1) Data preprocessing step. As shown in fig. 3, this step may include the following sub-steps:
A) Raw news data is collected from the internet using a web crawler or similar tool. The raw news data is mostly in BSON (Binary JSON) format and consists of keywords and related news content. In this step, the key information relevant to our text classification is first extracted from the BSON data, including the keywords (e.g., company identification information such as stock codes) and the related news content, and the extracted content is stored as a text file, for example a TXT file, to obtain news data in text format, i.e., text data.
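For illustration, a minimal sketch of this extraction step in Python, assuming the crawler output is a local BSON dump and using hypothetical field names stock_code and content (the actual keys are not specified in this document):

```python
import bson  # provided by the pymongo package

def bson_to_txt(bson_path, txt_path):
    """Extract company identifiers and news bodies from a raw BSON dump
    and store them as plain text: one news item per line."""
    with open(bson_path, "rb") as f:
        records = bson.decode_all(f.read())
    with open(txt_path, "w", encoding="utf-8") as out:
        for doc in records:
            # 'stock_code' and 'content' are hypothetical field names;
            # substitute the keys actually produced by the crawler.
            code = doc.get("stock_code", "")
            content = doc.get("content", "")
            if code and content:
                out.write(f"{code}\t{content}\n")

# bson_to_txt("raw_news.bson", "news.txt")
```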
B) For news data in text format such as TXT, the non-text parts removed from the content of the text data mainly include: HTML tags, Chinese and English punctuation, garbled characters produced during crawling, digits, English characters and so on. The processed text data set can then be stored into a single aggregate file.
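A possible cleaning routine for this step; the regular expressions below are an assumption of one reasonable implementation, not the patent's exact rules:

```python
import re

def clean_text(text: str) -> str:
    """Strip HTML tags, then keep only CJK characters, which also removes
    punctuation, mojibake, digits and English letters."""
    text = re.sub(r"<[^>]+>", " ", text)          # HTML tags
    text = re.sub(r"[^\u4e00-\u9fa5]", "", text)  # drop everything that is not a Chinese character
    return text
```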
C) The number of news items related to each company suffers from sample imbalance: some companies have tens of thousands of related news items while others have only a few. To address this, the text data obtained from the preliminary processing in steps A) and B) are balanced in this step. The balancing may, for example, select training and test data according to a ratio. Optionally, the main balancing rules are: for a company with more than a certain number of related news items, such as 500, only 500 articles are selected, and the selection criteria are that the news title contains the related company's name, abbreviation or stock code, or that such identification information is repeated more than three times in the news. For companies with fewer than 500 related news items, the following rules may be adopted: first, the news data are screened with the two criteria above; second, all the remaining news data are selected as well. The text data kept after this balancing step are then divided into two subsets, used respectively as a training set and a test set. Optionally, the training set and test set are disjoint subsets, and the division ratio may be 7:3, for example 70% of all screened data as the training set and the remaining 30% as the test set.
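A rough sketch of the balancing rules and the 7:3 split described above; the scoring details and the random seed are assumptions made for illustration:

```python
import random

MAX_PER_COMPANY = 500

def select_news(items, name, abbrev, code):
    """items: list of (title, body) tuples for one company.
    Keep at most 500 articles, preferring those whose title mentions the
    company (name, abbreviation or stock code) or whose body repeats the
    identification information most often."""
    idents = [x for x in (name, abbrev, code) if x]
    def score(item):
        title, body = item
        in_title = any(x in title for x in idents)
        repeats = sum(body.count(x) for x in idents)
        return (in_title, repeats)
    if len(items) <= MAX_PER_COMPANY:
        return items
    return sorted(items, key=score, reverse=True)[:MAX_PER_COMPANY]

def split_train_test(items, ratio=0.7, seed=42):
    """Disjoint 7:3 split into training and test subsets."""
    random.Random(seed).shuffle(items)
    cut = int(len(items) * ratio)
    return items[:cut], items[cut:]
```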
D) In this step, word segmentation is performed on each piece of text data in the two subsets. One implementation is as follows: a dictionary dedicated to company news and public opinion is constructed, mainly containing company names such as the A-share abbreviation, B-share abbreviation and stock codes, so that these proper company names are not split into separate words during Chinese word segmentation. Meanwhile, word frequencies are counted over all the news; the words judged useful are kept through manual intervention, and the remaining words become stop words, yielding a stop word list. Each piece of text data in the training set and the test set can then be segmented with the jieba Chinese word segmentation algorithm, combined with the constructed stop word list and dedicated dictionary. Optionally, the text data can be further processed after segmentation: for example, news shorter than n1 characters is filtered out and news longer than n2 characters is truncated, where n1 and n2 are positive integers with n1 < n2, ensuring the length of each piece of text data lies between n1 and n2. Note that jieba is a popular open-source Chinese word segmenter for natural language processing, available for the Python programming language.
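A minimal jieba-based segmentation sketch; the file names company_dict.txt and stopwords.txt and the length bounds are placeholders, not values given in this document:

```python
import jieba

jieba.load_userdict("company_dict.txt")  # one entry per line: full name, A-share abbreviation, stock code, ...

with open("stopwords.txt", encoding="utf-8") as f:
    stopwords = set(line.strip() for line in f)

N1, N2 = 10, 1000  # assumed length bounds; the text only requires n1 < n2

def segment(text):
    """Cut one piece of text with jieba, drop stop words, and enforce the
    character-length window [n1, n2]."""
    if len(text) < N1:
        return None                      # filter out news that is too short
    text = text[:N2]                     # truncate news that is too long
    return [w for w in jieba.lcut(text) if w.strip() and w not in stopwords]
```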
E) In this step, the segmented text data can be further processed and constructed into a format with category labels, where different category labels represent different companies, for training and testing with the fast text classification algorithm fastText or other text classification algorithms. Optionally, one piece of training or test data may take the format (__label__n word_1 word_2 word_3 ... word_n), where n denotes the company's identification information such as its stock code, prefixed with __label__ to form the label, and word_1, ..., word_n are the words obtained by segmenting the news data, each word_x representing a single Chinese word, with n and x positive integers.
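A small helper illustrating the fastText line format described above (the example values are hypothetical):

```python
def to_fasttext_line(stock_code, words):
    """Format one segmented news item as a fastText supervised-learning line:
    '__label__<company id> word_1 word_2 ... word_n'."""
    return f"__label__{stock_code} " + " ".join(words)

# to_fasttext_line("600000", ["银行", "发布", "年报"])
# -> '__label__600000 银行 发布 年报'
```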
(2) Classification step.
Herein, classification of news text data may be implemented with fastText (a word vector computation and text classification tool open-sourced by Facebook in 2016). In this step, a text classification model is built to classify the news text data, that is, to determine the category label of each piece of news text data, where each category label indicates which company the news relates to.
As shown in fig. 4, the classification step may include the following sub-steps:
[2.1] the training phase may include:
A) Consider the training set xd = (t_1, t_2, ..., t_i, ...), where i is a positive integer and t_i denotes a text item after data preprocessing, t_i = (label word_1 word_2 ... word_j ... word_n), where j is a positive integer, label is the company label, and word_j denotes a Chinese word. The text data in the training set xd are input into the text classification model built with fastText for training. Through word-vector learning and one-hot encoding, each training sample t_i in xd becomes w_i = (d_1, d_2, ..., d_n) together with y_i ∈ R^dw, where d_i ∈ R^dn is the dn-dimensional word vector of the segmented word word_i, y_i is a dw-dimensional vector, dw is the number of companies to be classified, each dimension of the dw-dimensional vector corresponds to one company, and y_i is constructed by setting the dimension of the corresponding company to one and all other dimensions to zero.
B) The processed training data w_i = (d_1, d_2, ..., d_n) can be fed into the shallow neural network built by the fastText text classification model, and the final vector representation v(t_i) ∈ R^dw of the news text data is obtained by forward propagation.
C) v(t_i) can be passed through the softmax activation function to obtain a probability vector u(t_i) ∈ R^dw.
D) u(t_i) and y_i yield a value through the cross-entropy loss function, which can be used to adjust the parameters of the text classification model by backpropagation.
E) The above operations may be iterated until the set iteration threshold is reached.
F) Through the above steps, the initial classifier c is trained on the training set xd.
In the training stage, a text classification model is constructed with fastText, the text data in the training set are input into the text classification model, and an initial classifier is obtained through training. The trained classifier is an important component of the text classification model constructed herein.
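A minimal training sketch with the open-source fasttext Python package; the hyperparameter values shown are assumptions used only for illustration:

```python
import fasttext

# train.txt holds one '__label__<company> word_1 word_2 ...' line per news item
model = fasttext.train_supervised(
    input="train.txt",
    lr=0.5,          # learning rate (assumed value)
    epoch=25,        # passes over the training set (assumed value)
    loss="softmax",  # cross-entropy over the softmax output, as in the description
    wordNgrams=2,
)
model.save_model("classifier_initial.bin")
```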
[2.2] the testing phase may include:
A) Consider the test set cd = (t_1, t_2, ..., t_q, ...), where q is a positive integer and t_q denotes a text item after data preprocessing, t_q = (label word_1 word_2 ... word_p ... word_n), where label is the company label, p is a positive integer, and word_p denotes a Chinese word. The text data of the test set can be input into the text classification model built with fastText; through word-vector learning, each test sample t_q in cd becomes w_i = (d_1, d_2, ..., d_n), where d_i ∈ R^dn is the dn-dimensional word vector of the segmented word word_i.
B) w_i can be input into the classifier c to obtain a probability vector u(t_i) ∈ R^dw.
C) The largest value can be selected from u(t_i) and mapped to the corresponding company label, giving label_predict.
D) The test accuracy can be judged by checking whether the class label carried by the test data is consistent with the label_predict predicted by the classifier.
In the test stage, the initial classifier is used for carrying out classification test on each piece of text data in the test set, and the test accuracy is judged.
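A sketch of this test loop, assuming the test file keeps the '__label__&lt;company&gt; word ...' format built in the preprocessing step:

```python
import fasttext

model = fasttext.load_model("classifier_initial.bin")

def evaluate(test_file):
    """Predict a label for every test line and compare it with the label
    carried by the line itself, returning the test accuracy."""
    correct = total = 0
    with open(test_file, encoding="utf-8") as f:
        for line in f:
            label, _, text = line.strip().partition(" ")
            predicted = model.predict(text)[0][0]   # label with the highest probability
            correct += (predicted == label)
            total += 1
    return correct / total if total else 0.0

# Alternatively, model.test("test.txt") returns (N, precision@1, recall@1).
print("accuracy:", evaluate("test.txt"))
```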
[2.3] the parameter adjusting process may include:
A) The trained classifier can be improved by adjusting several sensitive parameters of the fastText classification model, such as the learning rate, the loss function type and the number of epochs. In deep learning, one epoch equals one pass of training over all samples in the training set.
B) Keeping the loss function type and the epoch parameter unchanged, classifiers c_i are trained with different learning rates, and a better learning rate parameter alpha is selected according to the classifiers' test accuracy.
C) With the learning rate set to alpha and the loss function type unchanged, classifiers c_i are trained with different epoch counts, and a better epoch parameter n is selected according to the classifiers' test accuracy.
D) With the learning rate set to alpha and the epoch count set to n, classifiers c_i are trained with different loss function types, and a better loss function type x is selected according to the classifiers' test accuracy.
E) With the learning rate set to alpha, the epoch count set to n and the loss type set to x, the final classifier c_final is trained.
In the parameter adjusting process, the final classifier is obtained through training by repeatedly adjusting parameters of the text classification model.
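The coordinate-wise parameter search above can be sketched as follows; the candidate grids are assumptions, not values taken from the patent:

```python
import fasttext

def accuracy(model, test_file="test.txt"):
    return model.test(test_file)[1]   # precision@1 equals accuracy when each line has one label

best = {"lr": 0.1, "epoch": 5, "loss": "softmax"}

# Step 1: sweep the learning rate with epoch and loss fixed
best["lr"] = max([0.05, 0.1, 0.5, 1.0],
                 key=lambda lr: accuracy(fasttext.train_supervised(
                     "train.txt", lr=lr, epoch=best["epoch"], loss=best["loss"])))
# Step 2: sweep the epoch count with the chosen learning rate
best["epoch"] = max([5, 10, 25, 50],
                    key=lambda ep: accuracy(fasttext.train_supervised(
                        "train.txt", lr=best["lr"], epoch=ep, loss=best["loss"])))
# Step 3: sweep the loss type with learning rate and epoch fixed
best["loss"] = max(["softmax", "ns", "hs"],
                   key=lambda ls: accuracy(fasttext.train_supervised(
                       "train.txt", lr=best["lr"], epoch=best["epoch"], loss=ls)))

c_final = fasttext.train_supervised("train.txt", **best)
c_final.save_model("classifier_final.bin")
```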
[2.4] the data screening stage may include:
A) The test set cd = (t_1, t_2, ...) is divided into different test subsets by company: cd_1, cd_2, cd_3, ..., cd_n, where n denotes the number of companies to be classified.
B) The text data of each test subset are classified with the final classifier c_final to obtain predicted labels, giving the per-subset test accuracies p_1, p_2, ..., p_n.
C) A threshold p may be set, and the companies whose accuracy is less than p are screened out to construct the list L.
In the data screening stage, the companies whose classification accuracy on the test set, as measured with the final classifier, falls below the preset threshold are screened out to construct the list L.
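A sketch of the per-company screening that builds the list L; the threshold value is an assumed placeholder:

```python
from collections import defaultdict
import fasttext

model = fasttext.load_model("classifier_final.bin")
THRESHOLD = 0.9   # assumed value of the accuracy threshold p

def build_low_accuracy_list(test_file):
    """Split the test set into per-company subsets by label, measure the
    accuracy of the final classifier on each subset, and collect the
    companies whose accuracy falls below the threshold into the list L."""
    stats = defaultdict(lambda: [0, 0])          # label -> [correct, total]
    with open(test_file, encoding="utf-8") as f:
        for line in f:
            label, _, text = line.strip().partition(" ")
            predicted = model.predict(text)[0][0]
            stats[label][0] += (predicted == label)
            stats[label][1] += 1
    return [label.replace("__label__", "")
            for label, (ok, n) in stats.items() if n and ok / n < THRESHOLD]

L = build_low_accuracy_list("test.txt")
```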
(3) Clustering step.
Herein, the clustering step may use the KMeans (k-means clustering) implementation provided by the open-source scikit-learn (sklearn) package, or another clustering algorithm.
As shown in fig. 5, the method can be divided into the following steps:
[3.1] clustered data preparation phase
A) According to the screened list L, the identification information of the related companies, such as stock codes, is used to pick out each piece of text data of each listed company from the full set of news text data that was preliminarily processed in the preprocessing step, and the results are stored as text files, for example one TXT file per company.
B) Each text file can then be segmented with jieba on the basis of the dedicated dictionary and stop word list built in the data preprocessing stage.
C) Word frequency statistics can be computed for each text file, several keywords can be selected manually from the statistics, and a keyword dictionary is built for each company.
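A possible word-frequency routine for building the keyword candidates of one company (the cut-offs are assumptions):

```python
from collections import Counter
import jieba

def top_keywords(txt_path, k=50):
    """Count word frequencies over one company's news file and return the
    k most frequent multi-character words as candidates for manual review."""
    counter = Counter()
    with open(txt_path, encoding="utf-8") as f:
        for line in f:
            counter.update(w for w in jieba.lcut(line.strip()) if len(w) > 1)
    return counter.most_common(k)

# The returned candidates are then pruned by hand into the company keyword dictionary.
```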
[3.2] clustering phase
A) Each piece of news text data of each company prepared in the previous stage is converted into a multi-dimensional vector, for example a 768-dimensional vector, using a word-vector or pretrained language model, for example the BERT language model proposed by Google.
B) The multi-dimensional vectors associated with each company are clustered with KMeans into several clusters, for example 5 large clusters.
C) Word frequency statistics over the company keywords are then computed for each cluster, and the clusters can be divided into two groups, one used as positive samples and the other as negative samples.
D) A binary classification model is trained on the clustering result and tested, yielding the required binary classification model. The trained binary classification model is an important component of the text classification model constructed herein.
E) Companies that still do not meet the classification performance requirements may be processed further.
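A condensed sketch of this clustering stage, assuming the Chinese BERT checkpoint bert-base-chinese from the transformers library, [CLS]-vector pooling, scikit-learn KMeans with 5 clusters, and a simple keyword-frequency rule for the positive/negative split; all of these concrete choices are assumptions beyond what the text specifies:

```python
import numpy as np
import torch
from transformers import BertTokenizer, BertModel
from sklearn.cluster import KMeans

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
bert = BertModel.from_pretrained("bert-base-chinese")

def embed(texts):
    """Turn each news text into one 768-dimensional vector ([CLS] pooling)."""
    vectors = []
    with torch.no_grad():
        for t in texts:
            inputs = tokenizer(t, truncation=True, max_length=512, return_tensors="pt")
            cls = bert(**inputs).last_hidden_state[:, 0, :]   # [CLS] token vector
            vectors.append(cls.squeeze(0).numpy())
    return np.stack(vectors)

def split_clusters(texts, keywords, n_clusters=5):
    """Cluster one company's news and mark each cluster as positive or
    negative according to how often the company keywords occur in it."""
    X = embed(texts)
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(X)
    positive, negative = [], []
    for c in range(n_clusters):
        members = [t for t, l in zip(texts, labels) if l == c]
        hits = sum(t.count(k) for t in members for k in keywords)
        # assumed rule: clusters averaging at least one keyword hit per article are positive
        (positive if hits / max(len(members), 1) >= 1 else negative).extend(members)
    return positive, negative   # training data for the binary classification model
```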
As mentioned above, the multi-classification model training is completed through three steps of data preprocessing, classification and clustering.
The multi-classification model trained by this method can handle the ever-growing volume of news data and achieve high-quality, high-efficiency news classification. Compared with traditional machine learning methods, this scheme allows the classification performance to keep improving as data accumulates; compared with deep-model methods, it depends less on large amounts of training data.
Referring to fig. 6, an embodiment of the present invention further provides a multi-classification model training system, which includes:
a data preprocessing module 61, configured to: perform preliminary processing on collected news data about a plurality of companies, balance the text data obtained from the preliminary processing, divide the balanced text data into two subsets used respectively as a training set and a test set, perform word segmentation on each piece of text data in the two subsets, and construct each segmented piece of data into a format with a category label, wherein different category labels represent different companies;
a classification module 62, configured to: construct a text classification model and train it on the text data in the training set to obtain a final classifier; divide the text data in the test set into a plurality of test subsets according to the category labels, test each test subset with the final classifier, and screen out the companies whose test accuracy is below a threshold to construct a list;
a clustering module 63, configured to: find each piece of text data of each company in the list from the preliminarily processed text data of the preprocessing step, convert each piece of text data into a vector, cluster the resulting vectors, and train a binary classification model.
In some embodiments, the data preprocessing module 61 includes sub-modules such as a data acquisition module, a preprocessing module, and a word segmentation module; the data preprocessing module 61 may be specifically configured to:
collecting news data related to different companies by using a data acquisition module;
extracting keywords and related news contents from each piece of collected news data by using a pre-processing module, storing the keywords and the related news contents as text data, and removing non-text parts in the text data, wherein the keywords comprise company identification information; carrying out equalization processing on the obtained text data based on the company identification information, and dividing the text data obtained after the equalization processing into two subsets without intersection to be respectively used as a training set and a test set;
constructing a stop word list and a special dictionary by utilizing a Chinese word segmentation module, segmenting each piece of text data in a training set and a test set by combining a jieba Chinese word segmentation algorithm, and intercepting the text data with the length exceeding a preset value; and constructing each piece of data after word segmentation into a format with a category label, wherein different category labels represent different companies so as to be suitable for a fast text classification algorithm fasttext.
In some embodiments, the classification module 62 is specifically configured to:
constructing a text classification model by using fasttext, inputting text data in a training set into the text classification model, and training to obtain an initial classifier; classifying and testing each piece of text data in the test set by using the initial classifier, and judging the test accuracy; training to obtain a final classifier by repeatedly adjusting parameters of the text classification model;
dividing the text data in the test set into a plurality of test subsets according to the category labels, testing the category of each test subset by using a final classifier, and screening out companies with the test accuracy smaller than a threshold value to construct a list.
In some embodiments, the clustering module 63 is specifically configured to:
finding out each piece of text data of each company in the list from the text data subjected to preliminary processing in the preprocessing step, performing word segmentation, performing word frequency statistics, and constructing a keyword dictionary of each company by combining manual intervention;
converting each piece of text data after word segmentation in the last step into a multi-dimensional vector, clustering the multi-dimensional vectors related to each company, carrying out word frequency statistics according to keywords, dividing a plurality of clusters obtained by clustering into two samples, wherein the two samples are a positive sample and a negative sample, and training to obtain a binary classification model.
Referring to fig. 7, an embodiment of the present invention further provides a computer device 70, including a processor 71 and a memory 72, where the memory 72 stores a program, and the program includes computer execution instructions, and when the computer device 70 runs, the processor 71 executes the computer execution instructions stored in the memory 72, so as to enable the computer device 70 to execute the multi-classification model training method as described above.
Embodiments of the present invention also provide a computer readable storage medium storing one or more programs, the one or more programs comprising computer executable instructions, which when executed by a computer device, cause the computer device to perform a multi-classification model training method as described above.
In the above embodiments, the descriptions of the respective embodiments have respective emphasis, and for parts not described in detail in a certain embodiment, reference may be made to the related descriptions of other embodiments.
The above embodiments are only used to illustrate the technical solution of the present invention, and not to limit the same; those of ordinary skill in the art will understand that: the technical solutions described in the above embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (10)

1. A multi-classification model training method is characterized by comprising the following steps:
a data preprocessing step: performing preliminary processing on collected news data about a plurality of companies, balancing the text data obtained from the preliminary processing, dividing the balanced text data into two subsets used respectively as a training set and a test set, performing word segmentation on each piece of text data in the two subsets, and constructing each segmented piece of text data into a format with a category label, wherein different category labels represent different companies;
a classification step: constructing a text classification model and training it on the text data in the training set to obtain a final classifier; dividing the text data in the test set into a plurality of test subsets according to the category labels, testing each test subset with the final classifier, and screening out the companies whose test accuracy is below a threshold to construct a list;
a clustering step: finding each piece of text data of each company in the list from the preliminarily processed text data of the preprocessing step, converting each piece of text data into a vector, clustering the resulting vectors, and training a binary classification model.
2. The method according to claim 1, wherein the data preprocessing step comprises in particular:
collecting news data relating to different companies;
extracting keywords and related news contents from each piece of collected news data, storing the keywords and the related news contents as text data, and removing non-text parts in the text data, wherein the keywords comprise company identification information;
carrying out equalization processing on the obtained text data based on the company identification information, and dividing the text data obtained after the equalization processing into two subsets without intersection to be respectively used as a training set and a test set;
constructing a stop word list and a special dictionary, combining a jieba Chinese word segmentation algorithm, segmenting each piece of text data in a training set and a test set, and intercepting the text data with the length exceeding a preset value;
and constructing each piece of data after word segmentation into a format with a category label, wherein different category labels represent different companies so as to be suitable for a fast text classification algorithm fasttext.
3. The method according to claim 1, characterized in that said step of classifying comprises in particular:
constructing a text classification model by using fasttext, inputting text data in a training set into the text classification model, and training to obtain an initial classifier; classifying and testing each piece of text data in the test set by using the initial classifier, and judging the test accuracy; training to obtain a final classifier by repeatedly adjusting parameters of the text classification model;
dividing the text data in the test set into a plurality of test subsets according to the category labels, testing the classification of each test subset by using a final classifier, and screening out companies with the test accuracy smaller than a threshold value to construct a list.
4. The method according to claim 1, characterized in that the clustering step comprises in particular:
finding out each piece of text data of each company in the list from the text data subjected to preliminary processing in the preprocessing step, performing word segmentation, performing word frequency statistics, and constructing a keyword dictionary of each company by combining manual intervention;
converting each piece of text data after word segmentation in the last step into a multi-dimensional vector, clustering the multi-dimensional vectors related to each company, carrying out word frequency statistics according to keywords, dividing a plurality of clusters obtained by clustering into two samples, wherein the two samples are a positive sample and a negative sample, and training to obtain a binary classification model.
5. A multi-classification model training system, comprising:
a data pre-processing module, configured to: perform preliminary processing on collected news data about a plurality of companies, balance the text data obtained from the preliminary processing, divide the balanced text data into two subsets used respectively as a training set and a test set, perform word segmentation on each piece of text data in the two subsets, and construct each segmented piece of data into a format with a category label, wherein different category labels represent different companies;
a classification module, configured to: construct a text classification model and train it on the text data in the training set to obtain a final classifier; divide the text data in the test set into a plurality of test subsets according to the category labels, test each test subset with the final classifier, and screen out the companies whose test accuracy is below a threshold to construct a list;
a clustering module, configured to: find each piece of text data of each company in the list from the preliminarily processed text data of the preprocessing step, convert each piece of text data into a vector, cluster the resulting vectors, and train a binary classification model.
6. The system of claim 5, wherein the data preprocessing module is specifically configured to:
collecting news data relating to different companies;
extracting keywords and related news contents from each piece of collected news data, storing the keywords and the related news contents as text data, and removing non-text parts in the text data, wherein the keywords comprise company identification information;
carrying out equalization processing on the obtained text data based on the company identification information, and dividing the text data obtained after the equalization processing into two subsets without intersection to be respectively used as a training set and a test set;
constructing a stop word list and a special dictionary, combining a jieba Chinese word segmentation algorithm, segmenting each piece of text data in a training set and a test set, and intercepting the text data with the length exceeding a preset value;
and constructing each piece of data after word segmentation into a format with a category label, wherein different category labels represent different companies so as to be suitable for a fast text classification algorithm fasttext.
7. The system of claim 5, wherein the classification module is specifically configured to:
constructing a text classification model by using fasttext, inputting text data in a training set into the text classification model, and training to obtain an initial classifier; classifying and testing each piece of text data in the test set by using the initial classifier, and judging the test accuracy; training to obtain a final classifier by repeatedly adjusting parameters of the text classification model;
dividing the text data in the test set into a plurality of test subsets according to the category labels, testing the classification of each test subset by using a final classifier, and screening out companies with the test accuracy smaller than a threshold value to construct a list.
8. The system of claim 5, wherein the clustering module is specifically configured to:
finding out each piece of text data of each company in the list from the text data subjected to preliminary processing in the preprocessing step, performing word segmentation, performing word frequency statistics, and constructing a keyword dictionary of each company by combining manual intervention;
converting each piece of text data after word segmentation in the last step into a multi-dimensional vector, clustering the multi-dimensional vectors related to each company, carrying out word frequency statistics according to keywords, dividing a plurality of clusters obtained by clustering into two samples, wherein the two samples are a positive sample and a negative sample, and training to obtain a binary classification model.
9. A computer device comprising a processor and a memory, the memory having stored therein a program comprising computer-executable instructions that, when executed by the computer device, the processor executes the computer-executable instructions stored by the memory to cause the computer device to perform the multi-classification model training method of any one of claims 1-4.
10. A computer readable storage medium storing one or more programs, the one or more programs comprising computer executable instructions, which when executed by a computer device, cause the computer device to perform the multi-classification model training method of any of claims 1-4.
CN201911363343.3A 2019-12-26 2019-12-26 Multi-classification model training method, system and device Pending CN113051462A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911363343.3A CN113051462A (en) 2019-12-26 2019-12-26 Multi-classification model training method, system and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911363343.3A CN113051462A (en) 2019-12-26 2019-12-26 Multi-classification model training method, system and device

Publications (1)

Publication Number Publication Date
CN113051462A true CN113051462A (en) 2021-06-29

Family

ID=76505288

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911363343.3A Pending CN113051462A (en) 2019-12-26 2019-12-26 Multi-classification model training method, system and device

Country Status (1)

Country Link
CN (1) CN113051462A (en)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090171870A1 (en) * 2007-12-31 2009-07-02 Yahoo! Inc. System and method of feature selection for text classification using subspace sampling
CN109145108A (en) * 2017-06-16 2019-01-04 贵州小爱机器人科技有限公司 Classifier training method, classification method, device and computer equipment is laminated in text
CN108764296A (en) * 2018-04-28 2018-11-06 杭州电子科技大学 More sorting techniques of study combination are associated with multitask based on K-means
CN110046636A (en) * 2018-12-11 2019-07-23 阿里巴巴集团控股有限公司 Prediction technique of classifying and device, prediction model training method and device
CN109766410A (en) * 2019-01-07 2019-05-17 东华大学 A kind of newsletter archive automatic classification system based on fastText algorithm
CN110309302A (en) * 2019-05-17 2019-10-08 江苏大学 A kind of uneven file classification method and system of combination SVM and semi-supervised clustering

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113643116A (en) * 2021-08-23 2021-11-12 中远海运科技(北京)有限公司 Method for classifying companies based on financial voucher data, computer readable medium
CN113643116B (en) * 2021-08-23 2023-10-27 中远海运科技(北京)有限公司 Company classification method based on financial evidence data and computer readable medium
CN116304058A (en) * 2023-04-27 2023-06-23 云账户技术(天津)有限公司 Method and device for identifying negative information of enterprise, electronic equipment and storage medium
CN116304058B (en) * 2023-04-27 2023-08-08 云账户技术(天津)有限公司 Method and device for identifying negative information of enterprise, electronic equipment and storage medium
CN116432644A (en) * 2023-06-12 2023-07-14 南京邮电大学 News text classification method based on feature fusion and double classification
CN116432644B (en) * 2023-06-12 2023-08-15 南京邮电大学 News text classification method based on feature fusion and double classification

Similar Documents

Publication Publication Date Title
CN111414479B (en) Label extraction method based on short text clustering technology
WO2019214245A1 (en) Information pushing method and apparatus, and terminal device and storage medium
US7711673B1 (en) Automatic charset detection using SIM algorithm with charset grouping
CN113051462A (en) Multi-classification model training method, system and device
CN108199951A (en) A kind of rubbish mail filtering method based on more algorithm fusion models
CN107168956B (en) Chinese chapter structure analysis method and system based on pipeline
CN110910175B (en) Image generation method for travel ticket product
CN104881458A (en) Labeling method and device for web page topics
US20200184280A1 (en) Differential classification using multiple neural networks
CN104361037A (en) Microblog classifying method and device
CN113590764B (en) Training sample construction method and device, electronic equipment and storage medium
TWI828928B (en) Highly scalable, multi-label text classification methods and devices
CN115858474A (en) AIGC-based file arrangement system
CN114722198A (en) Method, system and related device for determining product classification code
CN115146062A (en) Intelligent event analysis method and system fusing expert recommendation and text clustering
CN113010705B (en) Label prediction method, device, equipment and storage medium
Hussain et al. Design and analysis of news category predictor
CN116204610B (en) Data mining method and device based on named entity recognition of report capable of being ground
CN110175288B (en) Method and system for filtering character and image data for teenager group
CN109543049B (en) Method and system for automatically pushing materials according to writing characteristics
CN107368464B (en) Method and device for acquiring bidding product information
CN115329754A (en) Text theme extraction method, device and equipment and storage medium
CN110888977A (en) Text classification method and device, computer equipment and storage medium
CN111125345B (en) Data application method and device
CN109597879B (en) Service behavior relation extraction method and device based on 'citation relation' data

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination