CN111611379A - Text information classification method, device, equipment and readable storage medium - Google Patents

Text information classification method, device, equipment and readable storage medium Download PDF

Info

Publication number
CN111611379A
CN111611379A
Authority
CN
China
Prior art keywords
classified
classification
information
word
corpus
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010420260.XA
Other languages
Chinese (zh)
Inventor
朱菁
潘斌强
李霁
张俊
杨建明
毛瑞彬
钱铁云
李旭晖
陈壮
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
SHENZHEN SECURITIES INFORMATION CO Ltd
Original Assignee
SHENZHEN SECURITIES INFORMATION CO Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by SHENZHEN SECURITIES INFORMATION CO Ltd filed Critical SHENZHEN SECURITIES INFORMATION CO Ltd
Priority to CN202010420260.XA priority Critical patent/CN111611379A/en
Publication of CN111611379A publication Critical patent/CN111611379A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities

Abstract

The invention discloses a text information classification method, which comprises the following steps: obtaining a corpus to be classified, and filtering the corpus to be classified to obtain information to be classified; performing feature extraction processing on the information to be classified to obtain topic features and word features corresponding to the information to be classified; and performing a classification operation using the topic features and the word features to obtain a classification result corresponding to the corpus to be classified. Because both the word features and the topic features are extracted after the corpus to be classified is filtered, and both are used in the classification operation, the complementary relevance of word-level features and topic features is fully considered: the classification accounts for the influence of the word-level features as well as the topic features, which improves classification accuracy. In addition, the invention provides a text information classification device, equipment, and a computer-readable storage medium, which have the same beneficial effects.

Description

Text information classification method, device, equipment and readable storage medium
Technical Field
The present invention relates to the field of text information classification technology, and in particular, to a text information classification method, a text information classification device, a text information classification apparatus, and a computer-readable storage medium.
Background
Natural language information classification technology is used to classify information, such as articles composed of natural language text, to facilitate reading by a user or statistical analysis. Existing natural language text classification techniques include rule matching methods and machine learning methods. Machine learning methods are widely used because they do not require designing matching rules, which eliminates a large amount of manual work. Existing machine learning methods perform word-level feature extraction on natural language information and classify it according to the extracted features. However, since some words or phrases can be used in many different scenarios, classifying only according to word-level features leads to low classification accuracy.
Therefore, how to solve the problem of low classification accuracy in the prior art is a technical problem to be solved by those skilled in the art.
Disclosure of Invention
In view of the above, the present invention provides a text information classification method, a text information classification device, a text information classification apparatus, and a computer-readable storage medium, which solve the problem of low classification accuracy in the prior art.
In order to solve the technical problem, the invention provides a text information classification method, which comprises the following steps:
obtaining linguistic data to be classified, and filtering the linguistic data to be classified to obtain information to be classified;
performing feature extraction processing on the information to be classified to obtain subject features and word features corresponding to the information to be classified;
and performing classification operation by using the theme characteristics and the word characteristics to obtain a classification result corresponding to the corpus to be classified.
Optionally, the performing feature extraction processing on the information to be classified to obtain the topic features and the word features corresponding to the information to be classified includes:
performing character-level feature extraction on the information to be classified by using a pre-training model to obtain the word features corresponding to the information to be classified;
and performing theme generation processing on the information to be classified by using a theme generation model to obtain the theme characteristics.
Optionally, the performing a classification operation by using the topic feature and the word feature to obtain a classification result corresponding to the corpus to be classified includes:
generating a fusion vector by using the word features and the theme features, and performing maximum pooling on the fusion vector to obtain a category distribution vector;
and cross-multiplying the transformation matrix and the category distribution vector to obtain the classification result.
Optionally, after obtaining the subject feature and the word feature corresponding to the information to be classified, the method further includes:
performing element extraction processing on each word feature to obtain an element group corresponding to the corpus to be classified;
and outputting the classification result and the element group.
Optionally, the filtering the corpus to be classified to obtain information to be classified includes:
and acquiring a filtering dictionary, and performing stop word filtering processing and punctuation filtering processing on the corpus to be classified by using the filtering dictionary to obtain the information to be classified.
Optionally, before the obtaining the corpus to be classified, the method further includes:
acquiring an original corpus, and constructing a training set by using the original corpus;
and acquiring an initial classification model, and training the initial classification model by using the training set to obtain a classification model.
Optionally, the training the initial classification model by using the training set to obtain a classification model includes:
inputting the training set into the initial classification model;
calculating a classification loss value corresponding to each training corpus in the training set;
adjusting parameters in the initial classification model according to the classification loss value;
and counting performance indexes corresponding to the initial classification model, and determining the initial classification model as the classification model when the performance indexes reach a preset threshold value.
The invention also provides a text information classification device, comprising:
the system comprises an acquisition module, a classification module and a classification module, wherein the acquisition module is used for acquiring linguistic data to be classified and filtering the linguistic data to be classified to obtain information to be classified;
the characteristic extraction module is used for carrying out characteristic extraction processing on the information to be classified to obtain subject characteristics and word characteristics corresponding to the information to be classified;
and the classification module is used for performing classification operation by using the theme characteristics and the word characteristics to obtain a classification result corresponding to the corpus to be classified.
The invention also provides text information classification equipment, which comprises a memory and a processor, wherein:
the memory is used for storing a computer program;
the processor is configured to execute the computer program to implement the text information classification method.
The present invention also provides a computer-readable storage medium for storing a computer program, wherein the computer program, when executed by a processor, implements the above-described text information classification method.
The text information classification method provided by the invention obtains the linguistic data to be classified, and filters the linguistic data to be classified to obtain the information to be classified; carrying out feature extraction processing on the information to be classified to obtain subject features and word features corresponding to the information to be classified; and performing classification operation by using the subject characteristics and the word characteristics to obtain a classification result corresponding to the corpus to be classified.
Therefore, after the corpus to be classified is filtered to obtain the information to be classified, the method not only extracts the word features of the information to be classified, but also extracts the theme features corresponding to the information to be classified. The topic features of the information to be classified are used for representing the topics of the whole corpus to be classified, and the complementary relevance of the word-level features and the topic features can be fully considered by extracting the word features and the topic features and utilizing the word features and the topic features to perform classification operation, so that the influence of the word-level features and the topic features can be considered during classification, the accuracy of classification is improved, and the problem of low classification accuracy of the existing machine learning method is solved.
In addition, the invention also provides a text information classification device, text information classification equipment and a computer readable storage medium, and the text information classification device, the text information classification equipment and the computer readable storage medium also have the beneficial effects.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art are briefly described below. It is obvious that the drawings in the following description show only some embodiments of the present invention, and that those skilled in the art can obtain other drawings from the provided drawings without creative effort.
Fig. 1 is a flowchart of a text information classification method according to an embodiment of the present invention;
fig. 2 is a flowchart of a specific text information classification method according to an embodiment of the present invention;
fig. 3 is a schematic structural diagram of a text information classification apparatus according to an embodiment of the present invention;
fig. 4 is a schematic structural diagram of a text information classification device according to an embodiment of the present invention;
FIG. 5 is a classification model obtaining process according to an embodiment of the present invention;
fig. 6 is a flowchart of another specific text information classification method according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention are described below clearly and completely with reference to the drawings in the embodiments of the present invention. It is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of them. All other embodiments that can be derived by a person skilled in the art from the embodiments given herein without creative effort shall fall within the protection scope of the present invention.
Referring to fig. 1, fig. 1 is a flowchart of a text information classification method according to an embodiment of the present invention. The method comprises the following steps:
s101: and obtaining the linguistic data to be classified, and filtering the linguistic data to be classified to obtain information to be classified.
In the embodiment of the present invention, some or all of the steps in the text information classification method may be executed by a specific device or terminal, and the specific type of the device or terminal is not limited, for example, the device or terminal may be a specific Windows system computer or a specific server.
The corpus to be classified is the text information to be classified. In statistical natural language processing, language use in the real world cannot be observed at scale, so text, and context within text, is used as a substitute for real-world language context. A collection of text is called a corpus (Corpus), which includes a plurality of corpora; when there are several such collections of text, they are called corpus collections (Corpora). The specific content and type of the corpus to be classified are not limited in this embodiment; it may be, for example, news information, such as financial news information or sports news information, or teaching material information, such as Chinese teaching material information or mathematics teaching material information.
And after the linguistic data to be classified are obtained, filtering the linguistic data to be classified so as to obtain information to be classified. The filtering process is used for removing invalid information in the corpus to be classified, such as punctuation marks, stop words, common words and the like, and after the filtering process, the valid information for classification in the corpus to be classified, namely the information to be classified, can be obtained. It should be noted that the information to be classified includes at least two word-level information, and the specific number of the information is not limited in this embodiment.
S102: and performing feature extraction processing on the information to be classified to obtain the subject feature and the word feature corresponding to the information to be classified.
In order to improve the classification accuracy, when feature extraction processing is performed on information to be classified, word-level feature extraction processing and topic feature extraction processing need to be performed. The word-level feature extraction processing may also be referred to as word-level feature extraction processing, and is used to generate word features (or referred to as word features) corresponding to information to be classified. The topic feature extraction processing may also be referred to as topic feature generation processing, and is used to extract topic features corresponding to the information to be classified. The word characteristics are used for reflecting the specific characteristics of the information content to be classified, the theme characteristics are used for reflecting the theme corresponding to the information to be classified, and the influence of the word-level characteristics and the theme characteristics can be considered during classification by utilizing the complementary relevance between the word characteristics and the theme characteristics, so that the classification accuracy is improved.
In this embodiment, the specific content and the number of the word features vary according to the variation of the information to be classified, for example, when different information to be classified has different words, the specific content of the corresponding word features is naturally different, and when the number of words in different information to be classified is different, the number of the corresponding word features is also different. Correspondingly, the number of the theme features is only one, but the content of the theme features corresponds to the information to be classified, the content of the information to be classified is different, and the corresponding theme features may be the same or different.
The embodiment does not specifically limit what method is used for feature extraction, and for example, the analysis program may be used for feature extraction, or the convolutional neural network or other neural network model may be used for feature extraction of information to be classified. For example, two different convolutional neural networks may be trained, which are respectively used for extracting word features and extracting topic features; or a convolutional neural network can be used for simultaneously extracting word features and theme features of information to be classified. The specific process of the feature extraction processing differs according to the adopted feature extraction method, and reference may be made to the related art.
S103: and performing classification operation by using the subject characteristics and the word characteristics to obtain a classification result corresponding to the corpus to be classified.
After the subject feature and the word feature are obtained, classifying operation is performed by using the subject feature and the word feature, that is, classifying information to be classified by using the subject feature and the word feature, so that a classification result corresponding to the information to be classified can be obtained, and the classification result corresponding to the information to be classified is a classification result corresponding to the corpus to be classified.
Specifically, a convolutional neural network or other neural network models can be used for classification, that is, the topic features and the word features are input into the convolutional neural network, and are classified by using the convolutional neural network, so that a classification result is finally obtained. In order to improve the uniformity of feature extraction and classification, a neural network model can be used to perform all operations in steps S102 and S103, that is, feature extraction processing is performed on information to be classified first, and classification is performed according to subject features and word features after extraction, so as to obtain a classification result finally.
By applying the text information classification method provided by the embodiment of the invention, after the corpus to be classified is filtered to obtain the information to be classified, not only are the word characteristics of the information to be classified extracted, but also the topic characteristics corresponding to the information to be classified are extracted. The topic features of the information to be classified are used for representing the topics of the whole corpus to be classified, and the complementary relevance of the word-level features and the topic features can be fully considered by extracting the word features and the topic features and utilizing the word features and the topic features to perform classification operation, so that the influence of the word-level features and the topic features can be considered during classification, the accuracy of classification is improved, and the problem of low classification accuracy of the existing machine learning method is solved.
Based on the foregoing embodiment of the present invention, in a possible implementation manner, a classification model is used to classify corpora to be classified, please refer to fig. 2, and fig. 2 is a flowchart of a specific text information classification method provided in an embodiment of the present invention, including:
s201: and acquiring an original corpus, and constructing a training set by using the original corpus.
Before the corpus to be classified is classified by using the classification model, the corpus to be classified needs to be trained. It should be noted that the number of the original corpora is plural, and the original corpora is specifically a labeled text corpus, which is used for constructing a training set so as to generate a classification model. The annotation process may be manual annotation. Specifically, when labeling is performed, the text corpus can be labeled according to an existing mature system. The mature system is a category system that is well defined in the industry and used for representing specific categories of text corpora, and the category system corresponds to the types of the text corpora. For example, when the text corpus is financial news information, the mature system is a financial news information category system, which specifically includes five categories, i.e., top management change, listing and returning, procurement reorganization, investment financing and no event, wherein the top management change can be further divided into top management vocabularies, top management stop, top management employment and director change, listing and returning can be further divided into listing and listing, ending listing, IPO listing and purchasing, procurement reorganization can be further divided into debt, asset earning, right of stock procurement and purchasing, and investment financing can be further divided into short-term financing, delegation of financing, right of stock investment and information allocation. The original corpus should be the same kind as the corpus to be classified, such as financial news information or sports news information.
After the original corpus is obtained, a training set is constructed from it in order to train the initial classification model. The training set may include all of the original corpora, or it may include only a portion of them, with the remaining original corpora used to construct a test set or a validation set. For example, the original corpora can be denoted D_all = {D_1, ..., D_N}, where N is the number of original corpora, and split in an 8:1:1 ratio into a training set D_train, a test set D_test, and a validation set D_dev.
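The 8:1:1 split of D_all into D_train, D_test, and D_dev described above can be sketched as follows (a minimal illustration; the helper name `split_corpus` and the fixed seed are assumptions, not from the patent):

```python
import random

def split_corpus(corpora, ratios=(8, 1, 1), seed=42):
    """Shuffle the original corpora D_all and split them (by default 8:1:1)
    into a training set, a test set, and a validation set."""
    items = list(corpora)
    random.Random(seed).shuffle(items)
    total = sum(ratios)
    n_train = len(items) * ratios[0] // total
    n_test = len(items) * ratios[1] // total
    d_train = items[:n_train]
    d_test = items[n_train:n_train + n_test]
    d_dev = items[n_train + n_test:]
    return d_train, d_test, d_dev

d_all = [f"doc_{i}" for i in range(100)]   # N = 100 labeled original corpora
d_train, d_test, d_dev = split_corpus(d_all)
```

Every original corpus lands in exactly one of the three sets, so no labeled example is lost or duplicated.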
S202: and acquiring an initial classification model, and training the initial classification model by using a training set to obtain a classification model.
After the training set is obtained, the initial classification model is trained with it to obtain the classification model; the classification model is the trained initial classification model. During training, a classification loss value needs to be computed so that the initial classification model can be adjusted according to it. Specifically, S202 may include the following steps:
s2021: the training set is input into the initial classification model.
This embodiment does not limit the specific way the training set is input into the initial classification model; the training corpora in the training set may be input sequentially, one at a time, or simultaneously. After acquiring a training corpus, the initial classification model performs feature extraction processing on it to obtain the corresponding word-level features and topic features. Specifically, any corpus may be represented as D = {S_1, ..., S_m}, where S denotes a sentence and m is the number of sentences. Each sentence may be represented as S = {w_1, ..., w_n}, where w denotes a word or character and n is the number of words or characters.
After the training set is input into the initial classification model, word feature extraction, topic feature extraction, and classification are performed on it. The specific extraction and classification method is the same as the processing of the corpus to be classified, and is described in detail in the following steps of this embodiment.
S2022: and calculating the classification loss value corresponding to each training corpus in the training set.
After classification is finished, a training classification result is obtained, and the cross entropy between the training classification result and the annotated label is calculated to obtain the class classification loss L_e, which is determined as the classification loss value. Specifically:

L_e = -Σ_j Y_e^(j) · log(P_e^(j))

where j ranges over the different classes (each class in the category hierarchy), Y_e is the true label, i.e. the annotated label, of the event corresponding to the input sentence S, and P_e is the training classification result.
Further, in another embodiment, while obtaining the classification result, an element group corresponding to the text corpus may also be obtained in order to better reflect its content; the element group is obtained through element extraction processing of some or all of the word features. Accordingly, during training, after the element extraction process finishes, an element extraction loss can be calculated to correct the initial classification model. Specifically, after the element group corresponding to the training corpus is obtained, the cross entropy between it and the annotated element group is calculated, giving the element extraction loss L_a:

L_a = -Σ_i Σ_j Y_a^(ij) · log(P_a^(ij))

where i ranges over each word or character, j over the different classes, and Y_a is the true label of each character with respect to the elements. After the class classification loss and the element extraction loss are obtained, they are added to give the final loss, i.e. the classification loss value L:

L = L_e + L_a
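The class classification loss L_e, the element extraction loss L_a, and their sum L can be computed as in this minimal numpy sketch (the example labels and predictions are invented for illustration; both losses are cross entropies summed over label entries, in line with the formulas above):

```python
import numpy as np

def cross_entropy(y_true, p_pred, eps=1e-12):
    """Cross entropy between one-hot labels and a predicted distribution,
    summed over all label entries (classes, and tokens where present)."""
    p = np.clip(p_pred, eps, 1.0)        # guard against log(0)
    return float(-np.sum(y_true * np.log(p)))

# Class classification loss L_e: one label distribution for the sentence.
Y_e = np.array([0.0, 1.0, 0.0, 0.0, 0.0])      # annotated class (one-hot)
P_e = np.array([0.1, 0.7, 0.1, 0.05, 0.05])    # model output distribution
L_e = cross_entropy(Y_e, P_e)

# Element extraction loss L_a: one label per token (rows) over classes (cols).
Y_a = np.array([[1, 0], [0, 1], [1, 0]], dtype=float)
P_a = np.array([[0.8, 0.2], [0.3, 0.7], [0.9, 0.1]])
L_a = cross_entropy(Y_a, P_a)

# Final training loss: the sum of the two.
L = L_e + L_a
```
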
s2023: and adjusting parameters in the initial classification model according to the classification loss value.
In this embodiment, the Adam algorithm is used to back-propagate the gradient and calculate updated parameter values; the parameters are then updated with the calculated values, completing the adjustment of the parameters in the initial classification model.
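A single Adam update step, as used here to adjust a parameter from the gradient of the classification loss, can be sketched in numpy (the hyperparameter values are the common defaults, which is an assumption; the patent does not specify them):

```python
import numpy as np

def adam_step(param, grad, state, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update. `state` holds the running first/second moments and
    the step counter, and is mutated in place."""
    state["t"] += 1
    state["m"] = beta1 * state["m"] + (1 - beta1) * grad        # first moment
    state["v"] = beta2 * state["v"] + (1 - beta2) * grad ** 2   # second moment
    m_hat = state["m"] / (1 - beta1 ** state["t"])              # bias correction
    v_hat = state["v"] / (1 - beta2 ** state["t"])
    return param - lr * m_hat / (np.sqrt(v_hat) + eps)

w = np.array([0.5, -0.3])                                   # a model parameter
state = {"m": np.zeros_like(w), "v": np.zeros_like(w), "t": 0}
g = np.array([0.1, -0.2])            # gradient of the classification loss L
w = adam_step(w, g, state)
```

On the first step the bias-corrected update reduces to roughly lr times the sign of the gradient, which is why Adam's initial steps are well scaled regardless of gradient magnitude.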
S2024: and counting the performance indexes corresponding to the initial classification model, and determining the initial classification model as the classification model when the performance indexes reach a preset threshold value.
After each round of training, the model is evaluated on the validation set D_dev. The performance indexes may include one or more of, for example: precision, recall, and the macro-averaged F1 value. Each performance index corresponds to a respective preset threshold, and when every performance index reaches its preset threshold, the initial classification model is determined to be the classification model.
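The macro-averaged F1 value mentioned above is the per-class F1, computed one-vs-rest, averaged over classes; a minimal sketch (the toy labels are invented for illustration):

```python
import numpy as np

def macro_f1(y_true, y_pred, n_classes):
    """Macro-averaged F1: per-class F1 computed one-vs-rest, then averaged
    with equal weight per class."""
    f1s = []
    for c in range(n_classes):
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        f1s.append(f1)
    return float(np.mean(f1s))

y_true = np.array([0, 0, 1, 1, 2, 2])   # annotated labels on D_dev
y_pred = np.array([0, 1, 1, 1, 2, 0])   # model predictions
score = macro_f1(y_true, y_pred, n_classes=3)
```

Unlike micro-averaging, the macro average weights rare and frequent classes equally, which suits imbalanced category systems such as financial event types.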
Further, in another embodiment, the performance indexes corresponding to the initial classification models obtained after each round of training are saved, and after a preset number of rounds of training, the performance indexes of the initial classification models are compared, and the initial classification model with the best performance index is determined as the classification model, so that the determination of the classification model is completed.
S203: and acquiring a filtering dictionary, and performing stop word filtering processing and punctuation filtering processing on the linguistic data to be classified by using the filtering dictionary to obtain information to be classified.
After the classification model is obtained, the corpus to be classified can be processed. Specifically, after the corpus to be classified is obtained, it is filtered using a filtering dictionary. In this embodiment, the stop word and punctuation sets provided by NLTK are used to preprocess financial news texts, filtering out irrelevant information and keeping only the content to be analyzed, i.e., the information to be classified.
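The stop word and punctuation filtering can be sketched as follows; the miniature filtering dictionary here is a stand-in for the NLTK stop word and punctuation sets named in the embodiment:

```python
import string

# Hypothetical miniature filtering dictionary; in the embodiment this would be
# the stop word and punctuation sets provided by NLTK.
STOP_WORDS = {"the", "a", "of", "to", "is"}
PUNCTUATION = set(string.punctuation)

def filter_corpus(tokens):
    """Drop stop words and punctuation, keeping only the information
    to be classified."""
    return [t for t in tokens
            if t.lower() not in STOP_WORDS and t not in PUNCTUATION]

tokens = ["The", "company", "announced", "a", "merger", ",", "shares", "rose", "."]
info = filter_corpus(tokens)
```
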
S204: and performing character-level feature extraction on the information to be classified by using the pre-training model to obtain word features corresponding to the information to be classified.
It should be noted that the classification model in this embodiment includes a pre-training model and a topic generation model, where the pre-training model is used to extract word features, that is, character-level features. The pre-training model is BERT (Bidirectional Encoder Representations from Transformers), a model pre-trained on a very-large-scale unsupervised corpus with two tasks, random masked word prediction and next sentence prediction. It converts a text into a corresponding low-dimensional embedded representation that carries linguistic knowledge transferred from the large external corpus; the representations obtained after conversion can be called word vectors. Compared with traditional methods such as Word2Vec, the BERT model produces higher-quality, semantically richer vectors, and all parameters in the BERT model can change adaptively and dynamically during fine-tuning on the current data set.
In the embodiment of the invention, the Chinese BERT-base pre-training model published by Google is used as the source of the original word vectors, and the dictionary file shipped with that release is used to map S into a sequence of token IDs. The BERT-base model consists of 12 Transformer layers with a hidden vector dimension of 768; each layer has 12 multi-head attention heads, for a total of about 110M parameters. The vector representation E_0 corresponding to the input sentence S is obtained through BERT:

E_0 = {e_1, ..., e_n}

where each e is the word vector corresponding to one element of the ID sequence.
After the word vectors are obtained, multi-layer convolution operations are used to gradually capture the semantics of the neighboring context of each word vector and fold them into the current word vector, finally yielding the word features, which have better accuracy and robustness than the raw word vectors.
In the embodiment of the present invention, three layers of convolution operations are adopted to blend the contextual semantic information into each word's representation; the kernel size of every convolution layer is 3, and the convolution operation is expressed as:
El+1=ReLU(El*Fl+bl)
where l ∈ [0, 2] denotes the index of the current convolution layer, F and b respectively denote the convolution kernel and the bias of the current layer, and ReLU denotes the linear rectification function used as the activation function. The output of each layer is used as the input of the next layer, yielding in turn the word vector representations E1, E2, E3 that fuse contextual information; the final output E3 is the word feature.
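The three-layer convolution above can be sketched in plain NumPy as follows. The toy dimensions, random kernels, and same-padding choice are assumptions for illustration, not the trained parameters of the model:

```python
import numpy as np

def conv_layer(E, F, b):
    """One layer of E_{l+1} = ReLU(E_l * F_l + b_l).
    E: (n, d) word vectors; F: (3, d, d) width-3 kernel; b: (d,) bias.
    Zero same-padding keeps the sequence length n."""
    n, d = E.shape
    padded = np.vstack([np.zeros((1, d)), E, np.zeros((1, d))])
    out = np.empty((n, d))
    for i in range(n):
        window = padded[i:i + 3]                 # (3, d) local neighborhood
        out[i] = np.einsum("kd,kde->e", window, F) + b
    return np.maximum(out, 0.0)                  # ReLU activation

rng = np.random.default_rng(0)
n, d = 5, 8                                      # toy sizes; BERT-base uses d=768
E = rng.normal(size=(n, d))                      # E0 from BERT
for l in range(3):                               # l in [0, 2]
    F = rng.normal(scale=0.1, size=(3, d, d))
    b = np.zeros(d)
    E = conv_layer(E, F, b)                      # E1, E2, E3 in turn
print(E.shape)  # (5, 8)
```

Stacking three width-3 layers lets each position see a 7-character neighborhood, which is how the surrounding environment semantics are accumulated into the word feature.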
S205: and performing theme generation processing on the information to be classified by using a theme generation model to obtain theme characteristics.
In this embodiment, to compensate for the topic features missing from the word-level feature extraction, the LDA topic model in the open-source toolkit Scikit-Learn is first used to process the training set Dtrain: the number of topics is set to K, and the topic generation model corresponding to Dtrain is produced. When the corpus D to be classified is input, the topic generation model generates the topic probability distribution T corresponding to D, specifically:
T = {t1, ..., tK}

t1 + t2 + ... + tK = 1
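Building the topic generation model with Scikit-Learn's LDA and obtaining the distribution T for a new corpus can be sketched as below. The toy English corpus and K = 2 are assumptions; the real embodiment trains on the labeled Chinese financial corpus Dtrain:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Toy training corpus standing in for D_train (real corpora are far larger).
docs = ["bond issue financing", "stock market price",
        "bond financing instrument", "market price index"]
K = 2                                             # number of topics

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(docs)           # bag-of-words counts
lda = LatentDirichletAllocation(n_components=K, random_state=0).fit(counts)

# Topic probability distribution T for a corpus D to be classified.
T = lda.transform(vectorizer.transform(["bond financing"]))[0]
print(T.shape, round(T.sum(), 6))  # (2,) 1.0
```

`transform` returns a normalized document-topic distribution, matching the constraint that the elements of T sum to 1.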
Subsequently, a set of topic vectors P is initialized, P = {p1, ..., pK}, where each vector p represents one topic. The topic distribution T corresponding to the corpus D to be classified is point-multiplied with the topic vector set P and summed, giving the topic vector corresponding to D, i.e., the topic feature D, specifically:
D = t1·p1 + t2·p2 + ... + tK·pK
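The combination of the topic distribution T with the topic vector set P is a probability-weighted sum, sketched here with toy sizes and random initialization (both assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
K, dim = 4, 8                      # K topics; dim matches the word-feature size
T = rng.dirichlet(np.ones(K))      # topic distribution T = {t1, ..., tK}, sums to 1
P = rng.normal(size=(K, dim))      # initialized topic vectors P = {p1, ..., pK}

# D = t1*p1 + ... + tK*pK : weight each topic vector by its probability and sum.
D = (T[:, None] * P).sum(axis=0)   # equivalently T @ P
print(D.shape)  # (8,)
```

The result D is a single vector summarizing the whole corpus, ready to be fused with the word features.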
S206: and generating a fusion vector by using the word characteristics and the theme characteristics, and performing maximum pooling on the fusion vector to obtain a category distribution vector.
In this embodiment, after the word feature E3 and the topic feature D are obtained, a gate mechanism (Gate Mechanism) is adopted to dynamically enhance or weaken the word vectors in each S according to the topic information of the corpus D to be classified, so that the model can perceive the overall information of the corpus to be classified and achieve a better effect. Specifically, the gate value g is calculated as follows:
g = sigmoid(Wg · D)
where Wg is a transformation matrix and sigmoid is an activation function used to limit the value of g to (0, 1). After the gate value g is obtained, the fusion vector EF is calculated with it as:
EF=g·E3
The fusion vector is the final feature, containing both the overall information and the local information. After the fusion vector is obtained, maximum pooling is applied to it, specifically:
s=maxpooling(EF)
where s is the category distribution vector.
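The gate mechanism and the maximum pooling step can be sketched as follows, assuming a scalar gate produced by a vector-valued Wg (the exact shape of Wg is not fixed in the text, so this is one plausible reading):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(2)
n, dim = 5, 8
E3 = rng.normal(size=(n, dim))   # word features from the convolution stack
D = rng.normal(size=dim)         # topic feature of the corpus to be classified
Wg = rng.normal(size=dim)        # transformation matrix (here a vector, giving
                                 # a single scalar gate -- an assumption)

g = sigmoid(Wg @ D)              # gate value g, limited to (0, 1)
EF = g * E3                      # fusion vector E_F = g . E3
s = EF.max(axis=0)               # maximum pooling over the sequence axis
print(s.shape)  # (8,)
```

Pooling over the sequence axis collapses the per-word fusion vectors into a single category distribution vector s.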
S207: and cross multiplication is carried out on the conversion matrix and the category distribution vector to obtain a classification result.
After the category distribution vector is obtained, it is mapped to the category label space through a fully-connected layer, and the probability of belonging to each event category is finally obtained through a softmax function. With Ce event classes coexisting, the output dimension is Ce + 1, comprising the Ce event classes and one no-event class. Specifically, the final classification result can be obtained by cross-multiplying the transformation matrix and the category distribution vector, where the cross-multiplication formula is:
Pe=We×s
where Pe is the classification result.
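The classification head described above can be sketched as a matrix product followed by softmax; the sizes and random weights are illustrative only:

```python
import numpy as np

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

rng = np.random.default_rng(3)
dim, Ce = 8, 3                       # Ce event classes plus one "no event" class
s = rng.normal(size=dim)             # category distribution vector from pooling
We = rng.normal(size=(Ce + 1, dim))  # fully-connected layer as matrix W_e

Pe = softmax(We @ s)                 # probability of each event category
print(Pe.shape, round(Pe.sum(), 6))  # (4,) 1.0
print(int(Pe.argmax()))              # index of the predicted class
```

The argmax over the Ce + 1 probabilities selects the predicted event category (or the no-event class).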
S208: and performing element extraction processing on each word characteristic to obtain an element group corresponding to the corpus to be classified.
Optionally, in another embodiment, an element group corresponding to the corpus to be classified may also be obtained and used to display the specific content of the corpus to be classified. In the embodiment of the invention, the element group may be set as {time, place, object, trigger word}, and each word in the corpus to be classified may belong to {time, place, object, trigger word, non-element}. The fusion vector is mapped to the element label space through one fully-connected layer, and the probability that each word belongs to each element is obtained through a softmax function. With Ca element classes coexisting, the output dimension is Ca + 1, comprising the Ca element classes and one non-element class.
Specifically, after the fusion vector EF is obtained, the event elements present in S are extracted, i.e., the probability distribution of each word or character {w1, ..., wn} over the set {time, place, object, trigger word, non-element} is predicted. To this end, the invention directly maps the feature vector of each word or character to the label space, specifically:
Pa = Wa × EF
where Pa is the element group.
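A per-word sketch of this element-extraction mapping into the label space {time, place, object, trigger word, non-element}; the fully-connected weights Wa are random placeholders:

```python
import numpy as np

def softmax(x, axis=-1):
    z = np.exp(x - x.max(axis=axis, keepdims=True))
    return z / z.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(4)
n, dim = 5, 8
labels = ["time", "place", "object", "trigger word", "non-element"]
EF = rng.normal(size=(n, dim))            # fusion vector, one row per word
Wa = rng.normal(size=(dim, len(labels)))  # fully-connected layer to label space

Pa = softmax(EF @ Wa)                     # per-word probabilities over labels
pred = [labels[i] for i in Pa.argmax(axis=1)]
print(Pa.shape)  # (5, 5)
```

Collecting the words predicted as time, place, object, and trigger word yields the element group for the corpus.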
S209: and outputting the classification result and the element group.
In a possible implementation, the classification result and the element group are output after being obtained; the specific output path is not limited in this embodiment. For example, they may be output through a preset port.
Based on the above embodiments, a specific text information classification method will be described in the embodiments of the present invention. Referring to fig. 5 and fig. 6, fig. 5 is a process for obtaining a classification model according to an embodiment of the present invention, and fig. 6 is a flowchart of another specific text information classification method according to an embodiment of the present invention.
Before the corpus to be classified is classified, a classification model is obtained according to the flow of fig. 5. In this embodiment, the original corpus is financial information collected from the internet. The financial information is acquired and labeled according to a financial event system; the financial event system may be a mature system from the related literature, or it may be formulated adaptively according to actual requirements. Internet sources may include government agency websites, financial portal websites, WeChat official accounts, and the like. Labeling is done manually, and two parts of content need to be labeled: first, the specific category of the financial news in the corresponding financial event system; second, after character segmentation, the element tuples contained in the financial news, i.e., the element groups.
After labeling is finished, the text information is preprocessed, i.e., filtered, and divided into a training set and a test set (training data and test data). The initial classification model is trained with the training data to obtain the classification model, i.e., the financial news analysis model. The financial news analysis model is then evaluated with the test set, and the news category and the element quadruple, i.e., the element group, are output during testing.
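The training procedure described above (compute a classification loss per corpus, adjust the model parameters accordingly, and stop once the performance index reaches a preset threshold) can be sketched with a toy linear classifier; the synthetic data, learning rate, and threshold are assumptions for illustration:

```python
import numpy as np

def softmax(x, axis=-1):
    z = np.exp(x - x.max(axis=axis, keepdims=True))
    return z / z.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(5)
n_train, dim, classes = 40, 6, 2
W_true = rng.normal(size=(dim, classes))
X = rng.normal(size=(n_train, dim))          # stand-in training features
y = (X @ W_true).argmax(axis=1)              # synthetic labels

W = np.zeros((dim, classes))                 # initial classification model
acc = 0.0
for epoch in range(200):
    probs = softmax(X @ W)
    # classification loss value over the training set (cross entropy)
    loss = -np.log(probs[np.arange(n_train), y]).mean()
    acc = (probs.argmax(axis=1) == y).mean() # performance index
    if acc >= 0.95:                          # threshold reached: stop training
        break
    grad = X.T @ (probs - np.eye(classes)[y]) / n_train
    W -= 0.5 * grad                          # adjust parameters from the loss
print(round(float(loss), 3), round(float(acc), 3))
```

The real embodiment fine-tunes the BERT-based model the same way: loss drives the parameter updates, and training ends when the measured performance index clears the preset threshold.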
After the financial news analysis model is obtained, the corpus to be classified, i.e., the news text, is input; word features are extracted with BERT and multilayer convolution, and topic features (i.e., topic vectors) are generated with LDA. After the word features and topic features are obtained, a fused, enhanced representation is computed to obtain the fusion vector, and the category information and element tuples are acquired from the fusion vector, completing the classification of the corpus to be classified and the extraction of the element tuples.
Specifically, when the corpus to be classified is "On the 23rd, according to information from the Shanghai Clearing House, the fifth 2019 ultra-short-term financing instrument of Xiamen Xiangyu Co., Ltd. was successfully issued on October 21, 2019, with an issuance amount of RMB 1 billion", it is input into the financial news analysis model; the event type output by the model is "short-term financing", and the element group is {October 21, 2019, Xiamen Xiangyu Co., Ltd., ultra-short-term financing instrument}.
In the following, the text information classification device provided in the embodiment of the present invention is introduced, and the text information classification device described below and the text information classification method described above may be referred to in correspondence with each other.
Referring to fig. 3, fig. 3 is a schematic structural diagram of a text information classification apparatus according to an embodiment of the present invention, including:
the obtaining module 310 is configured to obtain corpora to be classified, and filter the corpora to be classified to obtain information to be classified;
the feature extraction module 320 is configured to perform feature extraction processing on the information to be classified to obtain a topic feature and a word feature corresponding to the information to be classified;
and the classification module 330 is configured to perform a classification operation by using the topic features and the word features to obtain a classification result corresponding to the corpus to be classified.
Optionally, the feature extraction module 320 includes:
the word feature extraction unit is used for performing word-level feature extraction on the information to be classified by using the pre-training model to obtain word features corresponding to the information to be classified;
and the theme feature extraction unit is used for performing theme generation processing on the information to be classified by using the theme generation model to obtain the theme features.
Optionally, the classification module 330 includes:
the fusion unit is used for generating a fusion vector by utilizing the word characteristics and the theme characteristics and performing maximum pooling processing on the fusion vector to obtain a category distribution vector;
and the result acquisition unit is used for cross multiplication of the conversion matrix and the category distribution vector to obtain a classification result.
Optionally, the method further comprises:
the element group extraction module is used for performing element extraction processing on each word characteristic to obtain an element group corresponding to the corpus to be classified;
and the output module is used for outputting the classification result and the element group.
Optionally, the obtaining module 310 includes:
and the stop word filtering unit is used for acquiring the filtering dictionary, and performing stop word filtering processing and punctuation filtering processing on the linguistic data to be classified by using the filtering dictionary to obtain information to be classified.
Optionally, the method further comprises:
the training set building module is used for obtaining the original linguistic data and building a training set by utilizing the original linguistic data;
and the training module is used for obtaining the initial classification model and training the initial classification model by utilizing the training set to obtain the classification model.
Optionally, a training module comprising:
the input unit is used for inputting the training set into the initial classification model;
the calculation unit is used for calculating a classification loss value corresponding to each training corpus in the training set;
the adjusting unit is used for adjusting parameters in the initial classification model according to the classification loss values;
and the statistical unit is used for counting the performance indexes corresponding to the initial classification model, and determining the initial classification model as the classification model when the performance indexes reach a preset threshold value.
In the following, the text information classification device provided by the embodiment of the present invention is introduced, and the text information classification device described below and the text information classification method described above may be referred to in correspondence with each other.
Referring to fig. 4, fig. 4 is a schematic structural diagram of a text information classifying device according to an embodiment of the present invention, where the text information classifying device includes a memory and a processor, where:
a memory 410 for storing a computer program;
the processor 420 is configured to execute a computer program to implement the text information classification method.
In the following, the computer-readable storage medium provided by the embodiment of the present invention is introduced, and the computer-readable storage medium described below and the text information classification method described above may be referred to correspondingly.
The present invention also provides a computer-readable storage medium on which a computer program is stored; when executed by a processor, the computer program implements the steps of the above text information classification method. The computer-readable storage medium may include various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.
The embodiments are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same or similar parts among the embodiments are referred to each other. The device disclosed by the embodiment corresponds to the method disclosed by the embodiment, so that the description is simple, and the relevant points can be referred to the method part for description.
Those of skill would further appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the various illustrative components and steps have been described above generally in terms of their functionality in order to clearly illustrate this interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in Random Access Memory (RAM), memory, Read Only Memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.
Finally, it should also be noted that, herein, relational terms such as first and second are used only to distinguish one entity or action from another and do not necessarily require or imply any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus.
The text information classification method, the text information classification device and the computer readable storage medium provided by the present invention are described in detail above, and specific examples are applied in the text to explain the principle and the implementation of the present invention, and the description of the above embodiments is only used to help understanding the method and the core idea of the present invention; meanwhile, for a person skilled in the art, according to the idea of the present invention, there may be variations in the specific embodiments and the application scope, and in summary, the content of the present specification should not be construed as a limitation to the present invention.

Claims (10)

1. A text information classification method is characterized by comprising the following steps:
obtaining linguistic data to be classified, and filtering the linguistic data to be classified to obtain information to be classified;
performing feature extraction processing on the information to be classified to obtain subject features and word features corresponding to the information to be classified;
and performing classification operation by using the theme characteristics and the word characteristics to obtain a classification result corresponding to the corpus to be classified.
2. The method for classifying text information according to claim 1, wherein the performing feature extraction processing on the information to be classified to obtain subject features and word features corresponding to the information to be classified comprises:
performing character-level feature extraction on the information to be classified by using a pre-training model to obtain the word features corresponding to the information to be classified;
and performing theme generation processing on the information to be classified by using a theme generation model to obtain the theme characteristics.
3. The method for classifying text information according to claim 1, wherein said performing classification operation by using the topic feature and the word feature to obtain a classification result corresponding to the corpus to be classified comprises:
generating a fusion vector by using the word features and the theme features, and performing maximum pooling on the fusion vector to obtain a category distribution vector;
and cross-multiplying the transformation matrix and the category distribution vector to obtain the classification result.
4. The method for classifying text information according to claim 1, further comprising, after obtaining the subject feature and the word feature corresponding to the information to be classified:
performing element extraction processing on each word feature to obtain an element group corresponding to the corpus to be classified;
and outputting the classification result and the element group.
5. The method for classifying text information according to claim 1, wherein said filtering the corpus to be classified to obtain information to be classified comprises:
and acquiring a filtering dictionary, and performing stop word filtering processing and punctuation filtering processing on the corpus to be classified by using the filtering dictionary to obtain the information to be classified.
6. The method for classifying text information according to any one of claims 1 to 5, further comprising, before the obtaining the corpus to be classified:
acquiring an original corpus, and constructing a training set by using the original corpus;
and acquiring an initial classification model, and training the initial classification model by using the training set to obtain a classification model.
7. The method for classifying text information according to claim 6, wherein the training the initial classification model by using the training set to obtain a classification model comprises:
inputting the training set into the initial classification model;
calculating a classification loss value corresponding to each training corpus in the training set;
adjusting parameters in the initial classification model according to the classification loss value;
and counting performance indexes corresponding to the initial classification model, and determining the initial classification model as the classification model when the performance indexes reach a preset threshold value.
8. A text information classification apparatus, comprising:
the system comprises an acquisition module, a classification module and a classification module, wherein the acquisition module is used for acquiring linguistic data to be classified and filtering the linguistic data to be classified to obtain information to be classified;
the characteristic extraction module is used for carrying out characteristic extraction processing on the information to be classified to obtain subject characteristics and word characteristics corresponding to the information to be classified;
and the classification module is used for performing classification operation by using the theme characteristics and the word characteristics to obtain a classification result corresponding to the corpus to be classified.
9. A text information classifying device comprising a memory and a processor, wherein:
the memory is used for storing a computer program;
the processor is configured to execute the computer program to implement the text information classification method according to any one of claims 1 to 7.
10. A computer-readable storage medium for storing a computer program, wherein the computer program, when executed by a processor, implements the method of classifying text information according to any one of claims 1 to 7.
CN202010420260.XA 2020-05-18 2020-05-18 Text information classification method, device, equipment and readable storage medium Pending CN111611379A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010420260.XA CN111611379A (en) 2020-05-18 2020-05-18 Text information classification method, device, equipment and readable storage medium


Publications (1)

Publication Number Publication Date
CN111611379A true CN111611379A (en) 2020-09-01

Family

ID=72204876

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010420260.XA Pending CN111611379A (en) 2020-05-18 2020-05-18 Text information classification method, device, equipment and readable storage medium

Country Status (1)

Country Link
CN (1) CN111611379A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112445897A (en) * 2021-01-28 2021-03-05 京华信息科技股份有限公司 Method, system, device and storage medium for large-scale classification and labeling of text data

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104408093A (en) * 2014-11-14 2015-03-11 中国科学院计算技术研究所 News event element extracting method and device
CN105808722A (en) * 2016-03-08 2016-07-27 苏州大学 Information discrimination method and system
KR20170034206A (en) * 2015-09-18 2017-03-28 아주대학교산학협력단 Apparatus and Method for Topic Category Classification of Social Media Text based on Cross-Media Analysis
CN110569351A (en) * 2019-09-02 2019-12-13 北京猎云万罗科技有限公司 Network media news classification method based on restrictive user preference


Similar Documents

Publication Publication Date Title
CN111325029B (en) Text similarity calculation method based on deep learning integrated model
WO2022116536A1 (en) Information service providing method and apparatus, electronic device, and storage medium
CN110414004B (en) Method and system for extracting core information
CN111221944B (en) Text intention recognition method, device, equipment and storage medium
CN112818093A (en) Evidence document retrieval method, system and storage medium based on semantic matching
CN111414746A (en) Matching statement determination method, device, equipment and storage medium
CN114330354A (en) Event extraction method and device based on vocabulary enhancement and storage medium
CN114048354B (en) Test question retrieval method, device and medium based on multi-element characterization and metric learning
CN115759119A (en) Financial text emotion analysis method, system, medium and equipment
CN111241271B (en) Text emotion classification method and device and electronic equipment
CN115017879A (en) Text comparison method, computer device and computer storage medium
CN113672731B (en) Emotion analysis method, device, equipment and storage medium based on field information
CN111680501B (en) Query information identification method and device based on deep learning and storage medium
CN112613293A (en) Abstract generation method and device, electronic equipment and storage medium
CN111611379A (en) Text information classification method, device, equipment and readable storage medium
CN113761875B (en) Event extraction method and device, electronic equipment and storage medium
CN114492390A (en) Data expansion method, device, equipment and medium based on keyword recognition
CN115129863A (en) Intention recognition method, device, equipment, storage medium and computer program product
CN113688633A (en) Outline determination method and device
CN114638229A (en) Entity identification method, device, medium and equipment of record data
CN114239555A (en) Training method of keyword extraction model and related device
CN114328894A (en) Document processing method, document processing device, electronic equipment and medium
CN114595324A (en) Method, device, terminal and non-transitory storage medium for power grid service data domain division
CN113821571A (en) Food safety relation extraction method based on BERT and improved PCNN
CN114117057A (en) Keyword extraction method of product feedback information and terminal equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20200901