WO2020199595A1 - 基于词袋模型的长文本分类方法、装置、计算机设备及存储介质 - Google Patents

基于词袋模型的长文本分类方法、装置、计算机设备及存储介质 Download PDF

Info

Publication number
WO2020199595A1
WO2020199595A1 PCT/CN2019/117706
Authority
WO
WIPO (PCT)
Prior art keywords
bag
words
feature vector
model
long text
Prior art date
Application number
PCT/CN2019/117706
Other languages
English (en)
French (fr)
Inventor
金戈
徐亮
Original Assignee
平安科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司 filed Critical 平安科技(深圳)有限公司
Publication of WO2020199595A1 publication Critical patent/WO2020199595A1/zh

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/35 Clustering; Classification

Definitions

  • This application relates to the technical field of text classification, and in particular to a long text classification method, device, computer equipment and storage medium based on a bag of words model.
  • Text classification is an important application of natural language processing, and it can also be said to be the most basic application.
  • Common text classification applications include: news text classification, information retrieval, sentiment analysis, intention judgment, etc.
  • Current long text classification models are mainly based on word vector features and deep learning models. Although such models achieve high accuracy, they require substantial computing power; they cannot offer both high accuracy and low performance requirements, which limits some applications, such as mobile applications.
  • the embodiments of the present application provide a long text classification method, device, computer equipment, and storage medium based on a bag-of-words model, which have high classification accuracy and low requirements for calculation performance.
  • this application provides a long text classification method based on a bag-of-words model.
  • the method includes: obtaining a long text to be classified; filtering out noise characters in the long text according to a preset rule; extracting, based on a first bag-of-words model, a first bag-of-words feature vector from the long text with the noise characters filtered out, where the dictionary of the first bag-of-words model includes several words; extracting, based on a second bag-of-words model, a second bag-of-words feature vector from the long text with the noise characters filtered out, where the dictionary of the second bag-of-words model includes several single characters; and classifying, based on a classification model, the long text to be classified according to the first bag-of-words feature vector and the second bag-of-words feature vector to obtain classification data.
  • this application provides a long text classification device based on a bag of words model, the device comprising:
  • the long text obtaining module is used to obtain the long text to be classified
  • the filtering module is used to filter out noise characters in the long text according to preset rules
  • the first extraction module is configured to extract a first bag-of-words feature vector from the long text with the noise characters filtered out based on the first bag-of-words model, and the dictionary of the first bag-of-words model includes several words;
  • the second extraction module is configured to extract a second bag-of-words feature vector from the long text from which the noise characters are filtered out based on the second bag-of-words model, and the dictionary of the second bag-of-words model includes several single words;
  • the classification module is configured to classify the long text to be classified according to the first bag-of-words feature vector and the second bag-of-words feature vector to obtain classification data based on the classification model.
  • the present application provides a computer device that includes a memory and a processor; the memory is used to store a computer program; the processor is used to execute the computer program and, when executing the computer program, implement the above-mentioned long text classification method based on the bag-of-words model.
  • the present application provides a computer-readable storage medium that stores a computer program; when the computer program is executed by a processor, the above-mentioned long text classification method based on the bag-of-words model is implemented.
  • This application discloses a long text classification method, device, computer equipment and storage medium based on a bag of words model.
  • the word-level feature vector of the long text with noise characters filtered out is extracted through the first bag-of-words model, and the character-level feature vector of the long text with noise characters filtered out is extracted through the second bag-of-words model; the long text is then classified according to the word-level feature vector and the character-level feature vector. The classification is based on richer information, so the classification result is more accurate; at the same time, the amount of feature vector data extracted by the bag-of-words models is small, and the requirement for computing power is low.
  • FIG. 1 is a schematic flowchart of a long text classification method based on a bag-of-words model according to an embodiment of this application;
  • FIG. 2 is a schematic diagram of an application scenario of the long text classification method in FIG. 1;
  • FIG. 3 is a schematic diagram of a sub-process of filtering noise characters in FIG. 1;
  • FIG. 4 is a schematic diagram of a sub-process of constructing the dictionary in the first bag-of-words model;
  • FIG. 5 is a schematic diagram of a sub-process of extracting a first bag-of-words feature vector in FIG. 1;
  • FIG. 6 is a schematic diagram of a sub-process of constructing the dictionary in the second bag-of-words model;
  • FIG. 7 is a schematic diagram of a sub-process of extracting a second bag-of-words feature vector in FIG. 1;
  • FIG. 8 is a schematic diagram of a sub-process of classifying long text in FIG. 1;
  • FIG. 9 is a schematic diagram of a sub-process of the random forest training phase;
  • FIG. 10 is a schematic flowchart of a long text classification method based on a bag-of-words model according to another embodiment of this application;
  • FIG. 11 is a schematic diagram of a sub-process of training the first dimensionality reduction model;
  • FIG. 12 is a schematic diagram of a sub-process of training the second dimensionality reduction model;
  • FIG. 13 is a schematic diagram of a sub-process of classifying long text in FIG. 10;
  • FIG. 14 is a schematic diagram of a sub-process of training a random forest model;
  • FIG. 15 is a schematic structural diagram of a long text classification device based on a bag-of-words model according to an embodiment of this application;
  • FIG. 16 is a schematic structural diagram of a computer device provided by an embodiment of this application.
  • the embodiments of the present application provide a long text classification method, device, equipment and storage medium based on a bag of words model.
  • the long text classification method can be applied to a terminal or a server for news text classification, information retrieval, sentiment analysis, intention judgment, etc.
  • For example, the long text classification method based on the bag-of-words model can be used in a server; it can of course also be used in a terminal, such as a mobile phone or a notebook computer.
  • the following embodiments will introduce the long text classification method applied to the server in detail.
  • FIG. 1 is a schematic flowchart of a long text classification method based on a bag of words model provided by an embodiment of the present application.
  • the long text classification method based on the bag of words model includes the following steps:
  • Step S110 Obtain a long text to be classified.
  • the long text to be classified may be text stored locally by the device implementing the long text classification method based on the bag-of-words model, text obtained by the device from the network, text obtained from an input device connected to the device, text obtained from another electronic device, or text converted from voice information.
  • the server obtains the long text to be classified from the terminal. Both the server and the terminal are connected to the Internet. After the user enters the long text at the terminal, the terminal transmits the long text to the server.
  • Step S120 Filter out noisy characters in the long text according to a preset rule.
  • noise characters such as special symbols and non-Chinese characters in the long text are filtered out according to preset rules.
  • step S120 filters out noisy characters in the long text according to a preset rule, which specifically includes:
  • Step S121 Obtain a preset stop word database.
  • the stop word database includes several stop words.
  • some special symbols, non-Chinese characters and other noise characters can be specified as stop words according to application scenarios, so as to construct a stop word database and save it in the form of a configuration file.
  • When executing step S121, the server retrieves the stop word database related to the application scenario.
  • Stop words can be, for example, punctuation marks or function words such as "的" and "得"; these words can be regarded as invalid words, which affect subsequent operations as noise and need to be removed.
  • Step S122 If the stop word is found in the long text, delete the stop word in the long text or replace it with a preset symbol.
  • In some embodiments, each stop word in the stop word database is looked up in the long text, and if it appears, the stop word is deleted from the long text; in other embodiments, if a stop word from the stop word database appears in the long text, it is replaced with a preset symbol, such as a space, to preserve the structure of the long text to some extent.
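A rough sketch of this filtering step (not the application's reference implementation) is given below; the sample stop words and the choice of a space as the replacement symbol are illustrative assumptions.

```python
# Minimal sketch of step S120: filter noise characters using a preset stop-word list.
# The stop words below are illustrative assumptions, not an exhaustive database.
STOP_WORDS = ["，", "。", "！", "？", "的", "得"]

def filter_noise(text: str, replace_with_space: bool = True) -> str:
    """Delete each stop word found in the text, or replace it with a space to
    roughly preserve the original text structure (steps S121 and S122)."""
    for stop_word in STOP_WORDS:
        text = text.replace(stop_word, " " if replace_with_space else "")
    return text

print(filter_noise("小明喜欢看电影，小明也喜欢踢足球。"))
```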
  • Step S130 based on the first bag-of-words model, extract a first bag-of-words feature vector from the long text from which the noise characters are filtered out.
  • Bag-of-words (BOW) is a representation of text that describes the occurrence of word elements within a document.
  • The bag-of-words model is a way of representing text data when modeling text with machine learning algorithms. It involves two aspects: a vocabulary of known words, and a measure of the presence of those known words.
  • The bag-of-words model divides a piece of text into words and, figuratively, puts all the words into a bag, ignoring word order, grammar, syntax and other elements, treating the text as merely a collection of words; the occurrence of each word in the text is independent and does not depend on the occurrence of other words.
  • the dictionary of the first bag-of-words model includes several words.
  • Exemplarily, the words in the dictionary of the first bag-of-words model may be simple words or compound words.
  • A simple word expresses a single meaning and cannot be split apart, such as the monosyllabic simple words 人 (person), 鸟 (bird), 山 (mountain), 高 (high), 绿 (green), 走 (walk) and 飞 (fly), or disyllabic and polysyllabic simple words such as 仿佛, 忐忑, 玲珑, 腼腆, 蜻蜓, 徘徊, 蝙蝠, 葡萄, 沙发, 扑通, 布谷, 高尔夫 and 丁零当啷.
  • A compound word is composed of several morphemes that still carry meaning when taken apart, such as 长短, 开放, 雕塑, 快餐, 特区, 电脑, 招标, 投资, 牵头, 扩大, 延长, 布匹, 案件, 房间, 国庆, 夏至, 河流, 老师, 阿姨, 第一 and 刚刚, verb-complement phrases such as 忍俊不禁 and 回味无穷, and idiomatic phrases such as 举世无双, 心痛如割, 险象迭生, 问心无愧, 玲珑剔透 and 热烈欢迎.
  • In some optional embodiments, as shown in FIG. 4, the construction process of the dictionary in the first bag-of-words model includes the following steps:
  • S11. Obtain training data, where the training data includes several sample long texts.
  • Specifically, the sample long texts share some commonality with the long text to be classified, that is, they are related to the application scenario of the long text classification method; for example, they come from the same source, the same scenario or the same purpose, such as all being derived from news texts.
  • Specifically, noise characters have also been removed from the sample long texts in the training data.
  • S12. Obtain the words in the dictionary of the first bag-of-words model according to the sample long texts in the training data.
  • Exemplarily, the training data includes two sample long texts, "小明喜欢看电影" (Xiao Ming likes watching movies) and "小明也喜欢踢足球" (Xiao Ming also likes playing football). From these two sample long texts, the dictionary of the first bag-of-words model can be constructed as {1: "小明", 2: "喜欢", 3: "看", 4: "电影", 5: "也", 6: "踢", 7: "足球"}.
  • According to the order in which the words are arranged in the dictionary, each word corresponds to its own index number.
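One possible sketch of this dictionary construction is shown below; segmenting Chinese text into words is assumed to be done with the third-party jieba tokenizer, which the application does not name.

```python
# Sketch of building the two dictionaries from training texts (steps S11-S12 and S21-S22).
# Using the jieba library for word segmentation is an assumption of this sketch.
import jieba

samples = ["小明喜欢看电影", "小明也喜欢踢足球"]

def build_word_dictionary(texts):
    """First bag-of-words model: one entry per distinct word, indexed from 1."""
    vocab = {}
    for text in texts:
        for word in jieba.lcut(text):
            vocab.setdefault(word, len(vocab) + 1)
    return vocab

def build_char_dictionary(texts):
    """Second bag-of-words model: one entry per distinct single character, indexed from 1."""
    vocab = {}
    for text in texts:
        for char in text:
            vocab.setdefault(char, len(vocab) + 1)
    return vocab

word_dict = build_word_dictionary(samples)   # e.g. {"小明": 1, "喜欢": 2, "看": 3, ...}
char_dict = build_char_dictionary(samples)   # e.g. {"小": 1, "明": 2, "喜": 3, ...}
```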
  • step S130 is based on the first bag-of-words model to extract the first bag-of-words feature vector from the long text from which the noise characters are filtered out, which specifically includes:
  • Step S131 Initialize the first bag-of-words feature vector of all zeros.
  • the elements in the first bag-of-words feature vector correspond one-to-one with words in the dictionary of the first bag-of-words model.
  • Step S132 Count the number of occurrences of each of the words in the long text with the noise characters filtered out.
  • Step S133 Assign a value to the corresponding element in the first bag-of-words feature vector according to the number of times the word appears in the long text.
  • Exemplarily, if the long text with noise characters removed is "小明喜欢看电影" (Xiao Ming likes watching movies), the first bag-of-words feature vector is [1, 1, 1, 1, 0, 0, 0]. If the long text with noise characters removed is "小明喜欢看电影小明也喜欢踢足球" (Xiao Ming likes watching movies; Xiao Ming also likes playing football), the first bag-of-words feature vector is [2, 2, 1, 1, 1, 1, 1].
  • Step S140 based on the second bag-of-words model, extract a second bag-of-words feature vector from the long text with the noise characters filtered out.
  • the dictionary of the second bag-of-words model includes several single words.
  • In some optional embodiments, as shown in FIG. 6, the construction process of the dictionary in the second bag-of-words model includes the following steps:
  • S21. Obtain training data, where the training data includes several sample long texts.
  • Specifically, the sample long texts share some commonality with the long text to be classified, that is, they are related to the application scenario of the long text classification method; for example, they come from the same source, the same scenario or the same purpose, such as all being derived from news texts.
  • Specifically, noise characters have also been removed from the sample long texts in the training data.
  • S22. Obtain the single characters in the dictionary of the second bag-of-words model according to the sample long texts in the training data.
  • Exemplarily, the training data includes two sample long texts, "小明喜欢看电影" and "小明也喜欢踢足球". From these two sample long texts, the dictionary of the second bag-of-words model can be constructed as {1: "小", 2: "明", 3: "喜", 4: "欢", 5: "看", 6: "电", 7: "影", 8: "也", 9: "踢", 10: "足", 11: "球"}.
  • According to the order in which the single characters are arranged in the dictionary, each character corresponds to its own index number.
  • step S140 is based on the second bag-of-words model to extract a second bag-of-words feature vector from the long text with the noise characters filtered out, which specifically includes:
  • Step S141 Initialize a second bag-of-words feature vector of all zeros.
  • The elements in the second bag-of-words feature vector correspond one-to-one with the single characters in the dictionary of the second bag-of-words model.
  • Step S142 Count the number of occurrences of each of the single characters in the long text from which the noise characters are filtered out.
  • Step S143 Assign a value to the corresponding element in the second bag-of-words feature vector according to the number of times the single character appears in the long text.
  • Exemplarily, if the long text with noise characters removed is "小明喜欢看电影", the second bag-of-words feature vector is [1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0]. If the long text with noise characters removed is "小明喜欢看电影小明也喜欢踢足球", the second bag-of-words feature vector is [2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1].
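The counting described in steps S131 to S133 and S141 to S143 can be sketched as follows; it reuses the dictionaries and tokenizer from the previous sketch and is only an illustration of the described counting.

```python
# Sketch of steps S131-S133 / S141-S143: turn a cleaned long text into count vectors.
def to_bow_vector(tokens, dictionary):
    """Initialize an all-zero vector whose elements map one-to-one to dictionary entries,
    then set each element to the number of occurrences of its word or character."""
    vector = [0] * len(dictionary)
    for token in tokens:
        index = dictionary.get(token)
        if index is not None:       # tokens outside the dictionary are simply ignored
            vector[index - 1] += 1  # dictionary indices are 1-based
    return vector

text = "小明喜欢看电影小明也喜欢踢足球"
first_vec = to_bow_vector(jieba.lcut(text), word_dict)   # word level, e.g. [2, 2, 1, 1, 1, 1, 1]
second_vec = to_bow_vector(list(text), char_dict)        # character level, e.g. [2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1]
```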
  • Step S150 Based on the classification model, classify the long text to be classified according to the first bag-of-words feature vector and the second bag-of-words feature vector to obtain classification data.
  • The bag-of-words feature vectors extracted from the long text based on the bag-of-words models can represent some characteristics of the long text to be classified, and the long text can therefore be classified according to the first bag-of-words feature vector and the second bag-of-words feature vector; exemplarily, the long text to be classified is attached with the category label obtained by classification, for example, classifying the long text into a category such as society, entertainment, economy or archaeology.
  • The long text to be classified is classified according to both the characteristics of the words in the text, represented by the first bag-of-words feature vector, and the characteristics of the individual characters in the text, represented by the second bag-of-words feature vector; the classification is based on richer information, so the classification result is more accurate.
  • step S150 is based on a classification model and classifies the long text to be classified according to the first bag-of-words feature vector and the second bag-of-words feature vector to obtain a classification Data, including:
  • Step S151 Fusion the first bag-of-words feature vector and the second bag-of-words feature vector.
  • the one-dimensional first bag-of-words feature vector and the one-dimensional second bag-of-words feature vector are spliced into a one-dimensional fusion vector.
  • Step S152 Input the fused vector into the trained random forest model to obtain the category of the long text to be classified.
  • the fused vector can characterize not only the characteristics of the words in the text to be classified but also the characteristics of the individual characters in the text to be classified, so the information it expresses is richer.
  • the long text to be classified is classified according to the fused vector, and the classification result is more accurate.
  • In some optional embodiments, obtaining the category of the long text to be classified according to the fused vector can be achieved through a variety of classification models, such as an artificial neural network model, a KNN algorithm model, a support vector machine (SVM) algorithm model, or a decision tree algorithm model.
  • the long text to be classified is classified based on a random forest model to obtain the category of the long text to be classified.
  • The random forest model includes several decision trees, and the decision trees in the random forest are independent of one another.
  • When the fused vector is input into the trained random forest model, each decision tree in the random forest model makes its own classification judgment; the category selected by the most decision trees is then predicted as the category of the long text corresponding to this vector.
  • Random forest is a non-traditional machine learning algorithm, composed of multiple decision trees, each of which processes a subset of training samples.
  • For example, the training samples of the random forest model include multiple sample long texts, each labeled with its type; for each sample long text, the first bag-of-words feature vector and the second bag-of-words feature vector are extracted and fused to obtain a training vector, yielding multiple training vectors; subsets of training vectors are then repeatedly sampled with replacement from the multiple training vectors to form multiple training sample subsets; each decision tree is then trained on its corresponding training sample subset.
  • the establishment and training of the random forest model can be achieved through the sklearn library in Python.
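Since the application notes that the random forest model can be established and trained with the sklearn library in Python, a minimal illustration of the fuse-then-classify step might look like the sketch below; the toy training data, number of trees and category ids are assumptions, and first_vec and second_vec are reused from the earlier sketch.

```python
# Sketch of steps S151-S152 with scikit-learn: concatenate the two bag-of-words
# feature vectors and classify the fused vector with a trained random forest.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Toy training data: each row is a fused vector (7 word-level counts + 11 character-level
# counts); the labels are illustrative category ids (e.g. 0 = society, 1 = entertainment).
X_train = np.array([[1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0],
                    [2, 2, 1, 1, 1, 1, 1, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1]])
y_train = np.array([0, 1])

clf = RandomForestClassifier(n_estimators=100, random_state=0)  # tree count is an assumption
clf.fit(X_train, y_train)

fused = np.concatenate([first_vec, second_vec]).reshape(1, -1)  # step S151: fuse the two vectors
print(clf.predict(fused))                                       # step S152: predicted category
```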
  • In the training phase, features are selected through node splitting of the decision trees, and the training vectors are subdivided layer by layer until each training sample subset is correctly classified.
  • In the testing phase, samples are classified directly based on the trained decision trees.
  • the process of the random forest training phase is specifically:
  • Step S31 Randomly select, with replacement, n training vectors from the N type-labeled training vectors, and use the selected n training vectors as the samples at the root node of one decision tree.
  • Here, n is a natural number not greater than N; sampling with replacement means that one sample is randomly selected each time and then put back before the next selection; the selected n training vectors are used to train one decision tree and serve as the samples at the root node of that decision tree.
  • Step S32 Randomly select m attributes from the M attributes of the training vector, and select one of the m attributes as the splitting attribute of the corresponding node of the decision tree according to a preset strategy.
  • Exemplarily, assuming the dimension of the training vector is M, whenever a node of the decision tree needs to split, m attributes are randomly selected from these M attributes, satisfying the condition m << M; a certain strategy, such as information gain, is then applied among the m attributes to select one attribute as the splitting attribute of that node.
  • Step S33 Split the corresponding nodes of the decision tree until no further splitting is possible, so as to establish the decision tree.
  • no more splitting means that all the leaf nodes are reached, that is, convergence; if the next attribute selected by the node is the attribute used when its parent node splits, then the node is a leaf node.
  • Step S34 Establish a preset number of decision trees to form a random forest.
  • the randomness of random forest is reflected in that the training samples of each decision tree are random, and the split attribute set of each node in each decision tree is also randomly selected and determined. With these two random guarantees, the probability of overfitting in the random forest can be reduced, thereby improving the accuracy of classification.
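To make the two sources of randomness in steps S31 to S34 concrete, the sketch below shows bootstrap sampling of training vectors and random selection of m of the M attributes at a node; it is a simplified illustration with placeholder data, not a full decision tree builder.

```python
# Sketch of the two random choices behind steps S31-S34.
import random

def bootstrap_sample(training_vectors, n):
    """Step S31: draw n training vectors with replacement for one tree's root node."""
    return [random.choice(training_vectors) for _ in range(n)]

def random_attribute_subset(num_attributes, m):
    """Step S32: pick m of the M attributes (m << M) as split candidates for a node."""
    return random.sample(range(num_attributes), m)

# Example with placeholder 18-dimensional fused training vectors.
vectors = [[0] * 18 for _ in range(100)]
root_samples = bootstrap_sample(vectors, n=80)
candidate_attributes = random_attribute_subset(18, m=4)  # one is then chosen by e.g. information gain
```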
  • In some optional implementations, as shown in FIG. 10, before step S150 of classifying, based on the classification model, the long text to be classified according to the first bag-of-words feature vector and the second bag-of-words feature vector to obtain classification data, the method further includes:
  • Step S101 Perform dimensionality reduction on the first bag-of-words feature vector based on the first dimensionality reduction model.
  • the scale of the first bag-of-words feature vector may be very large, and the scale of the feature vector can be reduced on the basis of retaining most of the information of the first bag-of-words feature vector by dimensionality reduction to reduce the amount of calculation.
  • the training method of the first dimensionality reduction model includes:
  • Step S41 Obtain an initial first dimensionality reduction model.
  • the first dimensionality reduction model includes a first coding layer and a first decoding layer; the dimensions of the input of the first coding layer and the dimensions of the output of the first decoding layer are both equal to the number of words in the dictionary of the first bag-of-words model, and the output of the first coding layer is used as the input of the first decoding layer.
  • Step S42 Input the first dimensionality reduction training sample into the first dimensionality reduction model to obtain the output vector of the first decoding layer.
  • the first coding layer obtains the hidden features of the first dimensionality reduction training sample through coding processing, and reduces the dimensionality of the first dimensionality reduction training sample; the first decoding layer restores the hidden features through decoding.
  • Step S43 Adjust the parameters of the first dimensionality reduction model according to the loss between the output vector of the first decoding layer and the first dimensionality reduction training sample.
  • the training objective of the first dimensionality reduction model is to minimize the difference between the input vector and the output vector, so as to ensure that the implicit features output by the first coding layer retain the original input information and reduce the dimensionality.
  • Exemplarily, the training loss function is the mean squared error (MSE), and the optimization algorithm is the adaptive moment estimation (Adam) optimization algorithm.
  • ADAM optimization algorithm is a first-order optimization algorithm that can replace the traditional stochastic gradient descent process. It can iteratively update neural network weights based on training data.
  • Step S44 If the loss between the output vector and the first dimensionality reduction training sample meets a preset training condition, use the output of the first coding layer as the output of the first dimensionality reduction model.
  • Exemplarily, when the difference between the input vector and the output vector is less than a preset threshold, the loss satisfies the preset training condition and the training goal is achieved.
  • The output of the first coding layer can then be used as the dimensionality-reduced version of the input vector.
  • Step S102 Perform dimensionality reduction on the second bag-of-words feature vector based on the second dimensionality reduction model.
  • the scale of the second bag-of-words feature vector may be very large, and the scale of the feature vector can be reduced on the basis of retaining most of the information of the second bag-of-words feature vector by dimensionality reduction to reduce the amount of calculation.
  • the training method of the second dimensionality reduction model includes:
  • Step S51 Obtain an initial second dimensionality reduction model.
  • the second dimensionality reduction model includes a second coding layer and a second decoding layer; the dimensions of the input of the second coding layer and the dimensions of the output of the second decoding layer are both equal to the number of single characters in the dictionary of the second bag-of-words model, and the output of the second coding layer is used as the input of the second decoding layer.
  • Step S52 Input a second dimensionality reduction training sample into the second dimensionality reduction model to obtain an output vector of the second decoding layer.
  • the second coding layer obtains the hidden features of the second dimensionality reduction training sample through coding processing, and reduces the dimensionality of the second dimensionality reduction training sample; the second decoding layer restores the hidden features through decoding.
  • Step S53 Adjust the parameters of the second dimensionality reduction model according to the loss between the output vector of the second decoding layer and the second dimensionality reduction training sample.
  • the training objective of the second dimensionality reduction model is to minimize the difference between the input vector and the output vector, so as to ensure that the hidden features of the output of the second coding layer retain the original input information and the dimensionality is reduced.
  • Step S54 If the loss between the output vector and the second dimensionality reduction training sample meets a preset training condition, use the output of the second coding layer as the output of the second dimensionality reduction model.
  • Exemplarily, when the difference between the input vector and the output vector is less than a preset threshold, the loss satisfies the preset training condition and the training goal is achieved.
  • The output of the second coding layer can then be used as the dimensionality-reduced version of the input vector.
  • the establishment and training of the first dimensionality reduction model and the second dimensionality reduction model can be implemented through the tensorflow library in Python.
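Since the application mentions that the two dimensionality reduction models can be established and trained with the tensorflow library in Python, a minimal autoencoder sketch consistent with the description (encoder input size and decoder output size equal to the dictionary size, MSE loss, Adam optimizer, encoder output used as the reduced vector) could look like the following; the dictionary size, hidden dimension, placeholder data and epoch count are assumptions.

```python
# Sketch of the first dimensionality reduction model (steps S41-S44) as an autoencoder.
import numpy as np
import tensorflow as tf

dict_size = 5000   # assumed number of words in the first bag-of-words dictionary
hidden_dim = 128   # assumed dimension of the reduced (hidden) feature

inputs = tf.keras.Input(shape=(dict_size,))
encoded = tf.keras.layers.Dense(hidden_dim, activation="relu")(inputs)  # first coding layer
decoded = tf.keras.layers.Dense(dict_size, activation="relu")(encoded)  # first decoding layer

autoencoder = tf.keras.Model(inputs, decoded)
encoder = tf.keras.Model(inputs, encoded)          # its output is the dimensionality-reduced vector
autoencoder.compile(optimizer="adam", loss="mse")  # MSE loss and Adam optimizer, as described

X = np.random.randint(0, 3, size=(1000, dict_size)).astype("float32")  # placeholder count vectors
autoencoder.fit(X, X, epochs=5, batch_size=32, verbose=0)  # train until reconstruction loss is small enough

reduced = encoder.predict(X[:1])  # dimensionality-reduced first bag-of-words feature vector
```

The second dimensionality reduction model would follow the same pattern, with the input and output dimensions equal to the number of single characters in the second dictionary.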
  • step S150 is based on the classification model to classify the long text to be classified according to the first bag-of-words feature vector and the second bag-of-words feature vector to obtain classification data, which specifically includes:
  • Step S153 Based on the classification model, classify the long text to be classified according to the reduced-dimensionality first bag-of-words feature vector and the reduced-dimensionality second bag-of-words feature vector to obtain classification data.
  • The bag-of-words feature vectors extracted from the long text based on the bag-of-words models can represent some characteristics of the long text to be classified, and most of that information is still retained after the first bag-of-words feature vector and the second bag-of-words feature vector are respectively reduced in dimensionality. The long text to be classified can therefore be classified according to the dimensionality-reduced first bag-of-words feature vector and the dimensionality-reduced second bag-of-words feature vector to obtain classification data, for example, classifying the long text into a category such as society, entertainment, economy or archaeology.
  • Specifically, as shown in FIG. 13, step S153 of classifying, based on the classification model, the long text to be classified according to the dimensionality-reduced first bag-of-words feature vector and the dimensionality-reduced second bag-of-words feature vector to obtain classification data specifically includes:
  • Step S1531 Fuse the dimensionality-reduced first bag-of-words feature vector with the dimensionality-reduced second bag-of-words feature vector.
  • the dimensionality-reduced first bag-of-words feature vector and the dimensionality-reduced second bag-of-words feature vector are spliced into a one-dimensional fusion vector.
  • Step S1532 Input the fused vector into the trained random forest model to obtain the category of the long text to be classified.
  • the fused vector can characterize not only the characteristics of the words in the text to be classified but also the characteristics of the individual characters in the text to be classified, so the information it expresses is richer.
  • the long text to be classified is classified according to the fused vector, and the classification result is more accurate.
  • the training method of the random forest model includes:
  • Step S61 Obtain a sample long text and a classification mark corresponding to the sample long text.
  • Specifically, the sample long texts share some commonality with the long text to be classified, that is, they are related to the application scenario of the long text classification method; for example, they come from the same source, the same scenario or the same purpose, such as all being derived from news texts.
  • the long text of each sample corresponds to the corresponding classification mark, such as society, entertainment, economy, or archaeology.
  • Step S62 Filter out noise characters in the sample long text according to a preset rule.
  • Specifically, a preset stop word database is first obtained, and the stop word database includes several stop words; if a stop word is found in the sample long text, the stop word is deleted from the sample long text or replaced with a preset symbol.
  • Step S63 Based on the first bag-of-words model, extract a first sample feature vector from the sample long text from which the noise character is filtered out.
  • the dictionary of the first bag-of-words model includes several words.
  • Step S64 Based on the second bag-of-words model, extract a second sample feature vector from the sample long text from which the noise character is filtered out.
  • the dictionary of the second bag-of-words model includes several single words.
  • Step S65 Perform dimensionality reduction on the first sample feature vector based on the first dimensionality reduction model, and perform dimensionality reduction on the second sample feature vector based on the second dimensionality reduction model.
  • In some application scenarios, the scale of the first sample feature vector and the second sample feature vector may be very large; dimensionality reduction can reduce the scale of the feature vectors while retaining most of their information, thereby reducing the amount of calculation.
  • Step S66 Combine the dimensionality-reduced first sample feature vector and the dimensionality-reduced second sample feature vector into a training vector corresponding to the classification mark.
  • the dimensionality-reduced first sample feature vector and the dimensionality-reduced second sample feature vector are spliced into a one-dimensional fusion vector.
  • Step S67 Train the random forest model according to a number of the training vectors and the classification marks corresponding to each of the training vectors.
  • the random forest model is trained according to the aforementioned steps S31 to S34.
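Putting steps S61 to S67 together, a hedged end-to-end training sketch might look like the following; it reuses filter_noise, to_bow_vector, word_dict and char_dict from the earlier sketches, while word_encoder and char_encoder are untrained stand-ins for the encoder halves of the two dimensionality reduction models, and the sample texts and labels are illustrative assumptions.

```python
# Sketch of the random forest training flow (steps S61-S67).
import jieba
import numpy as np
import tensorflow as tf
from sklearn.ensemble import RandomForestClassifier

def make_encoder(input_dim, hidden_dim=4):
    """Stand-in for the trained encoder half of a dimensionality reduction model."""
    inp = tf.keras.Input(shape=(input_dim,))
    return tf.keras.Model(inp, tf.keras.layers.Dense(hidden_dim, activation="relu")(inp))

word_encoder = make_encoder(len(word_dict))   # first dimensionality reduction model (step S65)
char_encoder = make_encoder(len(char_dict))   # second dimensionality reduction model (step S65)

sample_texts = ["小明喜欢看电影", "小明也喜欢踢足球"]   # step S61: sample long texts
labels = [1, 0]                                          # illustrative classification marks

training_vectors = []
for text in sample_texts:
    cleaned = filter_noise(text)                                      # step S62
    v1 = to_bow_vector(jieba.lcut(cleaned), word_dict)                # step S63
    v2 = to_bow_vector(list(cleaned), char_dict)                      # step S64
    r1 = word_encoder.predict(np.array([v1], dtype="float32"))[0]     # step S65
    r2 = char_encoder.predict(np.array([v2], dtype="float32"))[0]     # step S65
    training_vectors.append(np.concatenate([r1, r2]))                 # step S66

forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(np.array(training_vectors), labels)                        # step S67
```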
  • In the long text classification method based on the bag-of-words model provided by the above embodiments, the word-level feature vector of the long text with noise characters filtered out is extracted through the first bag-of-words model, the character-level feature vector of the long text with noise characters filtered out is extracted through the second bag-of-words model, and the long text is then classified according to the word-level feature vector and the character-level feature vector; the classification is based on richer information, so the classification result is more accurate; at the same time, the amount of feature vector data extracted by the bag-of-words models is small, and the requirement for computing power is low.
  • FIG. 15 is a schematic structural diagram of a long text classification device based on a bag-of-words model provided by an embodiment of the present application.
  • the long text classification device can be configured in a server or a terminal to execute the aforementioned long text classification method based on the bag-of-words model.
  • the long text classification device based on the bag-of-words model includes:
  • the long text obtaining module 110 is used to obtain the long text to be classified.
  • the filtering module 120 is configured to filter noisy characters in the long text according to a preset rule.
  • the filtering module 120 includes:
  • the stop word acquisition sub-module 121 is used to acquire a preset stop word database, and the stop word database includes several stop words.
  • the filtering sub-module 122 is configured to delete the stop word in the long text or replace it with a preset symbol if the stop word is found in the long text.
  • the first extraction module 130 is configured to extract a first bag-of-words feature vector from the long text from which the noise characters are filtered out based on the first bag-of-words model.
  • the dictionary of the first bag-of-words model includes several words.
  • the second extraction module 140 is configured to extract a second bag-of-words feature vector from the long text from which the noise characters are filtered out based on the second bag-of-words model, and the dictionary of the second bag-of-words model includes several single words.
  • the second extraction module 140 includes:
  • the classification module 150 is configured to classify the long text to be classified according to the first bag-of-words feature vector and the second bag-of-words feature vector to obtain classification data based on the classification model.
  • the classification module 150 includes:
  • the fusion sub-module 151 is used for fusing the first bag-of-words feature vector and the second bag-of-words feature vector;
  • the classification sub-module 152 is configured to input the fused vector into the trained random forest model to obtain the category of the long text to be classified.
  • the method and device of this application can be used in many general or special computing system environments or configurations.
  • the above method and apparatus may be implemented in the form of a computer program, and the computer program may run on the computer device as shown in FIG. 16.
  • FIG. 16 is a schematic structural diagram of a computer device according to an embodiment of the present application.
  • the computer equipment can be a server or a terminal.
  • the computer device includes a processor, a memory, and a network interface connected through a system bus, where the memory may include a non-volatile storage medium and an internal memory.
  • the non-volatile storage medium can store an operating system and a computer program.
  • the computer program includes program instructions, and when the program instructions are executed, the processor can execute any long text classification method based on the bag of words model.
  • the processor is used to provide computing and control capabilities and support the operation of the entire computer equipment.
  • the internal memory provides an environment for running the computer program in the non-volatile storage medium;
  • when the computer program is executed by the processor, the processor can execute any of the long text classification methods based on the bag-of-words model.
  • the network interface is used for network communication, such as sending assigned tasks.
  • the structure of the computer device is only a block diagram of a part of the structure related to the solution of the application, and does not constitute a limitation on the computer device to which the solution of the application is applied.
  • the specific computer device may include more or fewer components than shown in the figure, combine some components, or have a different arrangement of components.
  • the processor may be a central processing unit (CPU), or another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc.
  • the general-purpose processor may be a microprocessor or the processor may also be any conventional processor.
  • a computer-readable storage medium stores a computer program
  • the computer program includes program instructions
  • the processor executes the program instructions to implement any of the long text classification methods based on the bag-of-words model provided in the embodiments of this application.
  • the computer-readable storage medium may be the internal storage unit of the computer device described in the foregoing embodiment, such as the hard disk or memory of the computer device.
  • the computer-readable storage medium may also be an external storage device of the computer device, such as a plug-in hard disk, a smart media card (SMC), a secure digital (SD) card, or a flash card equipped on the computer device.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

一种基于词袋模型的长文本分类方法、装置、计算机设备及存储介质,包括:基于第一词袋模型,从长文本提取第一词袋特征向量,第一词袋模型的词典包括若干词语;基于第二词袋模型,从长文本提取第二词袋特征向量,第二词袋模型的词典包括若干单个文字;基于分类模型,根据第一词袋特征向量和第二词袋特征向量得到分类数据。

Description

基于词袋模型的长文本分类方法、装置、计算机设备及存储介质
本申请要求于2019年4月4日提交中国专利局、申请号为201910268933.1、发明名称为“基于词袋模型的长文本分类方法、装置、计算机设备及存储介质”的中国专利申请的优先权,其全部内容通过引用结合在本申请中。
技术领域
本申请涉及文本分类技术领域,尤其涉及一种基于词袋模型的长文本分类方法、装置、计算机设备及存储介质。
背景技术
文本分类是自然语言处理的重要应用,也可以说是最基础的应用。常见的文本分类应用有:新闻文本分类、信息检索、情感分析、意图判断等。
目前长文本分类模型主要基于词向量特征以及深度学习模型,虽然此类模型具有较高精度,但是需要较高的计算能力;无法兼具高精度和较低的性能需求,因此限制了一些应用场合,例如移动端的应用。
发明内容
本申请实施例提供一种基于词袋模型的长文本分类方法、装置、计算机设备及存储介质,具有较高的分类准确性且对计算性能的需求较低。
第一方面,本申请提供了一种基于词袋模型的长文本分类方法,所述方法包括:
获取待分类的长文本;
根据预设规则滤除所述长文本中的噪音字符;
基于第一词袋模型,从滤除所述噪音字符的长文本提取第一词袋特征向量,所述第一词袋模型的词典包括若干词语;
基于第二词袋模型,从滤除所述噪音字符的长文本提取第二词袋特征向量,所述第二词袋模型的词典包括若干单个文字;
基于分类模型,根据所述第一词袋特征向量和第二词袋特征向量对所述待分类的长文本进行分类以得到分类数据。
第二方面,本申请提供了一种基于词袋模型的长文本分类装置,所述装置包括:
长文本获取模块,用于获取待分类的长文本;
滤除模块,用于根据预设规则滤除所述长文本中的噪音字符;
第一提取模块,用于基于第一词袋模型,从滤除所述噪音字符的长文本提取第一词袋特征向量,所述第一词袋模型的词典包括若干词语;
第二提取模块,用于基于第二词袋模型,从滤除所述噪音字符的长文本提取第二词袋特征向量,所述第二词袋模型的词典包括若干单个文字;
分类模块,用于基于分类模型,根据所述第一词袋特征向量和第二词袋特征向量对所述待分类的长文本进行分类以得到分类数据。
第三方面,本申请提供了一种计算机设备,所述计算机设备包括存储器和处理器;所述存储器用于存储计算机程序;所述处理器,用于执行所述计算机程序并在执行所述计算机程序时实现上述的基于词袋模型的长文本分类方法。
第四方面,本申请提供了一种计算机可读存储介质,所述计算机可读存储介质存储有计算机程序,若所述计算机程序被处理器执行,实现上述的基于词袋模型的长文本分类方法。
本申请公开了一种基于词袋模型的长文本分类方法、装置、计算机设备及存储介质,通过第一词袋模型提取滤除噪音字符的长文本的词语级特征向量以及通过第二词袋模型提取滤除噪音字符的长文本的文字级特征向量,然后根据词语级特征向量和文字级特征向量对长文本进行分类;分类所依据的信息更丰富,从而分类结果更准确;同时通过词袋模型提取的特征向量数据量较小,对计算能力的要求较低。
附图说明
为了更清楚地说明本申请实施例的技术方案,下面将对实施例描述中所需要使用的附图作简单地介绍,显而易见地,下面描述中的附图是本申请的一些实施例,对于本领域普通技术人员来讲,在不付出创造性劳动的前提下,还可以根据这些附图获得其他的附图。
图1为本申请一实施方式的基于词袋模型的长文本分类方法的流程示意图;
图2为图1中长文本分类方法应用场景的示意图;
图3为图1中滤除噪音字符的子流程示意图;
图4为构建第一词袋模型中词典的子流程示意图;
图5为图1中提取第一词袋特征向量的子流程示意图;
图6为构建第二词袋模型中词典的子流程示意图;
图7为图1中提取第二词袋特征向量的子流程示意图;
图8为图1中对长文本进行分类的子流程示意图;
图9为随机森林训练阶段的子流程示意图;
图10为本申请另一实施方式的基于词袋模型的长文本分类方法的流程示意图;
图11为训练第一降维模型的子流程示意图;
图12为训练第二降维模型的子流程示意图;
图13为图10中对长文本进行分类的子流程示意图;
图14为训练随机森林模型的子流程示意图;
图15为本申请一实施例的基于词袋模型的长文本分类装置的结构示意图;
图16为本申请一实施例提供的一种计算机设备的结构示意图。
具体实施方式
下面将结合本申请实施例中的附图,对本申请实施例中的技术方案进行清楚、完整地描述,显然,所描述的实施例是本申请一部分实施例,而不是全部的实施例。基于本申请中的实施例,本领域普通技术人员在没有做出创造性劳动前提下所获得的所有其他实施例,都属于本申请保护的范围。
附图中所示的流程图仅是示例说明,不是必须包括所有的内容和操作/步骤,也不是必须按所描述的顺序执行。例如,有的操作/步骤还可以分解、组合或部分合并,因此实际执行的顺序有可能根据实际情况改变。另外,虽然在装置示意图中进行了功能模块的划分,但是在某些情况下,可以以不同于装置示意图中的模块划分。
本申请的实施例提供了一种基于词袋模型的长文本分类方法、装置、设备及存储介质。其中,该长文本分类方法可以应用于终端或服务器中,以用于新闻文本分类、信息检索、情感分析、意图判断等。
例如,基于词袋模型的长文本分类方法用于服务器,当然可以用于终端,比如手机、笔记本等。但为了便于理解,以下实施例将以应用于服务器的长文 本分类方法进行详细介绍。
下面结合附图,对本申请的一些实施方式作详细说明。在不冲突的情况下,下述的实施例及实施例中的特征可以相互组合。
请参阅图1,图1是本申请的实施例提供的一种基于词袋模型的长文本分类方法的流程示意图。
如图1所示,基于词袋模型的长文本分类方法包括以下步骤:
步骤S110、获取待分类的长文本。
在一些可选的实施例中,待分类的长文本为用于实现基于词袋模型的长文本分类方法的装置在本地存储的文本、该装置从网络获取的文本、该装置从与其连接的输入装置获取的文本、该装置从其他电子设备获取的文本、该装置根据语音信息转成的文本等。
如图2所示,服务器从终端获取待分类的长文本,服务器与终端均连接于互联网,用户在终端输入长文本后,终端将该长文本传输至服务器。
步骤S120、根据预设规则滤除所述长文本中的噪音字符。
在一些实施例中,根据预设规则滤除长文本中的特殊符号、非中文字符等噪音字符。
在一些可选的实施例中,如图3所示,步骤S120根据预设规则滤除所述长文本中的噪音字符,具体包括:
步骤S121、获取预设的停用词库。
其中,所述停用词库包括若干停用词。
具体的,可以根据应用场景需要规定一些特殊符号、非中文字符等噪音字符为停用词,以构建停用词库,以配置文件的形式保存起来。服务器在执行步骤S121时调取与应用场景相关的停用词库。
停用词例如可以为:标点符号、“的”、“得”等等,这些词汇可以看作无效词,会以噪音的形式影响后续运算,需要去除。
步骤S122、若在所述长文本中查找到所述停用词,将所述长文本中的所述停用词删除或者以预设符号替换。
在一些实施例中,分别查找停用词库中的各停用词是否在长文本中出现,若出现则删除长文本中的停用词;在另一些实施例中,分别查找停用词库中的各停用词是否在长文本中出现,若出现则将长文本中的停用词替换为预设符号,如空格等,以在一定程度上保留长文本的结构。
步骤S130、基于第一词袋模型,从滤除所述噪音字符的长文本提取第一词袋特征向量。
词袋(Bag-of-words,BOW)是描述文档中单词元素出现的文本的一种表示形式。词袋模型是用机器学习算法对文本进行建模时表示文本数据的方法。它涉及两件方面:已知单词的集合、测试已知单词的存在。
词袋模型把一段文本划分成一个个词,想象成将所有词放入一个袋子里,忽略其词序、语法、句法等要素,将其仅仅看作是若干个词汇的集合,文本中每个词的出现都是独立的,不依赖于其他词是否出现。
在本实施例中,所述第一词袋模型的词典包括若干词语。
示例性的,第一词袋模型的词典中的词语可以为单纯词或合成词。其中单纯词整个词只能表示一个意思,不能拆开;如单音节单纯词人、鸟、山、高、绿、走、飞等,又如双音节单纯词仿佛、忐忑、玲珑、腼腆、蜻蜓、徘徊、蝙蝠、葡萄、沙发、扑通、布谷、高尔夫、丁零当啷等。合成词是由几个语素组成的,拆开来仍旧有意义;如长短、开放、雕塑、快餐、特区、电脑、招标、 投资、牵头、扩大、延长、布匹、案件、房间、国庆、夏至、河流、老师、阿姨、第一、刚刚,又如忍俊不禁、回味无穷等动补短语,又如举世无双、心痛如割、险象跌生、问心无愧、玲珑剔透、热烈欢迎等。
在一些可选的实施例中,如图4所示,第一词袋模型中词典的构建流程包括以下步骤:
S11、获取训练数据。
其中,训练数据包括若干条样本长文本。
具体的,样本长文本与待分类的长文本具有一些通性,即与长文本分类方法的应用场景相关;例如来源相同、场景相同、用途相同等,例如均来源于新闻文本。
具体的,训练数据中的样本长文本也去除了噪音字符。
S12、根据训练数据中的样本长文本获取第一词袋模型的词典中的词语。
示例性的,训练数据包括两条样本长文本,分别为小明喜欢看电影、小明也喜欢踢足球。根据这两条样本长文本可以构建出第一词袋模型的词典{1:“小明”,2:“喜欢”,3:“看”,4:“电影”5:“也”,6:“踢”,7:“足球”}。按照该词典中词语排列的顺序,各词语对应于各自的索引序号。
在一些可选的实施例中,如图5所示,步骤S130基于第一词袋模型,从滤除所述噪音字符的长文本提取第一词袋特征向量,具体包括:
步骤S131、初始化全零的第一词袋特征向量。
其中,所述第一词袋特征向量中的元素与所述第一词袋模型的词典中的词语一一对应。
示例性的,根据第一词袋模型的词典{1:“小明”,2:“喜欢”,3:“看”,4:“电影”5:“也”,6:“踢”,7:“足球”}初始化全零的第一词袋特征向量为[0,0,0,0,0,0,0]。
步骤S132、统计各所述词语在滤除所述噪音字符的长文本中出现的次数。
步骤S133、根据所述词语在所述长文本中出现的次数对所述第一词袋特征向量中对应的元素赋值。
示例性的,如果去除噪音字符的长文本为“小明喜欢看电影”,则第一词袋特征向量为[1,1,1,1,0,0,0]。如果去除噪音字符的长文本为“小明喜欢看电影小明也喜欢踢足球”,则第一词袋特征向量为[2,2,1,1,1,1,1]。
步骤S140、基于第二词袋模型,从滤除所述噪音字符的长文本提取第二词袋特征向量。
其中,所述第二词袋模型的词典包括若干单个文字。
在一些可选的实施例中,如图6所示,第二词袋模型中词典的构建流程包括以下步骤:
S21、获取训练数据。
其中训练数据包括若干条样本长文本。
具体的,样本长文本与待分类的长文本具有一些通性,即与长文本分类方法的应用场景相关;例如来源相同、场景相同、用途相同等,例如均来源于新闻文本。
具体的,训练数据中的样本长文本也去除了噪音字符。
S22、根据训练数据中的样本长文本获取第二词袋模型的词典中的单个文字。
示例性的,训练数据包括两条样本长文本,分别为小明喜欢看电影、小明也喜欢踢足球。根据这两条样本长文本可以构建出第二词袋模型的词典{1:“小”,2:“明”,3:“喜”,4:“欢”,5:“看”,6:“电”,7:“影”,8:“也”,9:“踢”, 10:“足”,11:“球”}。按照该词典中单个文字排列的顺序,各文字对应于各自的索引序号。
在一些可选的实施例中,如图7所示,步骤S140基于第二词袋模型,从滤除所述噪音字符的长文本提取第二词袋特征向量,具体包括:
步骤S141、初始化全零的第二词袋特征向量。
其中,所述第二词袋特征向量中的元素与所述第二词袋模型的词典中的单个文字一一对应。
示例性的,根据第二词袋模型的词典{1:“小”,2:“明”,3:“喜”,4:“欢”,5:“看”,6:“电”,7:“影”,8:“也”,9:“踢”,10:“足”,11:“球”}初始化一个11维的全零向量作为初始化的第二词袋特征向量。
步骤S142、统计各所述单个文字在滤除所述噪音字符的长文本中出现的次数。
步骤S143、根据所述单个文字在所述长文本中出现的次数对所述第二词袋特征向量中对应的元素赋值。
示例性的,如果去除噪音字符的长文本为“小明喜欢看电影”,则第二词袋特征向量为[1,1,1,1,1,1,1,0,0,0,0]。如果去除噪音字符的长文本为“小明喜欢看电影小明也喜欢踢足球”,则第二词袋特征向量为[2,2,2,2,1,1,1,1,1,1,1]。
步骤S150、基于分类模型,根据所述第一词袋特征向量和第二词袋特征向量对所述待分类的长文本进行分类以得到分类数据。
基于词袋模型从长文本提取的词袋特征向量可以表征待分类的长文本的一些特点,根据第一词袋特征向量和第二词袋特征向量可以对待分类的长文本进行分类;示例性的,将这一待分类的长文本附上分类得到的类别标记,以例如将长文本分类为社会、娱乐、经济或者考古等类别。
根据第一词袋特征向量表征的待分类文本中词语的特点,以及第二词袋特征向量表征待的分类文本中文字的特点,对待分类的长文本进行分类;分类依据的信息更丰富,从而分类结果更准确。
在一些可选的实施例中,如图8所示,步骤S150基于分类模型,根据所述第一词袋特征向量和第二词袋特征向量对所述待分类的长文本进行分类以得到分类数据,具体包括:
步骤S151、将所述第一词袋特征向量和所述第二词袋特征向量融合。
示例性的,将一维的第一词袋特征向量和一维的第二词袋特征向量拼接为一个一维的融合向量。
步骤S152、将融合后的向量输入训练好的随机森林模型,以获取所述待分类的长文本的类别。
融合后的向量既可以表征待分类文本中词语的特点,也可以表征待分类文本中文字的特点,从而表达的信息更丰富。根据融合后的向量对待分类的长文本进行分类,分类结果更准确。
在一些可选的实施例中,根据融合后的向量获取所述待分类的长文本的类别可以通过多种分类模型实现,例如人工神经网络模型、KNN算法模型、支持向量机SVM算法模型、决策树算法模型等。
在本实施例中,基于随机森林模型对所述待分类的长文本进行分类以获取所述待分类的长文本的类别。
随机森林模型包括若干个决策树,随机森林的每一棵决策树之间是没有关联的。当将融合后的向量输入训练好的随机森林模型的时候,就让随机森林模 型中的每一棵决策树分别进行判断分类;然后看看哪一类被选择最多,就预测这个向量相应的长文本为哪一类别。
随机森林属于非传统的机器学习算法,由多颗决策树组成,每棵决策树处理的是一个训练样本子集。例如,随机森林模型的训练样本包括多个样本长文本,各样本文件标注了所属类型;各样本长文本经过提取第一词袋特征向量、第二词袋特征向量,以及向量融合后得到多个训练向量;然后多次有放回的从所述多个训练向量中取部分训练向量组成多个训练样本子集;之后根据各训练样本子集训练各各自对应的决策树。具体的,随机森林模型的建立和训练,可以通过Python中的sklearn库实现。
在训练阶段,通过决策树的节点分裂来筛选特征,对训练向量进行层层细分,直至将每个训练样本子集分类正确。在测试阶段,直接基于训练出的训练向量进行样本分类。
在一些实施例中,如图9所示,随机森林训练阶段的流程具体为:
步骤S31、从N个标注了类型的训练向量中有放回的随机选择n个训练向量,将所选择的n个训练向量作为一个决策树根节点处的样本。
其中,n为不大于N的自然数;有放回指的是每次随机选择一个样本,然后返回继续选择;所选择的n个训练向量用来训练一个决策树,作为决策树根节点处的样本。
步骤S32、随机从训练向量的M个属性中选取出m个属性,并根据预设策略从m个属性中选择一个作为所示决策树相应节点的分裂属性。
示例性的,假设当训练向量的维数为M,在决策树的每个节点需要分裂时,随机从这M个属性中选取出m个属性,满足条件m<<M;然后从这m个属性中采用某种策略,如信息增益来选择1个属性作为该节点的分裂属性。
步骤S33、对所述决策树的相应节点进行分裂,直至不能够再分裂为止,以建立所示决策树。
所谓不能再分裂,就是全部到达叶子节点,即收敛;如果下一次该节点选出来的那一个属性是刚刚其父节点分裂时用过的属性,则该节点就是叶子节点。
步骤S34、建立预设数量的决策树,以构成随机森林。
随机森林的随机性体现在每颗决策树的训练样本是随机的,各决策树中每个节点的分裂属性集合也是随机选择确定的。有了这两个随机的保证,可以降低随机森林产生过拟合现象的概率,从而提高分类的准确率。
在一些可选的实施方式中,如图10所示,在步骤S150基于分类模型,根据所述第一词袋特征向量和第二词袋特征向量对所述待分类的长文本进行分类以得到分类数据之前,还包括:
步骤S101、基于第一降维模型对所述第一词袋特征向量进行降维。
在一些应用场景中,第一词袋特征向量的规模可能很大,可以通过降维在保留第一词袋特征向量大部分信息的基础上降低特征向量的规模,以减少计算量。
在一些实施例中,如图11所示,所述第一降维模型的训练方法包括:
步骤S41、获取初始的第一降维模型。
其中,所述第一降维模型包括第一编码层和第一解码层,所述第一编码层输入的维数、所述第一解码层输出的维数均等于所述第一词袋模型的词典中词语的数目,所述第一编码层的输出作为所述第一解码层的输入。
步骤S42、将第一降维训练样本输入所述第一降维模型,以获取所述第一解码层的输出向量。
第一编码层通过编码处理获取第一降维训练样本的隐含特征,并降低第一降维训练样本的维数;第一解码层通过解码将隐含特征还原。
步骤S43、根据所述第一解码层的输出向量和所述第一降维训练样本之间的损失调整所述第一降维模型的参数。
具体的,第一降维模型的训练目标是使得输入向量与输出向量的差异最小化,以保证第一编码层输出的隐含特征保留原始输入信息而维度降低。示例性的,训练损失函数为均方误差(mean squared error,MSE),优化算法为适应性矩估计(adaptive moment estimation)ADAM优化算法。ADAM优化算法是一种可以替代传统随机梯度下降过程的一阶优化算法,它能基于训练数据迭代地更新神经网络权重。
步骤S44、若所述输出向量和所述第一降维训练样本之间的损失满足预设的训练条件,将所述第一编码层的输出作为所述一降维模型的输出。
示例性的,输入向量与输出向量的差异小于预设阈值时损失满足预设的训练条件,实现训练目标。第一编码层的输出可以作为输入的向量降维后的向量。
步骤S102、基于第二降维模型对所述第二词袋特征向量进行降维。
在一些应用场景中,第二词袋特征向量的规模可能很大,可以通过降维在保留第二词袋特征向量大部分信息的基础上降低特征向量的规模,以减少计算量。
在一些实施例中,如图12所示,所述第二降维模型的训练方法包括:
步骤S51、获取初始的第二降维模型。
其中,所述第二降维模型包括第二编码层和第二解码层;其中所述第二编码层输入的维数、所述第二解码层输出的维数均等于所述第二词袋模型的词典中单个文字的数目,所述第二编码层的输出作为所述第二解码层的输入。
步骤S52、将第二降维训练样本输入所述第二降维模型,以获取所述第二解码层的输出向量。
第二编码层通过编码处理获取第二降维训练样本的隐含特征,并降低第二降维训练样本的维数;第二解码层通过解码将隐含特征还原。
步骤S53、根据所述第二解码层的输出向量和所述第二降维训练样本之间的损失调整所述第二降维模型的参数。
具体的,第二降维模型的训练目标是使得输入向量与输出向量的差异最小化,以保证第二编码层输出的隐含特征保留原始输入信息而维度降低。
步骤S54、若所述输出向量和所述第二降维训练样本之间的损失满足预设的训练条件,将所述第二编码层的输出作为所述二降维模型的输出。
示例性的,输入向量与输出向量的差异小于预设阈值时损失满足预设的训练条件,实现训练目标。第二编码层的输出可以作为输入的向量降维后的向量。
在一些可选的实施例中,第一降维模型、第二降维模型的建立与训练可以通过Python中的tensorflow库实现。
在本实施例中,步骤S150基于分类模型,根据所述第一词袋特征向量和第二词袋特征向量对所述待分类的长文本进行分类以得到分类数据,具体包括:
步骤S153、基于分类模型,根据降维后的第一词袋特征向量和降维后的第二词袋特征向量对所述待分类的长文本进行分类以得到分类数据。
基于词袋模型从长文本提取的词袋特征向量可以表征待分类的长文本的一些特点,在第一词袋特征向量、第二词袋特征分贝降维后仍可保存大部分信息,因此可以据降维后的第一词袋特征向量和降维后的第二词袋特征向量对所述待分类的长文本进行分类以得到分类数据,例如将长文本分类为社会、娱乐、经 济或者考古等类别。
具体的,如图13所示,步骤S153基于分类模型,根据降维后的第一词袋特征向量和降维后的第二词袋特征向量对所述待分类的长文本进行分类以得到分类数据,具体包括:
步骤S1531、将降维后的第一词袋特征向量和降维后的第二词袋特征向量融合。
示例性的,将降维后的第一词袋特征向量和降维后的第二词袋特征向量拼接为一个一维的融合向量。
步骤S1532、将融合后的向量输入训练好的随机森林模型,以获取所述待分类的长文本的类别。
融合后的向量既可以表征待分类文本中词语的特点,也可以表征待分类文本中文字的特点,从而表达的信息更丰富。根据融合后的向量对待分类的长文本进行分类,分类结果更准确。
在一些可选的实施例中,如图14所示,随机森林模型的训练方法包括:
步骤S61、获取样本长文本和与所述样本长文本对应的分类标记。
具体的,样本长文本与待分类的长文本具有一些通性,即与长文本分类方法的应用场景相关;例如来源相同、场景相同、用途相同等,例如均来源于新闻文本。各样本长文本分别对应于相应的分类标记,如社会、娱乐、经济或者考古等。
步骤S62、根据预设规则滤除所述样本长文本中的噪音字符。
具体的,先获取预设的停用词库,所述停用词库包括若干停用词;若在所述长文本中查找到所述停用词,将所述长文本中的所述停用词删除或者以预设符号替换。
步骤S63、基于所述第一词袋模型,从滤除所述噪音字符的样本长文本提取第一样本特征向量。
所述第一词袋模型的词典包括若干词语。
步骤S64、基于所述第二词袋模型,从滤除所述噪音字符的样本长文本提取第二样本特征向量。
所述第二词袋模型的词典包括若干单个文字。
步骤S65、基于所述第一降维模型对所述第一样本特征向量进行降维,以及基于所述第二降维模型对所述第二样本特征向量进行降维。
在一些应用场景中,第一样本特征向量、第二样本特征向量的规模可能很大,可以通过降维在保留第一样本特征向量、第二样本特征向量大部分信息的基础上降低特征向量的规模,以减少计算量。
步骤S66、将降维后的第一样本特征向量、第二样本特征向量组合为与所述分类标记对应的训练向量。
示例性的,将降维后的第一样本特征向量和降维后的第二样本特征向量拼接为一个一维的融合向量。
步骤S67、根据若干所述训练向量和与各所述训练向量对应的分类标记对所述随机森林模型进行训练。
具体的,根据前述步骤S31-步骤S34对所述随机森林模型进行训练。
上述实施例提供的基于词袋模型的长文本分类方法,通过第一词袋模型提取滤除噪音字符的长文本的词语级特征向量以及通过第二词袋模型提取滤除噪音字符的长文本的文字级特征向量,然后根据词语级特征向量和文字级特征向量对长文本进行分类;分类所依据的信息更丰富,从而分类结果更准确;同时 通过词袋模型提取的特征向量数据量较小,对计算能力的要求较低。
请参阅图15,图15是本申请一实施例提供的一种基于词袋模型的长文本分类装置的结构示意图,该长文本分类装置可以配置于服务器或终端中,用于执行前述的基于词袋模型的长文本分类方法。
如图15所示,该基于词袋模型的长文本分类装置,包括:
长文本获取模块110,用于获取待分类的长文本。
滤除模块120,用于根据预设规则滤除所述长文本中的噪音字符。
在一些实施例中,如图16所示,滤除模块120包括:
停用词获取子模块121,用于获取预设的停用词库,所述停用词库包括若干停用词。
滤除子模块122,用于若在所述长文本中查找到所述停用词,将所述长文本中的所述停用词删除或者以预设符号替换。
第一提取模块130,用于基于第一词袋模型,从滤除所述噪音字符的长文本提取第一词袋特征向量,所述第一词袋模型的词典包括若干词语。
第二提取模块140,用于基于第二词袋模型,从滤除所述噪音字符的长文本提取第二词袋特征向量,所述第二词袋模型的词典包括若干单个文字。
在一些实施例中,第二提取模块140包括:
分类模块150,用于基于分类模型,根据所述第一词袋特征向量和第二词袋特征向量对所述待分类的长文本进行分类以得到分类数据。
在一些实施例中,分类模块150包括:
融合子模块151,用于将所述第一词袋特征向量和所述第二词袋特征向量融合;
分类子模块152,用于将融合后的向量输入训练好的随机森林模型,以获取所述待分类的长文本的类别。
需要说明的是,所属领域的技术人员可以清楚地了解到,为了描述的方便和简洁,上述描述的装置和各模块、单元的具体工作过程,可以参考前述方法实施例中的对应过程,在此不再赘述。
本申请的方法、装置可用于众多通用或专用的计算系统环境或配置中。例如:个人计算机、服务器计算机、手持设备或便携式设备、平板型设备、多处理器系统、基于微处理器的系统、机顶盒、可编程的消费电子设备、网络PC、小型计算机、大型计算机、包括以上任何系统或设备的分布式计算环境等等。
示例性的,上述的方法、装置可以实现为一种计算机程序的形式,该计算机程序可以在如图16所示的计算机设备上运行。
请参阅图16,图16是本申请实施例提供的一种计算机设备的结构示意图。该计算机设备可以是服务器或终端。
参阅图16,该计算机设备包括通过系统总线连接的处理器、存储器和网络接口,其中,存储器可以包括非易失性存储介质和内存储器。
非易失性存储介质可存储操作系统和计算机程序。该计算机程序包括程序指令,该程序指令被执行时,可使得处理器执行任意一种基于词袋模型的长文本分类方法。
处理器用于提供计算和控制能力,支撑整个计算机设备的运行。
内存储器为非易失性存储介质中的计算机程序的运行提供环境,该计算机程序被处理器执行时,可使得处理器执行任意一种基于词袋模型的长文本分类方法。
该网络接口用于进行网络通信,如发送分配的任务等。本领域技术人员可 以理解,该计算机设备的结构,仅仅是与本申请方案相关的部分结构的框图,并不构成对本申请方案所应用于其上的计算机设备的限定,具体的计算机设备可以包括比图中所示更多或更少的部件,或者组合某些部件,或者具有不同的部件布置。
应当理解的是,处理器可以是中央处理单元(Central Processing Unit,CPU),该处理器还可以是其他通用处理器、数字信号处理器(Digital Signal Processor,DSP)、专用集成电路(Application Specific Integrated Circuit,ASIC)、现场可编程门阵列(Field-Programmable Gate Array,FPGA)或者其他可编程逻辑器件、分立门或者晶体管逻辑器件、分立硬件组件等。其中,通用处理器可以是微处理器或者该处理器也可以是任何常规的处理器等。
通过以上的实施方式的描述可知,本领域的技术人员可以清楚地了解到本申请可借助软件加必需的通用硬件平台的方式来实现。基于这样的理解,本申请的技术方案本质上或者说对现有技术做出贡献的部分可以以软件产品的形式体现出来,该计算机软件产品可以存储在存储介质中,如ROM/RAM、磁碟、光盘等,包括若干指令用以使得一台计算机设备(可以是个人计算机,服务器,或者网络设备等)执行本申请各个实施例或者实施例的某些部分所述的方法,如:
一种计算机可读存储介质,所述计算机可读存储介质存储有计算机程序,所述计算机程序中包括程序指令,所述处理器执行所述程序指令,实现本申请实施例提供的任一项基于词袋模型的长文本分类方法。
其中,所述计算机可读存储介质可以是前述实施例所述的计算机设备的内部存储单元,例如所述计算机设备的硬盘或内存。所述计算机可读存储介质也可以是所述计算机设备的外部存储设备,例如所述计算机设备上配备的插接式硬盘,智能存储卡(Smart Media Card,SMC),安全数字(Secure Digital,SD)卡,闪存卡(Flash Card)等。
以上所述,仅为本申请的具体实施方式,但本申请的保护范围并不局限于此,任何熟悉本技术领域的技术人员在本申请揭露的技术范围内,可轻易想到各种等效的修改或替换,这些修改或替换都应涵盖在本申请的保护范围之内。因此,本申请的保护范围应以权利要求的保护范围为准。

Claims (20)

  1. 一种基于词袋模型的长文本分类方法,其包括:
    获取待分类的长文本;
    根据预设规则滤除所述长文本中的噪音字符;
    基于第一词袋模型,从滤除所述噪音字符的长文本提取第一词袋特征向量,所述第一词袋模型的词典包括若干词语;
    基于第二词袋模型,从滤除所述噪音字符的长文本提取第二词袋特征向量,所述第二词袋模型的词典包括若干单个文字;
    基于分类模型,根据所述第一词袋特征向量和第二词袋特征向量对所述待分类的长文本进行分类以得到分类数据;
    其中,所述基于分类模型,根据所述第一词袋特征向量和第二词袋特征向量对所述待分类的长文本进行分类以得到分类数据,具体包括:
    将所述第一词袋特征向量和所述第二词袋特征向量融合;
    将融合后的向量输入训练好的随机森林模型,以获取所述待分类的长文本的类别。
  2. 如权利要求1所述的长文本分类方法,其中,所述根据预设规则滤除所述长文本中的噪音字符,具体包括:
    获取预设的停用词库,所述停用词库包括若干停用词;
    若在所述长文本中查找到所述停用词,将所述长文本中的所述停用词删除或者以预设符号替换。
  3. 如权利要求1所述的长文本分类方法,其中,所述从滤除所述噪音字符的长文本提取第一词袋特征向量,具体包括:
    初始化全零的第一词袋特征向量,所述第一词袋特征向量中的元素与所述第一词袋模型的词典中的词语一一对应;
    统计各所述词语在滤除所述噪音字符的长文本中出现的次数;
    根据所述词语在所述长文本中出现的次数对所述第一词袋特征向量中对应的元素赋值;
    所述从滤除所述噪音字符的长文本提取第二词袋特征向量,具体包括:
    初始化全零的第二词袋特征向量,所述第二词袋特征向量中的元素与所述第二词袋模型的词典中的单个文字一一对应;
    统计各所述单个文字在滤除所述噪音字符的长文本中出现的次数;
    根据所述单个文字在所述长文本中出现的次数对所述第二词袋特征向量中对应的元素赋值。
  4. 如权利要求3所述的长文本分类方法,其中,在所述基于分类模型,根据所述第一词袋特征向量和第二词袋特征向量对所述待分类的长文本进行分类以得到分类数据之前,还包括:
    基于第一降维模型对所述第一词袋特征向量进行降维;
    基于第二降维模型对所述第二词袋特征向量进行降维;
    所述基于分类模型,根据所述第一词袋特征向量和第二词袋特征向量对所述待分类的长文本进行分类以得到分类数据,具体包括:
    基于分类模型,根据降维后的第一词袋特征向量和降维后的第二词袋特征向量对所述待分类的长文本进行分类以得到分类数据。
  5. 如权利要求4所述的长文本分类方法,其中,所述第一降维模型的训练包括:
    获取初始的第一降维模型,所述第一降维模型包括第一编码层和第一解码层;其中所述第一编码层输入的维数、所述第一解码层输出的维数均等于所述 第一词袋模型的词典中词语的数目,所述第一编码层的输出作为所述第一解码层的输入;
    将第一降维训练样本输入所述第一降维模型,以获取所述第一解码层的输出向量;
    根据所述第一解码层的输出向量和所述第一降维训练样本之间的损失调整所述第一降维模型的参数;
    若所述输出向量和所述第一降维训练样本之间的损失满足预设的训练条件,将所述第一编码层的输出作为所述一降维模型的输出;
    所述第二降维模型的训练包括:
    获取初始的第二降维模型,所述第二降维模型包括第二编码层和第二解码层;其中所述第二编码层输入的维数、所述第二解码层输出的维数均等于所述第二词袋模型的词典中单个文字的数目,所述第二编码层的输出作为所述第二解码层的输入;
    将第二降维训练样本输入所述第二降维模型,以获取所述第二解码层的输出向量;
    根据所述第二解码层的输出向量和所述第二降维训练样本之间的损失调整所述第二降维模型的参数;
    若所述输出向量和所述第二降维训练样本之间的损失满足预设的训练条件,将所述第二编码层的输出作为所述二降维模型的输出。
  6. 如权利要求4所述的长文本分类方法,其中,所述随机森林模型的训练包括:
    获取样本长文本和与所述样本长文本对应的分类标记;
    根据预设规则滤除所述样本长文本中的噪音字符;
    基于所述第一词袋模型,从滤除所述噪音字符的样本长文本提取第一样本特征向量;
    基于所述第二词袋模型,从滤除所述噪音字符的样本长文本提取第二样本特征向量;
    基于所述第一降维模型对所述第一样本特征向量进行降维,以及基于所述第二降维模型对所述第二样本特征向量进行降维;
    将降维后的第一样本特征向量、降维后的第二样本特征向量组合为与所述分类标记对应的训练向量;
    根据若干所述训练向量和与各所述训练向量对应的分类标记对所述随机森林模型进行训练。
  7. 一种基于词袋模型的长文本分类装置,其中,包括:
    长文本获取模块,用于获取待分类的长文本;
    滤除模块,用于根据预设规则滤除所述长文本中的噪音字符;
    第一提取模块,用于基于第一词袋模型,从滤除所述噪音字符的长文本提取第一词袋特征向量,所述第一词袋模型的词典包括若干词语;
    第二提取模块,用于基于第二词袋模型,从滤除所述噪音字符的长文本提取第二词袋特征向量,所述第二词袋模型的词典包括若干单个文字;
    分类模块,用于基于分类模型,根据所述第一词袋特征向量和第二词袋特征向量对所述待分类的长文本进行分类以得到分类数据;
    其中,所述分类模块包括:
    融合子模块,用于将所述第一词袋特征向量和所述第二词袋特征向量融合;
    分类子模块,用于将融合后的向量输入训练好的随机森林模型,以获取所 述待分类的长文本的类别。
  8. 如权利要求7所述的长文本分类装置,其中,所述滤除模块包括:
    停用词获取子模块,用于获取预设的停用词库,所述停用词库包括若干停用词;
    滤除子模块,用于若在所述长文本中查找到所述停用词,将所述长文本中的所述停用词删除或者以预设符号替换。
  9. 一种计算机设备,其中,所述计算机设备包括存储器和处理器;
    所述存储器用于存储计算机程序;
    所述处理器,用于执行所述计算机程序并在执行所述计算机程序时实现:
    获取待分类的长文本;
    根据预设规则滤除所述长文本中的噪音字符;
    基于第一词袋模型,从滤除所述噪音字符的长文本提取第一词袋特征向量,所述第一词袋模型的词典包括若干词语;
    基于第二词袋模型,从滤除所述噪音字符的长文本提取第二词袋特征向量,所述第二词袋模型的词典包括若干单个文字;
    基于分类模型,根据所述第一词袋特征向量和第二词袋特征向量对所述待分类的长文本进行分类以得到分类数据;
    其中,所述基于分类模型,根据所述第一词袋特征向量和第二词袋特征向量对所述待分类的长文本进行分类以得到分类数据,具体包括:
    将所述第一词袋特征向量和所述第二词袋特征向量融合;
    将融合后的向量输入训练好的随机森林模型,以获取所述待分类的长文本的类别。
  10. 如权利要求9所述的计算机设备,其中,所述处理器实现所述根据预设规则滤除所述长文本中的噪音字符时,用于实现:
    获取预设的停用词库,所述停用词库包括若干停用词;
    若在所述长文本中查找到所述停用词,将所述长文本中的所述停用词删除或者以预设符号替换。
  11. 如权利要求9所述的计算机设备,其中,所述处理器实现所述从滤除所述噪音字符的长文本提取第一词袋特征向量时,用于实现:
    初始化全零的第一词袋特征向量,所述第一词袋特征向量中的元素与所述第一词袋模型的词典中的词语一一对应;
    统计各所述词语在滤除所述噪音字符的长文本中出现的次数;
    根据所述词语在所述长文本中出现的次数对所述第一词袋特征向量中对应的元素赋值;
    所述从滤除所述噪音字符的长文本提取第二词袋特征向量时,用于实现:
    初始化全零的第二词袋特征向量,所述第二词袋特征向量中的元素与所述第二词袋模型的词典中的单个文字一一对应;
    统计各所述单个文字在滤除所述噪音字符的长文本中出现的次数;
    根据所述单个文字在所述长文本中出现的次数对所述第二词袋特征向量中对应的元素赋值。
  12. 如权利要求11所述的计算机设备,其中,所述处理器实现所述基于分类模型,根据所述第一词袋特征向量和第二词袋特征向量对所述待分类的长文本进行分类以得到分类数据之前,还用于实现:
    基于第一降维模型对所述第一词袋特征向量进行降维;
    基于第二降维模型对所述第二词袋特征向量进行降维;
    所述基于分类模型,根据所述第一词袋特征向量和第二词袋特征向量对所述待分类的长文本进行分类以得到分类数据时,用于实现:
    基于分类模型,根据降维后的第一词袋特征向量和降维后的第二词袋特征向量对所述待分类的长文本进行分类以得到分类数据。
  13. 如权利要求12所述的计算机设备,其中,所述处理器实现所述第一降维模型的训练时,用于实现:
    获取初始的第一降维模型,所述第一降维模型包括第一编码层和第一解码层;其中所述第一编码层输入的维数、所述第一解码层输出的维数均等于所述第一词袋模型的词典中词语的数目,所述第一编码层的输出作为所述第一解码层的输入;
    将第一降维训练样本输入所述第一降维模型,以获取所述第一解码层的输出向量;
    根据所述第一解码层的输出向量和所述第一降维训练样本之间的损失调整所述第一降维模型的参数;
    若所述输出向量和所述第一降维训练样本之间的损失满足预设的训练条件,将所述第一编码层的输出作为所述一降维模型的输出;
    所述第二降维模型的训练包括:
    获取初始的第二降维模型,所述第二降维模型包括第二编码层和第二解码层;其中所述第二编码层输入的维数、所述第二解码层输出的维数均等于所述第二词袋模型的词典中单个文字的数目,所述第二编码层的输出作为所述第二解码层的输入;
    将第二降维训练样本输入所述第二降维模型,以获取所述第二解码层的输出向量;
    根据所述第二解码层的输出向量和所述第二降维训练样本之间的损失调整所述第二降维模型的参数;
    若所述输出向量和所述第二降维训练样本之间的损失满足预设的训练条件,将所述第二编码层的输出作为所述二降维模型的输出。
  14. 如权利要求12所述的计算机设备,其中,所述处理器实现所述随机森林模型的训练时,实现:
    获取样本长文本和与所述样本长文本对应的分类标记;
    根据预设规则滤除所述样本长文本中的噪音字符;
    基于所述第一词袋模型,从滤除所述噪音字符的样本长文本提取第一样本特征向量;
    基于所述第二词袋模型,从滤除所述噪音字符的样本长文本提取第二样本特征向量;
    基于所述第一降维模型对所述第一样本特征向量进行降维,以及基于所述第二降维模型对所述第二样本特征向量进行降维;
    将降维后的第一样本特征向量、降维后的第二样本特征向量组合为与所述分类标记对应的训练向量;
    根据若干所述训练向量和与各所述训练向量对应的分类标记对所述随机森林模型进行训练。
  15. 一种计算机可读存储介质,所述计算机可读存储介质存储有计算机程序,其中:若所述计算机程序被处理器执行,实现:
    获取待分类的长文本;
    根据预设规则滤除所述长文本中的噪音字符;
    基于第一词袋模型,从滤除所述噪音字符的长文本提取第一词袋特征向量,所述第一词袋模型的词典包括若干词语;
    基于第二词袋模型,从滤除所述噪音字符的长文本提取第二词袋特征向量,所述第二词袋模型的词典包括若干单个文字;
    基于分类模型,根据所述第一词袋特征向量和第二词袋特征向量对所述待分类的长文本进行分类以得到分类数据;
    其中,所述基于分类模型,根据所述第一词袋特征向量和第二词袋特征向量对所述待分类的长文本进行分类以得到分类数据,具体包括:
    将所述第一词袋特征向量和所述第二词袋特征向量融合;
    将融合后的向量输入训练好的随机森林模型,以获取所述待分类的长文本的类别。
  16. 如权利要求15所述的存储介质,其中,所述处理器实现所述根据预设规则滤除所述长文本中的噪音字符时,用于实现:
    获取预设的停用词库,所述停用词库包括若干停用词;
    若在所述长文本中查找到所述停用词,将所述长文本中的所述停用词删除或者以预设符号替换。
  17. 如权利要求15所述的存储介质,其中,所述处理器实现所述从滤除所述噪音字符的长文本提取第一词袋特征向量时,用于实现:
    初始化全零的第一词袋特征向量,所述第一词袋特征向量中的元素与所述第一词袋模型的词典中的词语一一对应;
    统计各所述词语在滤除所述噪音字符的长文本中出现的次数;
    根据所述词语在所述长文本中出现的次数对所述第一词袋特征向量中对应的元素赋值;
    所述从滤除所述噪音字符的长文本提取第二词袋特征向量时,用于实现:
    初始化全零的第二词袋特征向量,所述第二词袋特征向量中的元素与所述第二词袋模型的词典中的单个文字一一对应;
    统计各所述单个文字在滤除所述噪音字符的长文本中出现的次数;
    根据所述单个文字在所述长文本中出现的次数对所述第二词袋特征向量中对应的元素赋值。
  18. 如权利要求17所述的存储介质,其中,所述处理器实现所述基于分类模型,根据所述第一词袋特征向量和第二词袋特征向量对所述待分类的长文本进行分类以得到分类数据之前,还用于实现:
    基于第一降维模型对所述第一词袋特征向量进行降维;
    基于第二降维模型对所述第二词袋特征向量进行降维;
    所述基于分类模型,根据所述第一词袋特征向量和第二词袋特征向量对所述待分类的长文本进行分类以得到分类数据时,用于实现:
    基于分类模型,根据降维后的第一词袋特征向量和降维后的第二词袋特征向量对所述待分类的长文本进行分类以得到分类数据。
  19. 如权利要求18所述的存储介质,其中,所述处理器实现所述第一降维模型的训练时,用于实现:
    获取初始的第一降维模型,所述第一降维模型包括第一编码层和第一解码层;其中所述第一编码层输入的维数、所述第一解码层输出的维数均等于所述第一词袋模型的词典中词语的数目,所述第一编码层的输出作为所述第一解码层的输入;
    将第一降维训练样本输入所述第一降维模型,以获取所述第一解码层的输 出向量;
    根据所述第一解码层的输出向量和所述第一降维训练样本之间的损失调整所述第一降维模型的参数;
    若所述输出向量和所述第一降维训练样本之间的损失满足预设的训练条件,将所述第一编码层的输出作为所述一降维模型的输出;
    所述第二降维模型的训练包括:
    获取初始的第二降维模型,所述第二降维模型包括第二编码层和第二解码层;其中所述第二编码层输入的维数、所述第二解码层输出的维数均等于所述第二词袋模型的词典中单个文字的数目,所述第二编码层的输出作为所述第二解码层的输入;
    将第二降维训练样本输入所述第二降维模型,以获取所述第二解码层的输出向量;
    根据所述第二解码层的输出向量和所述第二降维训练样本之间的损失调整所述第二降维模型的参数;
    若所述输出向量和所述第二降维训练样本之间的损失满足预设的训练条件,将所述第二编码层的输出作为所述二降维模型的输出。
  20. 如权利要求17所述的存储介质,其中,所述处理器实现所述随机森林模型的训练时,实现:
    获取样本长文本和与所述样本长文本对应的分类标记;
    根据预设规则滤除所述样本长文本中的噪音字符;
    基于所述第一词袋模型,从滤除所述噪音字符的样本长文本提取第一样本特征向量;
    基于所述第二词袋模型,从滤除所述噪音字符的样本长文本提取第二样本特征向量;
    基于所述第一降维模型对所述第一样本特征向量进行降维,以及基于所述第二降维模型对所述第二样本特征向量进行降维;
    将降维后的第一样本特征向量、降维后的第二样本特征向量组合为与所述分类标记对应的训练向量;
    根据若干所述训练向量和与各所述训练向量对应的分类标记对所述随机森林模型进行训练。
PCT/CN2019/117706 2019-04-04 2019-11-12 基于词袋模型的长文本分类方法、装置、计算机设备及存储介质 WO2020199595A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910268933.1 2019-04-04
CN201910268933.1A CN110096591A (zh) 2019-04-04 2019-04-04 基于词袋模型的长文本分类方法、装置、计算机设备及存储介质

Publications (1)

Publication Number Publication Date
WO2020199595A1 true WO2020199595A1 (zh) 2020-10-08

Family

ID=67444259

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/117706 WO2020199595A1 (zh) 2019-04-04 2019-11-12 基于词袋模型的长文本分类方法、装置、计算机设备及存储介质

Country Status (2)

Country Link
CN (1) CN110096591A (zh)
WO (1) WO2020199595A1 (zh)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110096591A (zh) * 2019-04-04 2019-08-06 平安科技(深圳)有限公司 基于词袋模型的长文本分类方法、装置、计算机设备及存储介质
CN110895557B (zh) * 2019-11-27 2022-06-21 广东智媒云图科技股份有限公司 基于神经网络的文本特征判断方法、装置和存储介质
CN111143551A (zh) * 2019-12-04 2020-05-12 支付宝(杭州)信息技术有限公司 文本预处理方法、分类方法、装置及设备
CN111338683A (zh) * 2020-02-04 2020-06-26 北京邮电大学 一种算法类程序代码分类方法、装置、设备及介质
CN113626587B (zh) * 2020-05-08 2024-03-29 武汉金山办公软件有限公司 一种文本类别识别方法、装置、电子设备及介质
CN111680132B (zh) * 2020-07-08 2023-05-19 中国人民解放军国防科技大学 一种用于互联网文本信息的噪声过滤和自动分类方法

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2014198595A1 (fr) * 2013-06-14 2014-12-18 Proxem Procede de classification thematique automatique d'un fichier de texte numerique
CN106502989A (zh) * 2016-10-31 2017-03-15 东软集团股份有限公司 情感分析方法及装置
CN107357895A (zh) * 2017-01-05 2017-11-17 大连理工大学 一种基于词袋模型的文本表示的处理方法
CN107679144A (zh) * 2017-09-25 2018-02-09 平安科技(深圳)有限公司 基于语义相似度的新闻语句聚类方法、装置及存储介质
CN109213843A (zh) * 2018-07-23 2019-01-15 北京密境和风科技有限公司 一种垃圾文本信息的检测方法及装置
CN110096591A (zh) * 2019-04-04 2019-08-06 平安科技(深圳)有限公司 基于词袋模型的长文本分类方法、装置、计算机设备及存储介质

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9396724B2 (en) * 2013-05-29 2016-07-19 Tencent Technology (Shenzhen) Company Limited Method and apparatus for building a language model
CN106951498A (zh) * 2017-03-15 2017-07-14 国信优易数据有限公司 文本聚类方法
CN108595590A (zh) * 2018-04-19 2018-09-28 中国科学院电子学研究所苏州研究院 一种基于融合注意力模型的中文文本分类方法
CN108829818B (zh) * 2018-06-12 2021-05-25 中国科学院计算技术研究所 一种文本分类方法
CN108959246B (zh) * 2018-06-12 2022-07-12 北京慧闻科技(集团)有限公司 基于改进的注意力机制的答案选择方法、装置和电子设备
CN109165284B (zh) * 2018-08-22 2020-06-16 重庆邮电大学 一种基于大数据的金融领域人机对话意图识别方法
CN109408818B (zh) * 2018-10-12 2023-04-07 平安科技(深圳)有限公司 新词识别方法、装置、计算机设备及存储介质
CN109117472A (zh) * 2018-11-12 2019-01-01 新疆大学 一种基于深度学习的维吾尔文命名实体识别方法

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2014198595A1 (fr) * 2013-06-14 2014-12-18 Proxem Procede de classification thematique automatique d'un fichier de texte numerique
CN106502989A (zh) * 2016-10-31 2017-03-15 东软集团股份有限公司 情感分析方法及装置
CN107357895A (zh) * 2017-01-05 2017-11-17 大连理工大学 一种基于词袋模型的文本表示的处理方法
CN107679144A (zh) * 2017-09-25 2018-02-09 平安科技(深圳)有限公司 基于语义相似度的新闻语句聚类方法、装置及存储介质
CN109213843A (zh) * 2018-07-23 2019-01-15 北京密境和风科技有限公司 一种垃圾文本信息的检测方法及装置
CN110096591A (zh) * 2019-04-04 2019-08-06 平安科技(深圳)有限公司 基于词袋模型的长文本分类方法、装置、计算机设备及存储介质

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
LUV_G.EM: "Discrete text representation (1): bag of words model (bag of words)", 16 March 2019 (2019-03-16), pages 1 - 5, XP055740410, Retrieved from the Internet <URL:https://www.cnblogs.com/Luv-GEM/p/10543612.html> *

Also Published As

Publication number Publication date
CN110096591A (zh) 2019-08-06

Similar Documents

Publication Publication Date Title
WO2020199595A1 (zh) 基于词袋模型的长文本分类方法、装置、计算机设备及存储介质
US11816442B2 (en) Multi-turn dialogue response generation with autoregressive transformer models
WO2021068352A1 (zh) Faq问答对自动构建方法、装置、计算机设备及存储介质
WO2020082560A1 (zh) 文本关键词提取方法、装置、设备及计算机可读存储介质
CN110069709B (zh) 意图识别方法、装置、计算机可读介质及电子设备
CN109815336B (zh) 一种文本聚合方法及系统
US20160321541A1 (en) Information processing method and apparatus
US11023766B2 (en) Automatic optical character recognition (OCR) correction
CN110008309B (zh) 一种短语挖掘方法及装置
CN112528637B (zh) 文本处理模型训练方法、装置、计算机设备和存储介质
CN110442725B (zh) 实体关系抽取方法及装置
JP2010250814A (ja) 品詞タグ付けシステム、品詞タグ付けモデルのトレーニング装置および方法
US20140032207A1 (en) Information Classification Based on Product Recognition
WO2017075017A1 (en) Automatic conversation creator for news
US11934781B2 (en) Systems and methods for controllable text summarization
CN112580328A (zh) 事件信息的抽取方法及装置、存储介质、电子设备
US11790174B2 (en) Entity recognition method and apparatus
US11983183B2 (en) Techniques for training machine learning models using actor data
CN111091001B (zh) 一种词语的词向量的生成方法、装置及设备
WO2021227951A1 (zh) 前端页面元素的命名
CN110874408A (zh) 模型训练方法、文本识别方法、装置及计算设备
CN114861004A (zh) 一种社交事件检测方法、装置及系统
JP7333490B1 (ja) 音声信号に関連するコンテンツを決定する方法、コンピューター可読保存媒体に保存されたコンピュータープログラム及びコンピューティング装置
CN115879446B (zh) 文本处理方法、深度学习模型训练方法、装置以及设备
CN117971357B (zh) 有限状态自动机验证方法、装置、电子设备及存储介质

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19922530

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19922530

Country of ref document: EP

Kind code of ref document: A1