CN108875072B - Text classification method, device, equipment and storage medium

Info

Publication number
CN108875072B
CN108875072B (application CN201810729166.5A)
Authority
CN
China
Prior art keywords
text
distribution
classification
word
continuous multiple
Prior art date
Legal status
Active
Application number
CN201810729166.5A
Other languages
Chinese (zh)
Other versions
CN108875072A
Inventor
陈立
杨俊�
王珵
Current Assignee
4Paradigm Beijing Technology Co Ltd
Original Assignee
4Paradigm Beijing Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by 4Paradigm Beijing Technology Co Ltd
Priority to CN201810729166.5A
Publication of CN108875072A
Application granted
Publication of CN108875072B

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/284Lexical analysis, e.g. tokenisation or collocates

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a text classification method, apparatus, device, and storage medium. A distribution representation of each single character and/or continuous multiple characters of the first texts in a first text set is learned in an unsupervised machine-learning manner. For each second text in at least part of the second texts in a second text set, the distribution representation of each single character and/or continuous multiple characters in that second text is obtained based on the representations learned on the first text set. A classification model is then obtained by training in a supervised machine-learning manner, with at least these distribution representations as training-sample features and the true category of each second text as its training-sample label. Finally, the classification model is used to predict the categories of the first texts. The first texts in the first text set can thus be classified effectively even when they lack labeled data.

Description

Text classification method, device, equipment and storage medium
Technical Field
The present invention relates generally to the field of machine learning technology, and more particularly, to a text classification method, apparatus, device, and storage medium.
Background
For the text classification problem, the prevailing approach is to train a classification model in a supervised machine-learning manner: a data set is constructed for the scenario, the data set is labeled, and a classification model is then trained on the labeled data. Training in this way often runs into the following problems.
On one hand, training presupposes that the data in the data set are labeled. Since data sets are generally large, the labeling cost is extremely high, and large-scale training data is difficult to construct in practice.
On the other hand, texts in the data set are often unusual, and word segmentation, limited by the word lexicon, easily produces wrong segmentation results; for example, a name such as "千百度" may be segmented into "千/百度". This introduces interfering features during model construction and seriously degrades model performance.
Therefore, a new text classification scheme is needed to address at least one of the above-mentioned problems.
Disclosure of Invention
An exemplary embodiment of the present invention is to provide a text classification method, apparatus, device, and storage medium to solve at least one of the above-mentioned problems.
According to a first aspect of the present invention, a text classification method is provided, including: learning a distribution representation of each single character and/or continuous multiple characters of the first texts in a first text set in an unsupervised machine-learning manner; for each second text in at least part of the second texts in a second text set, obtaining the distribution representation of each single character and/or continuous multiple characters in the second text based on the distribution representations of single characters and/or continuous multiple characters learned on the first text set; training in a supervised machine-learning manner to obtain a classification model, with at least the distribution representations of the single characters and/or continuous multiple characters in the second text as training-sample features and the true category of the second text as the training-sample label; and predicting the categories of the first texts using the classification model.
Optionally, the first text is the same or similar in content to the second text; and/or the number of first texts in the first text set is greater than the number of second texts in the second text set; and/or the data distribution of the first text in the first text set is different from the data distribution of the second text in the second text set.
Optionally, the distribution representation is a word vector.
Optionally, the continuous multiple characters comprise continuous multiple characters of different lengths, and the distribution representation of continuous multiple characters is equal to the sum of the distribution representations of the single characters contained therein.
Optionally, the step of predicting the category of the first text using the classification model includes: predicting the category of the first text using the classification model, with at least the distribution representation of each single character and/or continuous multiple characters in the first text as a prediction-sample feature.
Optionally, the text classification method further includes: performing word segmentation on the second text and obtaining a one-hot feature for each segmented word, wherein the step of using at least the distribution representation of each single character and/or continuous multiple characters in the second text as training-sample features includes: using the distribution representations of each single character and/or continuous multiple characters in the second text, together with the one-hot features of the segmented words in the second text, as training-sample features.
Optionally, the step of predicting the category of the first text using the classification model includes: performing word segmentation on the first text and obtaining a one-hot feature for each segmented word; and predicting the category of the first text using the classification model, with the distribution representations of each single character and/or continuous multiple characters in the first text and the one-hot features of the segmented words in the first text as prediction-sample features.
Optionally, the text classification method further includes: and mapping the classification result of the classification model according to the classification requirement of the first text.
Optionally, the first text and the second text are both business names, and the method is used for classifying the business names in the first text set with respect to business categories.
According to the second aspect of the present invention, there is also provided a text classification apparatus, comprising: the learning module is used for learning the distribution representation of each single character and/or continuous multiple characters of the first text in the first text set in an unsupervised machine learning mode; the acquisition module is used for acquiring the distribution representation of each single character and/or continuous multiple characters in the second text based on the learned distribution representation of the single character and/or continuous multiple characters in the first text set aiming at each second text in at least part of the second text in the second text set; the training module is used for taking the distribution representation of each single character and/or continuous multiple characters in the second text as the training sample characteristics, taking the real category of the second text as the training sample mark, and training by using a supervised machine learning mode to obtain a classification model; and a prediction module for predicting the category of the first text using the classification model.
Optionally, the content of the first text is the same as or similar to the content of the second text, and/or the number of the first texts in the first text set is larger than the number of the second texts in the second text set, and/or the data distribution of the first texts in the first text set is different from the data distribution of the second texts in the second text set.
Optionally, the distribution representation is a word vector.
Optionally, the continuous multiple characters comprise continuous multiple characters of different lengths, and the distribution representation of continuous multiple characters is equal to the sum of the distribution representations of the single characters contained therein.
Optionally, the prediction module predicts the category of the first text using the classification model, with at least the distribution representation of each single character and/or continuous multiple characters in the first text as a prediction-sample feature.
Optionally, the text classification apparatus further includes: a feature obtaining module for performing word segmentation on the second text and obtaining a one-hot feature of each segmented word, and the training module takes the distribution representations of each single character and/or continuous multiple characters in the second text and the one-hot features of the segmented words in the second text as training-sample features.
Optionally, the feature obtaining module performs word segmentation on the first text and obtains a one-hot feature of each segmented word, and the prediction module predicts the category of the first text using the classification model, with the distribution representations of each single character and/or continuous multiple characters in the first text and the one-hot features of the segmented words in the first text as prediction-sample features.
Optionally, the text classification apparatus further includes: and the mapping module is used for mapping the classification result of the classification model according to the classification requirement of the first text.
Optionally, the first text and the second text are both business names, and the device is configured to classify the business names in the first text set with respect to business categories.
According to a third aspect of the present invention, there is also provided a computing device comprising: a processor; and a memory having executable code stored thereon, which when executed by the processor, causes the processor to perform the method according to the first aspect of the invention.
According to a fourth aspect of the invention, there is also provided a non-transitory machine-readable storage medium having stored thereon executable code, which when executed by a processor of an electronic device, causes the processor to perform the method as set forth in the first aspect of the invention.
For a first text set to be classified, the method learns the intrinsic features (such as distribution representations) of single characters and/or continuous multiple characters in the first text set in an unsupervised machine-learning manner, migrates the learned features into a second text set whose true categories are known, performs model training, and classifies the first texts in the first text set using the trained model. The first texts can thus be classified effectively even when the first text set lacks labeled data.
Drawings
The above and other objects and features of exemplary embodiments of the present invention will become more apparent from the following description taken in conjunction with the accompanying drawings which illustrate exemplary embodiments, wherein:
fig. 1 illustrates a flowchart of a text classification method according to an exemplary embodiment of the present invention.
Fig. 2 shows a flowchart of an implementation of a text classification method according to another exemplary embodiment of the present invention.
Fig. 3 illustrates a block diagram of a text classification apparatus according to an exemplary embodiment of the present invention.
Fig. 4 shows a schematic structural diagram of a computing device for data processing that can be used to implement the text classification method described above according to an exemplary embodiment of the present invention.
Detailed Description
Reference will now be made in detail to the embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein like reference numerals refer to the like elements throughout. The embodiments are described below in order to explain the present invention by referring to the figures.
Fig. 1 illustrates a flowchart of a text classification method according to an exemplary embodiment of the present invention.
Referring to fig. 1, in step S110, a distribution representation of individual single words and/or continuous multiple words of a first text in a first text set is learned through an unsupervised machine learning manner.
The first text set is the set of texts to be classified, and the texts in it may be referred to as first texts. Here, the features of the data (i.e., texts) in the first text set are learned mainly in an unsupervised machine-learning manner (e.g., a neural-network-based manner). Specifically, the distribution representation of each single word in the first text set may be learned at single-word granularity; the distribution representation of each continuous multi-word may be learned at continuous-multi-word granularity; or the distribution representations of single words and continuous multiple words may be learned simultaneously at their respective granularities.
The continuous multiple words may comprise continuous multiple words of different lengths, such as two consecutive words, three consecutive words, and so on. The distribution representation is the intrinsic feature of each single word and/or continuous multiple words in the first text set, learned in an unsupervised machine-learning manner. As an example, the distribution representation may be a word vector (word embedding), a vectorized representation of words by which words in natural language are converted into dense vectors that a computer can process. Word vectors and the ways of generating them are mature technology and are not described herein again.
When the distribution representations of single words and of continuous multiple words in the first text set are both learned, the two may be learned independently in an unsupervised machine-learning manner; alternatively, the distribution representations of single words may be learned first, and the representations of continuous multiple words then derived from them, for example by setting the distribution representation of each continuous multi-word equal to the sum of the distribution representations of the single words it contains.
For example, a word vector (i.e., char2vec) model may be constructed at single-word granularity to obtain the distribution representation of each single word in the first text set, and another char2vec model may be constructed at continuous-multi-word granularity to obtain the distribution representation of each continuous multi-word. Alternatively, a char2vec model may be constructed at single-word granularity only, and the distribution representations of continuous multiple words then built from the single-word representations; for example, the representation of two consecutive words may be constructed as r(ab) = r(a) + r(b), where r(ab) is the distribution representation of the consecutive pair ab, and r(a) and r(b) are the distribution representations of the single words a and b. For continuous multiple words of other lengths (e.g., three consecutive words), the representation may be constructed in the same manner, which is not repeated here.
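The construction above can be sketched as follows (an illustrative sketch, not part of the patent disclosure; the vectors and dimensionality are placeholder assumptions):

```python
import numpy as np

# Hypothetical single-word (character) vectors, as would be learned by a
# char2vec model; the 4-dimensional values here are placeholders.
char_vectors = {
    "a": np.array([0.1, 0.3, -0.2, 0.5]),
    "b": np.array([0.4, -0.1, 0.0, 0.2]),
}

def ngram_representation(ngram: str, vectors: dict) -> np.ndarray:
    """Distribution representation of continuous multiple words as the sum
    of the representations of the single words they contain."""
    return np.sum([vectors[ch] for ch in ngram], axis=0)

r_ab = ngram_representation("ab", char_vectors)  # r(ab) = r(a) + r(b)
```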
In step S120, for each second text in at least a part of the second texts in the second text set, based on the learned distribution representation of the corresponding single word and/or continuous multiple words in the first text set, a distribution representation of each single word and/or continuous multiple words in the second text is obtained.
In the present invention, the second text set may be a set of texts whose true categories are known, and the texts in it may be referred to as second texts. Preferably, a text whose content is the same as or similar to that of the first texts and whose true category is known may be used as a second text (sameness or similarity of content here reflects the correlation between the two text sets, i.e., they are not completely unrelated: they may partially coincide, and one may even be a subset of the other); and/or the number of second texts in the second text set may be less than the number of first texts in the first text set (e.g., much less); and/or the data distribution of the second text set may differ from that of the first text set (e.g., to some degree). For example, the first text set may be a set of merchant names to be classified, and the second text set a set of merchant names with known true categories acquired through other channels.
The method is mainly based on the idea of transfer learning, and the distribution representation of the single characters and/or continuous multiple characters learned in the first text set is transferred into the second text set, so that the distribution representation of each single character and/or continuous multiple character in each second text in at least part of second texts is obtained.
Specifically, after the distribution representations of each single word and/or continuous multiple words in the first text set have been learned at single-word and/or multi-word granularity, the representation of each single word and/or continuous multiple words in each of the at least some second texts can be obtained, at the same granularity, by looking up (e.g., in a table) the representation learned for the identical single word or continuous multi-word in the first text set.
It should be noted that, when a certain single word or continuous multi-word in the second text does not appear in the first text set, its learned distribution representation cannot be found by lookup. In this case, such a single word or continuous multi-word may be replaced with a special symbol (e.g., a specific vector) that serves as its distribution representation; or it may be ignored; or the second text containing it may be discarded.
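The lookup with its three fallback strategies might be sketched as follows (the function name, dimensionality, and strategy parameter are illustrative assumptions):

```python
import numpy as np

DIM = 4
UNK_VECTOR = np.zeros(DIM)  # special symbol standing in for unseen units

def lookup_representation(unit, learned, strategy="special"):
    """Look up the representation learned on the first text set; 'unit' is a
    single word or a continuous multi-word string."""
    if unit in learned:
        return learned[unit]
    if strategy == "special":   # replace with a special vector
        return UNK_VECTOR
    if strategy == "ignore":    # ignore this single word / multi-word
        return None
    raise KeyError(unit)        # "discard": caller drops the whole second text
```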
In step S130, at least the distribution representation of each single word and/or continuous multiple words in the second text is used as a training sample feature, the real category of the second text is used as a training sample label, and a supervised machine learning manner is used for training to obtain a classification model.
In an embodiment of the present invention, only the distribution representation of each single word and/or continuous multiple words in the second text may be used as the training sample feature, the real category of the second text is used as the training sample label, and the supervised machine learning manner is used for training to obtain the classification model.
However, exemplary implementations of the present invention are not limited thereto; in addition to the distribution representations, the training samples may include any other features.
For example, in another embodiment of the present invention, word segmentation may further be performed on the second text and a one-hot (one-hot encoding) feature obtained for each segmented word; the distribution representations of each single word and/or continuous multiple words in the second text and the one-hot features of the segmented words in the second text are then used together as training-sample features, the true category of the second text is used as the training-sample label, and a classification model is obtained by training in a supervised machine-learning manner. The concept and encoding of one-hot features are well known in the art and are not described herein again.
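A minimal sketch of this training step, assuming scikit-learn as the supervised learner and jieba as one possible word segmenter (the patent names neither library; the merchant names, labels, stand-in character vectors, and mean-pooling aggregation below are assumptions for illustration):

```python
import numpy as np
import jieba
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

second_texts = ["江味龙虾馆", "仁和大药房"]   # second texts with known categories
true_labels = ["restaurant", "pharmacy"]      # training-sample labels

# Stand-in single-word vectors; in practice these are the distribution
# representations learned on the first text set (step S110).
rng = np.random.default_rng(0)
DIM = 8
char_vectors = {ch: rng.normal(size=DIM) for ch in set("".join(second_texts))}

def distribution_features(text):
    # Aggregate per-character vectors into one fixed-length text feature.
    # Mean pooling is an assumption; the patent does not fix the aggregation.
    vecs = [char_vectors[ch] for ch in text if ch in char_vectors]
    return np.mean(vecs, axis=0) if vecs else np.zeros(DIM)

# One-hot features of each segmented word.
one_hot = CountVectorizer(tokenizer=jieba.lcut, token_pattern=None, binary=True)
word_features = one_hot.fit_transform(second_texts).toarray()

# Concatenate distribution representations with one-hot word features.
X = np.hstack([np.vstack([distribution_features(t) for t in second_texts]),
               word_features])
classifier = LogisticRegression().fit(X, true_labels)
```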
In step S140, the classification model is used to predict the category of the first text. Here, the first text may be feature extracted according to a feature extraction rule of the classification model to form a corresponding prediction sample, that is, the prediction sample may be generated based on the first text in a manner completely consistent with the feature extraction of the training sample. After the prediction samples are provided for the classification model, corresponding prediction results can be obtained, and the prediction results can be directly used as classification results or can be further processed to obtain final classification results.
When the classification model is trained, the features of the training samples are obtained by transferring the distribution representation of the single words and/or continuous multiple words learned in the first text set, and accordingly, the classification model obtained by training can be applied to the first text set, that is, the classification of the first text in the first text set can be predicted by using the trained classification model.
Specifically, the category of the first text can be predicted using the classification model with at least the distribution representation of each single word and/or continuous multiple words in the first text as prediction-sample features. As described above, when the classification model is trained, the training-sample features may include the distribution representations of each single word and/or continuous multiple words in the second text, and may also include one-hot features. Accordingly, when the classification model is used to predict the category of the first text, the features constructed for the first text differ according to how the training-sample features were configured.
For example, when the distribution representations of each single word and/or continuous multiple words in the second text were used alone as training-sample features, the category of the first text is predicted using the distribution representations of each single word and/or continuous multiple words in the first text as prediction-sample features. When the distribution representations in the second text together with the one-hot features of its segmented words were used as training-sample features, word segmentation is likewise performed on the first text, the one-hot feature of each segmented word is obtained, and the distribution representations of each single word and/or continuous multiple words in the first text together with the one-hot features of its segmented words are used as prediction-sample features. Preferably, the one-hot features of segmented words in the first text are encoded in the same way as those in the second text, i.e., the same segmented word receives the same one-hot feature in both.
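Continuing the sketch above, a prediction sample for a first text is generated with exactly the same feature extraction; reusing the already-fitted vectorizer guarantees that the same segmented word receives the same one-hot feature:

```python
first_text = "蜀香小龙虾"   # an unlabeled first text, invented for illustration

# Same feature construction as at training time: distribution representation
# plus one-hot word features (unknown characters are ignored here).
x = np.hstack([distribution_features(first_text),
               one_hot.transform([first_text]).toarray()[0]])
predicted_category = classifier.predict([x])[0]
```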
As described above, the first text set is the set of texts to be classified, and the second text set is the set of texts whose true categories are known. Because the true categories of the second texts may differ somewhat from the user's classification requirement for the first text set, the classification results predicted by the trained model may not meet that specific requirement well. In view of this, the present invention provides that the classification results of the classification model may be mapped according to the classification requirement of the first texts, so that the mapped results meet the requirement. For example, suppose the first texts and the second texts are both merchant names, the classification requirement is to classify the merchant names in the first text set by business category, and the true categories of the second texts are "Chinese food", "western food", "medical health", "fruits and vegetables", and so on. When the trained classification model predicts the category of a first text as "Chinese food" or "western food", the result can be mapped to "restaurant", yielding the business category of the corresponding merchant name in the first text set.
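A minimal sketch of the mapping step (the source categories follow the patent's example; the mapped business categories are assumptions for illustration):

```python
# Maps the model's fine-grained predictions onto the business categories
# actually required for the first text set.
CATEGORY_MAP = {
    "Chinese food": "restaurant",
    "western food": "restaurant",
    "medical health": "health care",        # assumed target category
    "fruits and vegetables": "groceries",   # assumed target category
}

def map_classification(predicted: str) -> str:
    # Pass predictions through unchanged when no mapping is required.
    return CATEGORY_MAP.get(predicted, predicted)
```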
In summary, for a first text set to be classified, the present invention learns the intrinsic features (such as distribution representations) of single words and/or continuous multiple words in the first text set in an unsupervised machine-learning manner, migrates the learned features into a second text set whose true categories are known, performs model training, and classifies the first texts in the first text set using the trained model. The first texts can therefore be classified effectively even when the first text set lacks labeled data.
Furthermore, the invention learns the intrinsic features of the data in the first text set at single-word and/or continuous-multi-word granularity and migrates them to the second text set, which not only enriches the sample features but also mitigates word segmentation errors to a certain extent.
So far, the basic implementation principle and flow of the text classification method of the present invention have been explained with reference to fig. 1. In the following, the classification of merchant names by business category is taken as an example; it should be understood that the text classification method of the present invention can also be applied to other, non-merchant-name classification scenarios, such as topic classification of news headlines or category classification of financial transaction descriptions.
Fig. 2 shows a flowchart of an implementation of a text classification method according to another exemplary embodiment of the present invention.
First, a plurality of noun concepts related to the present embodiment will be described.
Transfer learning: when the problem originally to be learned cannot be solved directly, a similar problem that can be solved (i.e., the target problem) is found, and its solution is transferred in some form to the original problem.
Original problem: the problem originally to be learned. It belongs to text classification and may, for example, be the classification of merchant names by business category.
Target problem: the similar, solvable problem whose solution is transferred.
Original domain: the set of data and tasks corresponding to the original problem; in the present invention it is massive, unlabeled data. The first text set mentioned above can be regarded as the original-domain data.
Target domain: the set of data and tasks used for model training, corresponding to the target problem; in the present invention it is labeled data of moderate quantity. The second text set mentioned above can be regarded as the target-domain data.
As shown in fig. 2, some labeled merchant data (i.e., data whose true categories are known) may first be obtained from public sources, such as data from certain merchant-classification competitions. Such data tend to be small in scale (tens or hundreds of thousands of records) and do not follow the distribution of the scene data to be classified (i.e., the original-domain data). These data may be used as the target-domain data.
For the massive merchant data to be classified (the original-domain data), each record can be segmented at character granularity; for example, "江味龙虾馆" ("Jiangwei Lobster House") can be segmented into the sequence "江/味/龙/虾/馆". From the data segmented at character granularity, a char2vec model can be constructed to obtain the distributed representation of each single word, kept as a feature for later use. From the obtained single-word representations, distribution representations of continuous multiple words can also be constructed; for example, the representation of two consecutive words may be constructed as r(ab) = r(a) + r(b), where r(ab) is the distribution representation of the pair ab, and r(a) and r(b) are the distribution representations of the single words a and b. In addition, word segmentation can be performed on each record of the original-domain data to obtain the one-hot feature of each segmented word. The single-word representations, two-word representations, and one-hot features together serve as the feature expression of the original-domain data.
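This char2vec step might be realized, for example, with gensim's Word2Vec trained over character sequences (the patent does not name a library; the names and parameters below are illustrative assumptions):

```python
from gensim.models import Word2Vec

# Original-domain merchant names, segmented at character granularity,
# e.g. "江味龙虾馆" -> ["江", "味", "龙", "虾", "馆"].
names = ["江味龙虾馆", "仁和大药房", "蜀香小龙虾"]
char_sequences = [list(name) for name in names]

char2vec = Word2Vec(sentences=char_sequences, vector_size=50,
                    window=2, min_count=1, sg=1)  # skip-gram variant

r_jiang = char2vec.wv["江"]                       # r(江), a single-word vector
r_pair = char2vec.wv["龙"] + char2vec.wv["虾"]    # r(龙虾) = r(龙) + r(虾)
```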
After the feature expression of the original-domain data is obtained, the learned distribution representations of single words and of continuous multiple words can be migrated to the target-domain data; for example, the single-word representations and the two-word representations may be migrated. Features in the target-domain data then include three parts: (a) one-hot representations at word granularity, obtained by performing word segmentation on the target-domain samples; (b) distribution representations of single words, obtained by splitting the target domain by character; and (c) distribution representations of continuous multiple words, obtained by splitting the target domain by character and combining adjacent characters. The purpose of introducing (b) and (c) is to enrich the features through combinations of single and double words and to avoid insufficient features caused by word segmentation errors.
The features of each record in the target-domain data can thus be used as sample features, and its label as the sample label, yielding a training set for supervised learning. Training in a supervised machine-learning manner then produces the classification model.
The classification model obtained by training can be applied to original domain data to predict the category of each piece of data in the original domain data, and the output of the classification model can be mapped according to actual requirements so as to realize classification of merchants to be classified in the original domain data.
The method mainly comprises two parts. The first part is feature learning: learning the intrinsic regularities of the unlabeled scene data (i.e., the original-domain data) and constructing features from them. The second part uses transfer learning to apply the learned features to the labeled public data (i.e., the target-domain data) for model training; the trained model can then predict the categories of the unlabeled scene data, achieving effective classification in the absence of labeled scene data.
Taking merchant-name classification as an example: a relatively small amount of labeled public data related to the merchant-name domain can be obtained; feature learning is then performed on the unlabeled in-scene data, for example learning distribution representations at single-word and/or continuous-multi-word granularity; these representations are migrated to the small amount of labeled data from outside the scene to train a classification model; and after training, the model is applied to the in-scene data, thereby classifying the merchant names effectively.
The text classification method of the present invention can also be implemented as a text classification device. Fig. 3 illustrates a block diagram of a text classification apparatus according to an exemplary embodiment of the present invention. The functional blocks of the text classification apparatus may be implemented by hardware, software, or a combination of hardware and software that implement the principles of the present invention. It will be appreciated by those skilled in the art that the functional blocks described in fig. 3 may be combined or divided into sub-blocks to implement the principles of the invention described above. Thus, the description herein may support any possible combination, or division, or further definition of the functional modules described herein.
In the following, functional modules that the text classification device can have and operations that each functional module can perform are briefly described, and details related thereto may be referred to the above description, and are not repeated here.
Referring to fig. 3, the text classification apparatus 300 includes a learning module 310, an obtaining module 320, a training module 330, and a prediction module 340. As an example, the first text and the second text may both be merchant names, and the apparatus may be configured to classify the merchant names in the first text set with respect to business categories.
The learning module 310 is configured to learn distribution representations of individual single words and/or continuous multiple words of the first text in the first text set in an unsupervised machine learning manner. The obtaining module 320 is configured to obtain a distribution representation of each single word and/or continuous multiple words in each second text in at least part of the second texts in the second text set based on the learned distribution representation of the single word and/or continuous multiple words in the first text set. The training module 330 is configured to use at least the distribution representation of each single word and/or continuous multiple words in the second text as a training sample feature, use the real category of the second text as a training sample label, and perform training in a supervised machine learning manner to obtain a classification model. The prediction module 340 is configured to predict a category of the first text using the classification model.
The content of the first text and the second text may be the same or similar, and/or the number of the first text in the first set of texts may be greater than the number of the second text in the second set of texts, and/or the data distribution of the first text in the first set of texts may be different from the data distribution of the second text in the second set of texts. Wherein, the distribution characterization can be a word vector. The consecutive multi-words may comprise consecutive multi-words of different word numbers and the distribution characteristic of each consecutive multi-word is equal to the sum of the distribution characteristics of the individual single words in the consecutive multi-words.
As an example, the prediction module 340 may predict the category of the first text by using a classification model with at least a distribution characterization of each single word and/or continuous multiple words in the first text as a prediction sample feature.
As shown in fig. 3, the text classification apparatus 300 may further optionally include a feature obtaining module 350 shown by a dashed box in the figure. The feature obtaining module 350 is configured to perform word segmentation on the second text, and obtain a one-hot feature of each segmented word. The training module 330 may use a distribution characteristic of each single word and/or continuous multiple words in the second text and a one-hot characteristic of each segmented word in the second text as a training sample characteristic.
Moreover, the feature obtaining module 350 may perform word segmentation on the first text and obtain a one-hot feature of each segmented word, and the predicting module 340 may predict the category of the first text by using a classification model, with the distribution feature of each single word and/or continuous multiple words in the first text and the one-hot feature of each segmented word in the first text as prediction sample features.
As shown in fig. 3, the text classification apparatus 300 may further optionally include a mapping module 360 shown by a dashed box. The mapping module 360 is configured to map the prediction result of the classification model according to the classification requirement of the first text.
Fig. 4 shows a schematic structural diagram of a computing device for data processing that can be used to implement the text classification method described above according to an exemplary embodiment of the present invention.
Referring to fig. 4, computing device 400 includes memory 410 and processor 420.
The processor 420 may be a multi-core processor or may include a plurality of processors. In some embodiments, processor 420 may include a general-purpose host processor and one or more special coprocessors such as a Graphics Processor (GPU), a Digital Signal Processor (DSP), or the like. In some embodiments, processor 420 may be implemented using custom circuits, such as an Application Specific Integrated Circuit (ASIC) or a Field Programmable Gate Array (FPGA).
The memory 410 may include various types of storage units, such as system memory, read-only memory (ROM), and permanent storage. The ROM may store static data or instructions required by the processor 420 or other modules of the computer. The permanent storage may be a readable and writable storage device, and may be non-volatile so that stored instructions and data are not lost even after the computer is powered off. In some embodiments, the permanent storage is a mass storage device (e.g., a magnetic or optical disk, or flash memory); in other embodiments, it may be a removable storage device (e.g., a floppy disk or an optical drive). The system memory may be a readable and writable volatile memory device, such as dynamic random access memory, and may store instructions and data that some or all of the processors require at runtime. Further, the memory 410 may include any combination of computer-readable storage media, including various types of semiconductor memory chips (DRAM, SRAM, SDRAM, flash memory, programmable read-only memory) as well as magnetic and/or optical disks. In some embodiments, the memory 410 may include a readable and/or writable removable storage device, such as a compact disc (CD), a read-only digital versatile disc (e.g., DVD-ROM or dual-layer DVD-ROM), a read-only Blu-ray disc, an ultra-density optical disc, a flash memory card (e.g., an SD card, mini SD card, or Micro-SD card), or a magnetic floppy disk. Computer-readable storage media do not contain carrier waves or transitory electronic signals transmitted wirelessly or over wires.
The memory 410 has stored thereon executable code that, when executed by the processor 420, may cause the processor 420 to perform the text classification methods described above.
The text classification method, apparatus and computing device according to the present invention have been described in detail above with reference to the accompanying drawings.
Furthermore, the method according to the invention may also be implemented as a computer program or computer program product comprising computer program code instructions for carrying out the above-mentioned steps defined in the above-mentioned method of the invention.
Alternatively, the invention may also be embodied as a non-transitory machine-readable storage medium (or computer-readable storage medium, or machine-readable storage medium) having stored thereon executable code (or a computer program, or computer instruction code) which, when executed by a processor of an electronic device (or computing device, server, etc.), causes the processor to perform the steps of the above-described method according to the invention.
Those of skill would further appreciate that the various illustrative logical blocks, modules, circuits, and algorithm steps described in connection with the disclosure herein may be implemented as electronic hardware, computer software, or combinations of both.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems and methods according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
Having described embodiments of the present invention, the foregoing description is intended to be exemplary, not exhaustive, and not limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein is chosen in order to best explain the principles of the embodiments, the practical application, or improvements made to the technology in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

Claims (20)

1. A method of text classification, comprising:
learning the distribution representation of each single character and/or continuous multiple characters of the first text in the first text set in an unsupervised machine learning mode, wherein the distribution representation is an internal feature;
aiming at each second text in at least part of second texts in a second text set, obtaining the distribution representation of each single character and/or continuous multiple characters in the second text based on the learned distribution representation of the single character and/or continuous multiple characters in the first text set;
at least taking the distribution representation of each single character and/or continuous multiple characters in the second text as a training sample characteristic, taking the real category of the second text as a training sample mark, and training by using a supervised machine learning mode to obtain a classification model; and
and predicting the category of the first text by using a classification model.
2. The text classification method according to claim 1,
the first text and the second text have the same or similar content; and/or
The number of first texts in the first text set is larger than the number of second texts in the second text set; and/or
The data distribution of a first text in the first text set is different from the data distribution of a second text in the second text set.
3. The text classification method according to claim 1,
the distribution is characterized as a word vector.
4. The text classification method according to claim 1,
the consecutive multi-words comprise consecutive multi-words of different word counts, and,
the distribution representation of each continuous multi-word is equal to the sum of the distribution representations of the single words in the continuous multi-word.
5. The method of claim 1, wherein the step of predicting the category of the first text using a classification model comprises:
and at least taking the distribution representation of each single word and/or continuous multiple words in the first text as a prediction sample characteristic, and predicting the category of the first text by using the classification model.
6. The text classification method according to claim 1, further comprising:
performing word segmentation on the second text and acquiring a one-hot feature of each segmented word,
wherein the step of using at least the distribution representation of each single character and/or continuous multiple characters in the second text as training-sample features includes: using the distribution representations of each single character and/or continuous multiple characters in the second text and the one-hot features of the segmented words in the second text as training-sample features.
7. The method of claim 6, wherein the step of predicting the category of the first text using a classification model comprises:
performing word segmentation on the first text and acquiring a one-hot feature of each segmented word;
and predicting the category of the first text using the classification model, with the distribution representations of each single character and/or continuous multiple characters in the first text and the one-hot features of the segmented words in the first text as prediction-sample features.
8. The text classification method according to claim 1, further comprising:
and mapping the classification result of the classification model according to the classification requirement of the first text.
9. The text classification method according to claim 1,
the first text and the second text are both merchant names, and the method is used for classifying the merchant names in the first text set according to business categories.
10. A text classification apparatus, comprising:
the learning module is used for learning the distribution representation of each single character and/or continuous multiple characters of the first text in the first text set in an unsupervised machine learning mode, wherein the distribution representation is an internal feature;
the acquisition module is used for aiming at each second text in at least part of second texts in a second text set, and obtaining the distribution representation of each single character and/or continuous multiple characters in the second text based on the learned distribution representation of the single character and/or continuous multiple characters in the first text set;
the training module is used for taking the distribution representation of each single character and/or continuous multiple characters in the second text as the training sample characteristics, taking the real category of the second text as a training sample mark, and training by using a supervised machine learning mode to obtain a classification model; and
and the prediction module is used for predicting the category of the first text by utilizing the classification model.
11. The text classification apparatus according to claim 10,
the first text and the second text have the same or similar content and/or
The number of first texts in the first text set is greater than the number of second texts in the second text set, and/or
The data distribution of a first text in the first text set is different from the data distribution of a second text in the second text set.
12. The text classification apparatus according to claim 10,
the distribution is characterized as a word vector.
13. The text classification apparatus according to claim 10,
the consecutive multi-words comprise consecutive multi-words of different word counts, and,
the distribution representation of each continuous multi-word is equal to the sum of the distribution representations of the single words in the continuous multi-word.
14. The text classification apparatus according to claim 10,
and the prediction module at least takes the distribution representation of each single character and/or continuous multiple characters in the first text as the prediction sample characteristics, and predicts the category of the first text by using the classification model.
15. The text classification apparatus according to claim 10, further comprising:
a feature obtaining module, configured to perform word segmentation on the second text and acquire a one-hot feature of each segmented word,
wherein the training module uses the distribution representations of each single character and/or continuous multiple characters in the second text and the one-hot features of the segmented words in the second text as training-sample features.
16. The text classification apparatus according to claim 15,
the feature obtaining module performs word segmentation on the first text and acquires a one-hot feature of each segmented word,
and the prediction module predicts the category of the first text using the classification model, with the distribution representations of each single character and/or continuous multiple characters in the first text and the one-hot features of the segmented words in the first text as prediction-sample features.
17. The text classification apparatus according to claim 10, further comprising:
and the mapping module is used for mapping the classification result of the classification model according to the classification requirement of the first text.
18. The text classification apparatus according to claim 10,
the first text and the second text are both merchant names, and the device is used for classifying the merchant names in the first text set according to business categories.
19. A computing device, comprising:
a processor; and
a memory having executable code stored thereon, which when executed by the processor, causes the processor to perform the method of any of claims 1-9.
20. A non-transitory machine-readable storage medium having stored thereon executable code, which when executed by a processor of an electronic device, causes the processor to perform the method of any of claims 1-9.
CN201810729166.5A 2018-07-05 2018-07-05 Text classification method, device, equipment and storage medium Active CN108875072B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810729166.5A CN108875072B (en) 2018-07-05 2018-07-05 Text classification method, device, equipment and storage medium

Publications (2)

Publication Number Publication Date
CN108875072A CN108875072A (en) 2018-11-23
CN108875072B true CN108875072B (en) 2022-01-14

Family

ID=64298947

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810729166.5A Active CN108875072B (en) 2018-07-05 2018-07-05 Text classification method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN108875072B (en)

Families Citing this family (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109614494B (en) * 2018-12-29 2021-10-26 东软集团股份有限公司 Text classification method and related device
CN110188798B (en) * 2019-04-28 2023-08-08 创新先进技术有限公司 Object classification method and model training method and device
EP3948590A4 (en) 2019-05-01 2022-11-16 Microsoft Technology Licensing, LLC Method and system of utilizing unsupervised learning to improve text to content suggestions
EP3963474A4 (en) 2019-05-01 2022-12-14 Microsoft Technology Licensing, LLC Method and system of utilizing unsupervised learning to improve text to content suggestions
CN111406198B (en) 2020-02-24 2021-02-19 长江存储科技有限责任公司 System and method for semiconductor chip surface topography metrology
WO2021168610A1 (en) 2020-02-24 2021-09-02 Yangtze Memory Technologies Co., Ltd. Systems having light source with extended spectrum for semiconductor chip surface topography metrology
CN111356897B (en) * 2020-02-24 2021-02-19 长江存储科技有限责任公司 System and method for semiconductor chip surface topography metrology
CN111356896B (en) 2020-02-24 2021-01-12 长江存储科技有限责任公司 System and method for semiconductor chip surface topography metrology
CN111444686B (en) * 2020-03-16 2023-07-25 武汉中科医疗科技工业技术研究院有限公司 Medical data labeling method, medical data labeling device, storage medium and computer equipment
CN113761181B (en) * 2020-06-15 2024-06-14 北京京东振世信息技术有限公司 Text classification method and device

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106897262A (en) * 2016-12-09 2017-06-27 阿里巴巴集团控股有限公司 A kind of file classification method and device and treating method and apparatus

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106326214A (en) * 2016-08-29 2017-01-11 中译语通科技(北京)有限公司 Method and device for cross-language emotion analysis based on transfer learning
CN107463658A (en) * 2017-07-31 2017-12-12 广州市香港科大霍英东研究院 File classification method and device
CN107644057A (en) * 2017-08-09 2018-01-30 天津大学 A kind of absolute uneven file classification method based on transfer learning

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Xia Yu, "Research on the Application of Transfer Learning in Text Classification" (《迁移学习在文本分类中的应用研究》), China Master's Theses Full-text Database, Information Science and Technology Series (monthly), No. 03, 2018-03-15, pp. I138-2174 *

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant