CN108875045B - Method of performing machine learning process for text classification and system thereof - Google Patents


Info

Publication number
CN108875045B
Authority
CN
China
Prior art keywords
model algorithm
text
classification
algorithm
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810684574.3A
Other languages
Chinese (zh)
Other versions
CN108875045A (en)
Inventor
张滔
白杨
Current Assignee
4Paradigm Beijing Technology Co Ltd
Original Assignee
4Paradigm Beijing Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by 4Paradigm Beijing Technology Co Ltd filed Critical 4Paradigm Beijing Technology Co Ltd
Priority to CN201810684574.3A priority Critical patent/CN108875045B/en
Publication of CN108875045A publication Critical patent/CN108875045A/en
Application granted granted Critical
Publication of CN108875045B publication Critical patent/CN108875045B/en

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Machine Translation (AREA)

Abstract

A method of performing a machine learning process for text classification, and a system thereof, are provided. The method comprises: acquiring a labeled text set; providing a user with an option corresponding to a default model algorithm and an option corresponding to at least one candidate model algorithm; in a case where the user selects a specific candidate model algorithm among the at least one candidate model algorithm through the option corresponding to the at least one candidate model algorithm, training a classification model with the labeled text set according to the specific candidate model algorithm; in a case where the user selects the default model algorithm through the option corresponding to the default model algorithm, analyzing the labeled texts in the labeled text set to obtain information on at least one of the length of the labeled texts, the number of labeled texts, and the number of label categories, determining a model algorithm for text classification based on the information, and training a classification model with the labeled text set according to the determined model algorithm.

Description

Method of performing machine learning process for text classification and system thereof
Technical Field
The following description relates to the field of text classification and, more particularly, to a method of performing a machine learning process for text classification and a system thereof.
Background
In the real world, text data is abundant, and users need to process it with different technical means in various scenarios. Text classification refers to automatically labeling each text in a text set according to a certain classification system or standard. Traditional solutions based on manual work or rules cannot handle large-scale data volumes or text classification in complex scenarios, whereas machine learning (including deep learning and the like) is well suited to classification problems over large amounts of data.
However, machine learning has a high technical threshold: practitioners must master highly specialized machine learning knowledge to find a model algorithm suitable for text classification in a specific scenario, which places high demands on the technical literacy of users.
In addition, a typical approach in the prior art is to reuse the classification algorithm of a representative text data set. That is, a representative text set that is the same as, or similar to, the text set to be classified is identified, and a model algorithm for classifying the texts to be classified is formed based on the classification algorithm of that representative set. However, this ties the choice of classification model algorithm closely to the scenario of the text data set (e.g., the content of the texts), such as a comment-sentiment classification scenario or a news-topic classification scenario; if the text to be classified comes from a new scenario, it remains difficult to select a model effectively.
Disclosure of Invention
To solve at least one of the above problems, the present invention provides a method of performing a machine learning process for text classification and a system thereof.
According to an aspect of the present inventive concept, there is provided a method of performing a machine learning process for text classification. The method comprises: acquiring a labeled text set; providing a user with an option corresponding to a default model algorithm and an option corresponding to at least one candidate model algorithm; in a case where the user selects a specific candidate model algorithm among the at least one candidate model algorithm through the option corresponding to the at least one candidate model algorithm, training a classification model with the labeled text set according to the specific candidate model algorithm; in a case where the user selects the default model algorithm through the option corresponding to the default model algorithm, analyzing the labeled texts in the labeled text set to obtain information on at least one of the length of the labeled texts, the number of labeled texts, and the number of label categories; determining a model algorithm for text classification based on the information; and training a classification model with the labeled text set according to the determined model algorithm.
Optionally, the method may further comprise: classifying unlabeled text using the trained classification model.
Optionally, the step of determining a model algorithm for text classification based on the information may comprise: when the information indicates that the labeled texts are short, that the number of labeled texts is small, and/or that the number of label categories is small, determining a model algorithm not based on a neural network as the model algorithm for text classification; and when the information indicates that the labeled texts are long, that the number of labeled texts is large, and/or that the number of label categories is large, determining a neural-network-based model algorithm as the model algorithm for text classification.
Optionally, the step of determining a model algorithm for text classification based on the information may comprise: in a case where the information indicates that the number of label categories is two (binary classification), when the information indicates that the number of labeled texts is less than 100,000, determining a multinomial naive Bayes classification algorithm as the model algorithm for text classification; when the information indicates that the number of labeled texts is between 100,000 and 1,000,000, determining an algorithm that classifies texts using a convolutional neural network as the model algorithm for text classification; and when the information indicates that the number of labeled texts is greater than 1,000,000, determining an algorithm that classifies texts using a shallow network as the model algorithm for text classification.
Optionally, the step of determining a model algorithm for text classification based on the information may comprise: in a case where the information indicates that the number of label categories is greater than two (multi-class classification), when the information indicates that the number of labeled texts is less than 100,000 or greater than 1,000,000, determining an algorithm that classifies texts using a shallow network as the model algorithm for text classification; and when the information indicates that the number of labeled texts is between 100,000 and 1,000,000, determining an algorithm that classifies texts using a convolutional neural network as the model algorithm for text classification.
Optionally, the step of training a classification model using the labeled text set according to the determined model algorithm may comprise: training the classification model with the labeled text set according to the determined model algorithm and preset initial hyper-parameters corresponding to the determined model algorithm.
Optionally, the step of training a classification model using the labeled text set according to the determined model algorithm may comprise: training the classification model with the labeled text set through a corresponding automatic parameter-tuning scheme, according to the determined model algorithm and the preset initial hyper-parameters corresponding to it.
Optionally, the method may further comprise: in the event that the user selects a particular candidate model algorithm among the at least one candidate model algorithm through the option corresponding to the at least one candidate model algorithm, providing the user with a control for configuring the hyper-parameters of the particular candidate model algorithm.
Optionally, the step of training a classification model using the labeled text set according to the specific model algorithm may comprise: training the classification model with the labeled text set according to the specific model algorithm and preset initial hyper-parameters corresponding to it.
Optionally, the step of training a classification model using the labeled text set according to the specific model algorithm may comprise: training the classification model with the labeled text set through a corresponding automatic parameter-tuning scheme, according to the specific model algorithm and the preset initial hyper-parameters corresponding to it.
According to an aspect of the present inventive concept, there is provided a system that performs a machine learning process for text classification. The system comprises an acquisition unit, an interaction unit, and a training unit. The acquisition unit may acquire the labeled text set. The interaction unit may provide the user with an option corresponding to a default model algorithm and an option corresponding to at least one candidate model algorithm. In a case where the user selects a specific candidate model algorithm among the at least one candidate model algorithm through the option corresponding to the at least one candidate model algorithm, the training unit may train a classification model with the labeled text set according to the specific candidate model algorithm. Alternatively, in a case where the user selects the default model algorithm through the option corresponding to the default model algorithm, the training unit may analyze the labeled texts in the labeled text set to obtain information on at least one of the length of the labeled texts, the number of labeled texts, and the number of label categories; determine a model algorithm for text classification based on the information; and train a classification model with the labeled text set according to the determined model algorithm.
Optionally, the system may further comprise a classification unit for classifying unlabeled texts using the trained classification model.
Optionally, when the information indicates that the labeled texts are short, that the number of labeled texts is small, and/or that the number of label categories is small, the training unit determines a model algorithm not based on a neural network as the model algorithm for text classification; when the information indicates that the labeled texts are long, that the number of labeled texts is large, and/or that the number of label categories is large, the training unit determines a neural-network-based model algorithm as the model algorithm for text classification.
Optionally, in a case where the information indicates that the number of label categories is two, when the information indicates that the number of labeled texts is less than 100,000, the training unit determines a multinomial naive Bayes classification algorithm as the model algorithm for text classification; when the information indicates that the number of labeled texts is between 100,000 and 1,000,000, the training unit determines an algorithm that classifies texts using a convolutional neural network as the model algorithm for text classification; when the information indicates that the number of labeled texts is greater than 1,000,000, the training unit determines an algorithm that classifies texts using a shallow network as the model algorithm for text classification.
Optionally, in the multi-class case where the information indicates that the number of label categories is greater than two, when the information indicates that the number of labeled texts is less than 100,000 or greater than 1,000,000, the training unit determines an algorithm that classifies texts using a shallow network as the model algorithm for text classification; when the information indicates that the number of labeled texts is between 100,000 and 1,000,000, the training unit determines an algorithm that classifies texts using a convolutional neural network as the model algorithm for text classification.
Optionally, the training unit trains the classification model with the labeled text set according to the determined model algorithm and preset initial hyper-parameters corresponding to the determined model algorithm.
Optionally, the training unit trains the classification model with the labeled text set through a corresponding automatic parameter-tuning scheme, according to the determined model algorithm and the preset initial hyper-parameters corresponding to it.
Optionally, the interaction unit further provides the user with a control for configuring the hyper-parameters of the particular candidate model algorithm.
Optionally, the training unit trains the classification model with the labeled text set according to the specific model algorithm and preset initial hyper-parameters corresponding to it.
Optionally, the training unit trains the classification model with the labeled text set through a corresponding automatic parameter-tuning scheme, according to the specific model algorithm and the preset initial hyper-parameters corresponding to it.
According to another aspect of the present inventive concept, there is provided a method of performing a machine learning process for text classification. The method comprises: acquiring a labeled text set; analyzing the labeled texts in the labeled text set to obtain information on at least one of the length of the labeled texts, the number of labeled texts, and the number of label categories; determining a model algorithm for text classification based on the information; and training a classification model with the labeled text set according to the determined model algorithm.
According to another aspect of the present inventive concept, there is provided a system that performs a machine learning process for text classification. The system comprises an acquisition unit and a training unit. The acquisition unit may acquire the labeled text set. The training unit may analyze the labeled texts in the labeled text set to obtain information on at least one of the length of the labeled texts, the number of labeled texts, and the number of label categories; determine a model algorithm for text classification based on the information; and train a classification model with the labeled text set according to the determined model algorithm.
According to another aspect of the inventive concept, there is provided a computer-readable storage medium. The computer readable storage medium stores program instructions that, when executed by a processor, cause the processor to perform the method as described above.
According to another aspect of the present inventive concept, there is provided a computing device. The computing device includes: a processor; and a memory storing program instructions that, when executed by the processor, cause the processor to perform the method as described above.
The method and system for performing a machine learning process for text classification according to the present invention can automatically and effectively select a suitable model algorithm for a text-classification problem, independently of the text scenario. For a given labeled data set, a suitable model and/or related parameters can be recommended automatically according to the characteristics of the labeled text set, so that a layperson without machine learning expertise can complete modeling effectively by supplying only the model's input data and the desired classification labels, without knowing the details of the model or algorithm; this greatly reduces the difficulty of modeling and makes classification-model training and/or classification prediction easy. In addition, other candidate model algorithms and their hyper-parameter configurations can be provided, so that professionals skilled in machine learning can flexibly control the training process, ensuring the classification effect of the model for text classification.
Drawings
Fig. 1 is a flow diagram illustrating a method of performing a machine learning process for text classification according to an example embodiment.
Fig. 2 is a block diagram illustrating a system that performs a machine learning process for text classification according to an example embodiment.
Fig. 3A and 3B are schematic diagrams illustrating an operational process of a system performing a machine learning process for text classification according to an example embodiment.
Fig. 4 is a flowchart illustrating a method of performing a machine learning process for text classification according to another example embodiment.
Fig. 5 is a block diagram illustrating a system that performs a machine learning process for text classification according to another example embodiment.
Detailed Description
The present invention is susceptible to various modifications and embodiments; it should be understood that the invention is not limited to the embodiments described, but includes all modifications, equivalents, and alternatives falling within its spirit and scope. For example, the order of operations described herein is merely an example and is not limited to the order set forth; it may be changed, as will become apparent after understanding the present disclosure, except where operations must occur in a particular order. Descriptions of features known in the art may be omitted for clarity and conciseness. The terminology used in the exemplary embodiments is for describing particular embodiments only and is not intended to be limiting. As used herein, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
Fig. 1 is a flow diagram illustrating a method of performing a machine learning process for text classification according to an example embodiment.
Referring to fig. 1, in step 101, a labeled text set is obtained. In one embodiment, the obtained labeled text set may include a plurality of texts that have already been labeled (i.e., the labeled texts in the labeled text set; the manner in which the labels were produced is not limited in any way). In one embodiment, the labeled text set may be obtained through user upload, or from an (external or internal) memory in which it is pre-stored. In one embodiment, the labeled text set can be used to train a classification model; for example, it may be partitioned into training data and validation data, where the training data is used to train the classification model and the validation data is used to run and verify predictions of the trained classification model, thereby analyzing its performance.
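The train/validation partition mentioned above can be sketched as follows. This is an illustrative example, not part of the patent text; the function and parameter names (`split_labeled_set`, `train_ratio`) are assumptions.

```python
import random

def split_labeled_set(labeled_texts, train_ratio=0.8, seed=42):
    """Split a labeled text set into training and validation subsets.

    `labeled_texts` is assumed to be a list of (text, label) pairs.
    A fixed seed keeps the split reproducible across runs.
    """
    rng = random.Random(seed)
    shuffled = labeled_texts[:]          # copy so the caller's list is untouched
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]
```

For example, with 10 labeled texts and the default ratio, 8 pairs go to training and 2 to validation.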
In step 103, options relating to various model algorithms may be provided. In particular, the user may be provided with an option corresponding to a default model algorithm and an option corresponding to at least one candidate model algorithm.
As an example, the options here may be various interactive controls for assisting the user in selecting the corresponding algorithm. The candidate model algorithm may be a more typical machine learning algorithm for text classification.
In step 105, it is determined whether the user selects a default model algorithm. Whether the user selects the default model algorithm may be determined by the user's selection of the provided options.
The operation when it is determined in step 105 that the user has selected the default model algorithm, i.e., the operation with which the system automatically determines the appropriate model algorithm and performs the training process, is shown in steps 109 to 113.
In step 109, the labeled texts are analyzed. Specifically, when the user selects the default model algorithm through the option corresponding to it, the labeled texts in the labeled text set are analyzed to obtain information on at least one of: the length of the labeled texts, the number of labeled texts, and the number of label categories.
In one embodiment, the information regarding the length of the labeled texts may include: short text, medium text, and long text. As an example, a text containing fewer than 120 words (or characters) may be called a short text; a text containing between 120 and 200 words (or characters) may be called a medium text; and a text containing more than 200 words (or characters) may be called a long text. However, this is merely exemplary, and the length of the labeled texts may be divided into more or fewer categories.
The information on the number of labeled texts may include: small samples, where the number of labeled texts is less than 100,000 (10w); medium samples, where the number is between 100,000 and 1,000,000 (10w-100w); and large samples, where the number is greater than 1,000,000 (100w). However, this is merely exemplary, and the number of labeled texts may be divided into more or fewer categories.
The information regarding the number of label categories may include: binary classification (two categories) and multi-class classification (more than two categories, e.g., three, five, or fourteen categories). However, this is merely exemplary, and the number of label categories may be divided into more or fewer cases.
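The three statistics described above could be computed as in the following sketch. The thresholds (120/200 words, 100,000/1,000,000 samples) follow this embodiment, while the function name and the use of the mean word count for the length category are assumptions.

```python
def analyze_labeled_set(labeled_texts):
    """Compute the three statistics used for model selection.

    `labeled_texts` is assumed to be a list of (text, label) pairs.
    Returns (length_class, size_class, task_type).
    """
    # Length category: short/medium/long by average word count.
    lengths = [len(text.split()) for text, _ in labeled_texts]
    avg_len = sum(lengths) / len(lengths)
    if avg_len < 120:
        length_class = "short"
    elif avg_len <= 200:
        length_class = "medium"
    else:
        length_class = "long"

    # Sample-count category: small/medium/large.
    n = len(labeled_texts)
    if n < 100_000:
        size_class = "small"
    elif n <= 1_000_000:
        size_class = "medium"
    else:
        size_class = "large"

    # Label-category count: binary vs. multi-class.
    n_labels = len({label for _, label in labeled_texts})
    task_type = "binary" if n_labels == 2 else "multiclass"
    return length_class, size_class, task_type
```

A tiny two-text, two-label set would be classified as ("short", "small", "binary") under these thresholds.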
In step 111, a model algorithm is determined based on the analysis results of step 109. Specifically, a model algorithm for text classification is determined based on the information obtained in step 109.
In one embodiment, the step of determining a model algorithm for text classification based on the information obtained in step 109 may comprise: when the information indicates that the labeled texts are short, that the number of labeled texts is small, and/or that the number of label categories is small, determining a model algorithm not based on a neural network as the model algorithm for text classification; and when the information indicates that the labeled texts are long, that the number of labeled texts is large, and/or that the number of label categories is large, determining a neural-network-based model algorithm as the model algorithm for text classification.
In another embodiment, a model algorithm for text classification may be determined according to table 1 below.
TABLE 1
Two-class           Short (<120)   Medium (120-200)   Long (>200)
Small (<10w)        NB             NB                 NB
Medium (10w-100w)   TextCNN        TextCNN            TextCNN
Large (>100w)       FastText       FastText           FastText

Multi-class         Short (<120)   Medium (120-200)   Long (>200)
Small (<10w)        FastText       FastText           FastText
Medium (10w-100w)   TextCNN        TextCNN            TextCNN
Large (>100w)       FastText       FastText           FastText
In Table 1, "short (<120)", "medium (120-200)", and "long (>200)" indicate information about the length of the labeled texts; "small (<10w)", "medium (10w-100w)", and "large (>100w)" indicate information about the number of labeled texts (where 10w denotes 100,000 and 100w denotes 1,000,000); and "two-class" and "multi-class" indicate information about the number of label categories. "NB" in Table 1 denotes a multinomial naive Bayes classification algorithm (MultinomialNB), "TextCNN" denotes a typical algorithm that classifies texts using a convolutional neural network, and "FastText" denotes an algorithm that classifies texts using a shallow network.
Referring to Table 1, as a preferred embodiment of the inventive concept, in the case where the information obtained in step 109 indicates that the number of label categories is two: when the number of labeled texts is less than 100,000, MultinomialNB is determined as the model algorithm for text classification; when the number of labeled texts is between 100,000 and 1,000,000, TextCNN is determined as the model algorithm; and when the number of labeled texts is greater than 1,000,000, FastText is determined as the model algorithm. In the case where the information obtained in step 109 indicates multi-class classification: when the number of labeled texts is less than 100,000 or greater than 1,000,000, FastText is determined as the model algorithm; when the number of labeled texts is between 100,000 and 1,000,000, TextCNN is determined as the model algorithm.
However, the above is only one preferred embodiment, and the inventive concept is not limited thereto. For example, in the binary-classification case, when the number of labeled texts is less than 100,000, another form of naive Bayes algorithm or a logistic regression algorithm may be used instead of MultinomialNB.
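The selection rule of Table 1 can be written directly as a small lookup. Note that in this embodiment the text-length column never changes the chosen algorithm, so the sketch below (function name and string labels are illustrative, not from the patent) keys only on the sample-count category and the task type.

```python
def select_model_algorithm(size_class, task_type):
    """Model-selection rule from Table 1.

    size_class: "small" (<100k), "medium" (100k-1M), or "large" (>1M)
    task_type:  "binary" or "multiclass"
    """
    if size_class == "medium":
        return "TextCNN"        # middle row of both halves of Table 1
    if task_type == "binary" and size_class == "small":
        return "MultinomialNB"  # binary classification with few samples
    return "FastText"           # all remaining cells of Table 1
```

This reproduces every cell of Table 1: for example, a binary task with fewer than 100,000 labeled texts selects MultinomialNB, while a multi-class task of the same size selects FastText.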
According to the embodiment, the characteristics of the labeled text set can be analyzed, and the model algorithm matched with the characteristics of the text set can be provided by combining knowledge, experience and/or experiment.
With continued reference to FIG. 1, in step 113, the classification model is trained using the set of annotated text according to the determined model algorithm.
For the model algorithm determined in step 111, the method according to the present inventive concept may also automatically provide initial hyper-parameters and/or an automatic parameter-tuning scheme corresponding to the determined model algorithm, enabling a non-professional unfamiliar with machine learning techniques to complete model training easily while ensuring the classification effect of the model for text classification.
In detail, in one embodiment, step 113 may comprise: training the classification model with the labeled text set according to the model algorithm determined in step 111 and preset initial hyper-parameters corresponding to that algorithm. In another embodiment, step 113 may comprise: training the classification model with the labeled text set through a corresponding automatic parameter-tuning scheme, according to the determined model algorithm and the preset initial hyper-parameters corresponding to it.
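A minimal sketch of the preset-hyper-parameter and automatic parameter-tuning idea follows. The concrete default values, the candidate overrides, and all names here are invented for illustration and do not come from the patent; the caller supplies a `train_fn` that trains a model with given parameters and returns a validation score.

```python
# Hypothetical preset initial hyper-parameters per algorithm.
DEFAULT_HYPERPARAMS = {
    "MultinomialNB": {"alpha": 1.0},
    "TextCNN": {"embedding_dim": 128, "kernel_sizes": (3, 4, 5), "lr": 1e-3},
    "FastText": {"embedding_dim": 100, "ngrams": 2, "lr": 0.1},
}

def train_with_auto_tuning(algorithm, candidate_overrides, train_fn):
    """Start from the preset initial hyper-parameters, then try each
    candidate override and keep the best-scoring configuration.

    train_fn(params) -> validation score (higher is better).
    """
    base = DEFAULT_HYPERPARAMS[algorithm]
    best_params, best_score = base, train_fn(base)
    for override in candidate_overrides:
        params = {**base, **override}   # override only the listed keys
        score = train_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score
```

In practice the overrides would come from a grid or random search; the loop above only illustrates the "preset initial values plus automatic adjustment" pattern the description refers to.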
According to the exemplary embodiment of the present invention, not only can a model algorithm automatically adapted to the labeled text set be provided, but initial hyper-parameters matched to that algorithm can also be set automatically based on knowledge, experience, and/or experiment, and an effective automatic parameter-tuning scheme can even be packaged in.
With continued reference to FIG. 1, in step 107, when it is determined in step 105 that the user has not selected the default model algorithm, training is performed according to the particular candidate model algorithm. Step 107 represents the flow intended for professionals skilled in machine learning. In this case, a user with machine learning expertise can autonomously select the model algorithm he or she considers suitable, through the option corresponding to the at least one candidate model algorithm, according to his or her own knowledge or experience.
In step 107, in a case where the user selects a specific candidate model algorithm among the at least one candidate model algorithm through the option corresponding to the at least one candidate model algorithm, the classification model is trained with the labeled text set according to the selected specific candidate model algorithm.
According to an embodiment of the inventive concept, when a specific candidate model algorithm is selected, configuration of the hyper-parameters of that algorithm may be allowed. Specifically, in the event that the user selects a particular candidate model algorithm through the option corresponding to the at least one candidate model algorithm, the user is provided with a control for configuring the hyper-parameters of the particular candidate model algorithm.
Through this control, the user can autonomously configure various parameters and/or automatic parameter-tuning schemes for the particular candidate model algorithm.
According to an embodiment of the inventive concept, step 107 may comprise: training the classification model with the labeled text set according to the specific model algorithm and preset initial hyper-parameters corresponding to it. According to another embodiment of the inventive concept, step 107 may comprise: training the classification model with the labeled text set through a corresponding automatic parameter-tuning scheme, according to the specific model algorithm and the preset initial hyper-parameters corresponding to it. That is, even when the user selects a specific model algorithm, the system can automatically provide and/or automatically tune the initial hyper-parameters matched to that algorithm.
Steps 101 to 113 described above with reference to fig. 1 are merely exemplary; the inventive concept is not limited thereto, and a method according to an exemplary embodiment may include more or fewer steps. For example, such a method may further include classifying unlabeled text using the trained classification model.
Fig. 2 is a block diagram illustrating a system 200 that performs a machine learning process for text classification, according to an example embodiment.
Referring to fig. 2, a system 200 for performing a machine learning process for text classification may include an acquisition unit 201, an interaction unit 203, and a training unit 205.
The acquisition unit 201 may acquire the labeled text set. That is, the acquisition unit 201 may perform the operation of step 101 as described in fig. 1, and a repeated description thereof is omitted here for brevity.
The interaction unit 203 may provide the user with an option corresponding to the default model algorithm and an option corresponding to the at least one candidate model algorithm. That is, the interaction unit 203 may perform the operations of steps 103 and/or 105 as described in fig. 1, and a repeated description thereof is omitted herein for simplicity.
In a case where the user selects a specific candidate model algorithm from the at least one candidate model algorithm through the corresponding option, the training unit 205 may train the classification model on the labeled text set according to that algorithm. Alternatively, in a case where the user selects the default model algorithm through the corresponding option, the training unit 205 may analyze the labeled texts in the labeled text set to obtain information about at least one of the length of the labeled texts, the number of labeled texts, and the number of label categories; determine a model algorithm for text classification based on the information; and train the classification model on the labeled text set according to the determined model algorithm.
That is, the training unit 205 may perform the operations of step 107 to step 113 as described in fig. 1, and a repeated description thereof is omitted herein for simplicity.
In addition, the system 200 for performing a machine learning process for text classification may further include a classification unit 207 for classifying unlabeled text using a trained classification model.
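As a sketch of what such a classification unit might do, the following trains a multinomial naive Bayes model (one of the candidate algorithms named in the claims) from (tokens, label) pairs and then classifies unlabeled token lists. The function names and the tokenized input format are assumptions for illustration, not from the patent:

```python
import math
from collections import Counter, defaultdict

def train_nb(labeled_texts, alpha=1.0):
    """Train a multinomial naive Bayes text classifier from
    (token_list, label) pairs; returns a model dict."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in labeled_texts:
        class_counts[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return {"class_counts": class_counts, "word_counts": word_counts,
            "vocab": vocab, "total": sum(class_counts.values()),
            "alpha": alpha}

def classify(model, tokens):
    """Classify an unlabeled token list with the trained model,
    using Laplace-smoothed log-probabilities."""
    best_label, best_lp = None, float("-inf")
    v, a = len(model["vocab"]), model["alpha"]
    for label, count in model["class_counts"].items():
        lp = math.log(count / model["total"])          # class prior
        denom = sum(model["word_counts"][label].values()) + a * v
        for tok in tokens:
            lp += math.log((model["word_counts"][label][tok] + a) / denom)
        if lp > best_lp:
            best_label, best_lp = label, lp
    return best_label
```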
Fig. 3A and 3B are schematic diagrams illustrating an operational process of a system performing a machine learning process for text classification according to an example embodiment.
Fig. 3A illustrates an operation in a case where the user selects the default model algorithm through an option corresponding to the default model algorithm.
A system for performing a machine learning process for text classification according to the inventive concept may enable a user to create models in the form of modular text classification operators.
For example, the user may upload the labeled text set (and further divide it into training data and validation data) by dragging and configuring the "training data" and "validation data" modules.
Thereafter, the user may drag the "text classification" module and connect it to the training data and validation data, respectively. By way of example, the user may further configure the text classification training operator on the right side of the screen, for example by naming the operator (e.g., "microblog sentiment classification") and selecting the text field and the corresponding label column (also referred to as a label field, label, etc.) to be used for classification.
Specifically, through the "scheme selection" control of figs. 3A and 3B, an option corresponding to the default model algorithm and an option corresponding to at least one candidate model algorithm may be provided.
When the default scheme is selected in the "scheme selection" of fig. 3A (i.e., the user selects the default model algorithm), the system automatically analyzes the labeled text, determines the model algorithm for text classification based on the analysis result, and then completes the training of the classification model according to the determined model algorithm.
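That default-scheme decision can be sketched as a simple rule function. The thresholds follow those stated in the claims: for binary classification, multinomial naive Bayes below 100,000 labeled texts, a convolutional network between 100,000 and 1,000,000, and a shallow network above 1,000,000; for multi-class, a shallow network outside that middle range and a convolutional network inside it. The function and algorithm names, and the treatment of the exact boundary values, are illustrative assumptions:

```python
def select_model_algorithm(num_labeled_texts, num_label_categories):
    """Model-algorithm selection rule for the default scheme,
    following the thresholds stated in the claims; boundary
    handling at exactly 100,000 / 1,000,000 is assumed here."""
    low, high = 100_000, 1_000_000
    if num_label_categories == 2:              # binary classification
        if num_labeled_texts < low:
            return "multinomial_naive_bayes"
        if num_labeled_texts <= high:
            return "textcnn"                   # CNN-based text classifier
        return "shallow_network"               # e.g. a fastText-style model
    # multi-classification with more than two label categories
    if low <= num_labeled_texts <= high:
        return "textcnn"
    return "shallow_network"
```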
According to the operation shown in fig. 3A, the system can complete model training efficiently with the user supplying only the input data and the desired classification label, without needing to know any details of the models or algorithms. This greatly reduces the difficulty of modeling, so that classification model training can be carried out easily.
In fig. 3A, when the "scheme selection" option is the default scheme, the system according to the inventive concept does not provide or display to the user any setting information about hyper-parameters or tuning schemes. However, according to another embodiment, the system may also provide or display to the user the name of the determined model algorithm and the corresponding default initial hyper-parameters or information about the automatic tuning scheme, thereby helping the user learn more about the technical details of the model training.
Fig. 3B illustrates an operation in a case where a user selects a specific candidate model algorithm among at least one candidate model algorithm through an option corresponding to the at least one candidate model algorithm.
In contrast to fig. 3A, the "scheme selection" option of fig. 3B is TextCNN, selected autonomously by the user; that is, the specific candidate model algorithm selected by the user is TextCNN. In this case, the system may display to the user controls for configuring the hyper-parameters of TextCNN (e.g., in fig. 3B, the number of training epochs, learning rate, batch size, activation function, convolution kernel size and number, pooling method, and dropout rate). The user can configure these hyper-parameters autonomously through the controls. According to the operation shown in fig. 3B, the system also enables a professional with machine learning expertise to flexibly control the training process, ensuring the classification effect of the model for text classification.
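The controls listed for fig. 3B could be backed by a simple configuration object such as the sketch below; the field names mirror the controls described above, while the default values are illustrative placeholders, not values disclosed in the patent:

```python
from dataclasses import dataclass

@dataclass
class TextCNNConfig:
    """Hyper-parameters exposed by the fig. 3B configuration
    controls; defaults are illustrative placeholders."""
    epochs: int = 10                   # number of training rounds
    learning_rate: float = 1e-3
    batch_size: int = 64
    activation: str = "relu"
    kernel_sizes: tuple = (3, 4, 5)    # convolution kernel sizes
    num_kernels: int = 128             # kernels per size
    pooling: str = "max"
    dropout: float = 0.5               # "deactivation rate"

def validate(cfg: TextCNNConfig) -> None:
    """Basic sanity checks a UI layer might run before training."""
    assert cfg.epochs > 0 and cfg.batch_size > 0
    assert 0.0 < cfg.learning_rate < 1.0
    assert 0.0 <= cfg.dropout < 1.0
```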
Figs. 3A and 3B show only exemplary operations of a system for performing a machine learning process for text classification according to the inventive concept, and those skilled in the art may make various modifications on the basis of the exemplary embodiments. For example, in the example of fig. 3B, even when the user selects a specific candidate algorithm, model training may be performed according to the system's default initial hyper-parameters and/or automatic hyper-parameter tuning scheme, which may be shown to the user or hidden. In addition, in the example of fig. 3A, when the system automatically determines the model algorithm, the user may also be provided with options for setting the initial hyper-parameters and/or the tuning manner, so that the user can perform the relevant settings autonomously.
Fig. 4 is a flowchart illustrating a method of performing a machine learning process for text classification according to another example embodiment.
Steps 401, 403, 405 and 407 in fig. 4 correspond to steps 101, 109, 111 and 113 of fig. 1, respectively, and their repeated descriptions are omitted here for the sake of brevity.
In addition, the method of FIG. 4 may further include classifying the unlabeled text using the trained classification model.
Fig. 5 is a block diagram illustrating a system 500 that performs a machine learning process for text classification according to another example embodiment.
Referring to fig. 5, a system 500 for performing a machine learning process for text classification may include an acquisition unit 501 and a training unit 503.
According to the inventive concept, the acquisition unit 501 may perform the operations of step 401 shown in fig. 4 and step 101 shown in fig. 1, and the training unit 503 may perform the operations of steps 403 to 407 shown in fig. 4 and steps 109 to 113 shown in fig. 1; their repeated descriptions are omitted here for brevity.
That is, in the exemplary embodiments of figs. 4 and 5, the step of automatically determining the model algorithm is performed directly, and the user does not actively select a model algorithm. Details overlapping with the related descriptions of figs. 1 to 4 are not repeated.
Further, the system 500 for performing a machine learning process for text classification may further comprise a classification unit 505 for classifying unlabeled text using a trained classification model.
The method of performing a machine learning process for text classification and the system thereof according to the inventive concept can automatically select an appropriate model for a text classification problem. For a given labeled data set, a suitable model and/or related parameters can be recommended automatically according to the characteristics of the labeled text set, so that a layman without machine learning expertise can easily complete model training and/or classification prediction. In addition, other candidate model algorithms and their hyper-parameter configurations can be provided, so that a professional with machine learning expertise can flexibly control the training process, ensuring the classification effect of the model for text classification.
According to exemplary embodiments of the inventive concept, the various steps of the methods depicted in figs. 1 and 4, as well as the various units depicted in figs. 2, 3A, 3B, and 5 and their operations, may be written as programs or software. The programs or software may be written in any programming language based on the block diagrams and flowcharts illustrated in the figures and the corresponding description in the specification. In one example, the program or software includes machine code that is directly executed by one or more processors or computers, such as machine code produced by a compiler. In another example, the program or software includes higher-level code that is executed by one or more processors or computers using an interpreter. The program or software may be recorded, stored, or fixed in one or more non-transitory computer-readable storage media. In one example, the program or software, or the one or more non-transitory computer-readable storage media, may be distributed over computer systems.
According to example embodiments of the inventive concept, the various steps of the methods depicted in fig. 1 and 4, as well as the various elements and operations thereof depicted in fig. 2, 3A, 3B, and 5, may be implemented on a computing device that includes a processor and a memory. The memory stores program instructions for controlling the processor to implement the operations of the various units described above.
Although specific example embodiments of the present invention have been described in detail above with reference to fig. 1 to 5, the present invention may be modified in various forms without departing from the spirit and scope of the inventive concept. Suitable results may be achieved if the described techniques are performed in a different order and/or if components in the described systems, architectures, or devices are combined in a different manner and/or replaced or supplemented by other components or their equivalents. Therefore, the scope of the present disclosure is defined not by the detailed description but by the claims and their equivalents, and all changes within the scope of the claims and their equivalents are to be construed as being included in the present disclosure.

Claims (24)

1. A method of performing a machine learning process for text classification, comprising:
acquiring a labeled text set;
providing a user with an option corresponding to a default model algorithm and an option corresponding to at least one candidate model algorithm;
in a case where the user selects a specific candidate model algorithm from the at least one candidate model algorithm through the option corresponding to the at least one candidate model algorithm, training a classification model using the labeled text set according to the specific candidate model algorithm; and
in a case where the user selects the default model algorithm through the option corresponding to the default model algorithm, analyzing the labeled texts in the labeled text set to obtain information about at least one of the length of the labeled texts, the number of labeled texts, and the number of label categories; determining a model algorithm for text classification based on the information; and training a classification model using the labeled text set according to the determined model algorithm.
2. The method of claim 1, further comprising: classifying unlabeled text using the trained classification model.
3. The method of claim 1, wherein the step of determining a model algorithm for text classification based on the information comprises:
when the information indicates that the labeled texts are short, the number of labeled texts is small, and/or the number of label categories is small, determining a model algorithm not based on a neural network as the model algorithm for text classification; and
when the information indicates that the labeled texts are long, the number of labeled texts is large, and/or the number of label categories is large, determining a model algorithm based on a neural network as the model algorithm for text classification.
4. The method of claim 1, wherein the step of determining a model algorithm for text classification based on the information comprises:
in a case where the information indicates that the number of label categories is two (binary classification):
when the information indicates that the number of labeled texts is less than 100,000, determining a multinomial naive Bayes classification algorithm as the model algorithm for text classification;
when the information indicates that the number of labeled texts is between 100,000 and 1,000,000, determining an algorithm that classifies text using a convolutional neural network as the model algorithm for text classification; and
when the information indicates that the number of labeled texts is more than 1,000,000, determining an algorithm that classifies text using a shallow network as the model algorithm for text classification.
5. The method of claim 1, wherein the step of determining a model algorithm for text classification based on the information comprises:
in a case where the information indicates multi-classification with a number of label categories greater than two:
when the information indicates that the number of labeled texts is less than 100,000 or more than 1,000,000, determining an algorithm that classifies text using a shallow network as the model algorithm for text classification; and
when the information indicates that the number of labeled texts is between 100,000 and 1,000,000, determining an algorithm that classifies text using a convolutional neural network as the model algorithm for text classification.
6. The method of claim 1, wherein training a classification model using the labeled text set according to the determined model algorithm comprises: training the classification model using the labeled text set according to the determined model algorithm and preset initial hyper-parameters corresponding to the determined model algorithm.
7. The method of claim 1, wherein training a classification model using the labeled text set according to the determined model algorithm comprises: training the classification model using the labeled text set according to the determined model algorithm and preset initial hyper-parameters corresponding to the determined model algorithm, through a corresponding automatic hyper-parameter tuning scheme.
8. The method of claim 1, further comprising:
in a case where the user selects a specific candidate model algorithm from the at least one candidate model algorithm through the option corresponding to the at least one candidate model algorithm, providing the user with a control for configuring a hyper-parameter of the specific candidate model algorithm.
9. The method of claim 1, wherein training a classification model using the labeled text set according to the specific candidate model algorithm comprises: training the classification model using the labeled text set according to the specific candidate model algorithm and a preset initial hyper-parameter corresponding to the specific candidate model algorithm.
10. The method of claim 1, wherein training a classification model using the labeled text set according to the specific candidate model algorithm comprises: training the classification model using the labeled text set according to the specific candidate model algorithm and a preset initial hyper-parameter corresponding to the specific candidate model algorithm, through a corresponding automatic hyper-parameter tuning scheme.
11. A system for performing a machine learning process for text classification, comprising:
an acquisition unit configured to acquire a labeled text set;
an interaction unit configured to provide a user with an option corresponding to a default model algorithm and an option corresponding to at least one candidate model algorithm; and
a training unit configured to: in a case where the user selects a specific candidate model algorithm from the at least one candidate model algorithm through the option corresponding to the at least one candidate model algorithm, train a classification model using the labeled text set according to the specific candidate model algorithm; or, in a case where the user selects the default model algorithm through the option corresponding to the default model algorithm, analyze the labeled texts in the labeled text set to obtain information about at least one of the length of the labeled texts, the number of labeled texts, and the number of label categories, determine a model algorithm for text classification based on the information, and train a classification model using the labeled text set according to the determined model algorithm.
12. The system of claim 11, further comprising a classification unit to classify the unlabeled text using the trained classification model.
13. The system of claim 11, wherein:
when the information indicates that the labeled texts are short, the number of labeled texts is small, and/or the number of label categories is small, the training unit determines a model algorithm not based on a neural network as the model algorithm for text classification; and
when the information indicates that the labeled texts are long, the number of labeled texts is large, and/or the number of label categories is large, the training unit determines a model algorithm based on a neural network as the model algorithm for text classification.
14. The system of claim 11, wherein,
in a case where the information indicates that the number of label categories is two (binary classification):
when the information indicates that the number of labeled texts is less than 100,000, the training unit determines a multinomial naive Bayes classification algorithm as the model algorithm for text classification;
when the information indicates that the number of labeled texts is between 100,000 and 1,000,000, the training unit determines an algorithm that classifies text using a convolutional neural network as the model algorithm for text classification; and
when the information indicates that the number of labeled texts is more than 1,000,000, the training unit determines an algorithm that classifies text using a shallow network as the model algorithm for text classification.
15. The system of claim 11, wherein,
in a case where the information indicates multi-classification with a number of label categories greater than two:
when the information indicates that the number of labeled texts is less than 100,000 or more than 1,000,000, the training unit determines an algorithm that classifies text using a shallow network as the model algorithm for text classification; and
when the information indicates that the number of labeled texts is between 100,000 and 1,000,000, the training unit determines an algorithm that classifies text using a convolutional neural network as the model algorithm for text classification.
16. The system of claim 11, wherein the training unit trains a classification model using the labeled text set according to the determined model algorithm and preset initial hyper-parameters corresponding to the determined model algorithm.
17. The system of claim 11, wherein the training unit trains the classification model using the labeled text set according to the determined model algorithm and preset initial hyper-parameters corresponding to the determined model algorithm, through a corresponding automatic hyper-parameter tuning scheme.
18. The system of claim 11, wherein the interaction unit further provides a user with controls for configuring hyper-parameters of the particular candidate model algorithm.
19. The system of claim 11, wherein the training unit trains a classification model using the set of labeled text according to the particular candidate model algorithm and a preset initial hyper-parameter corresponding to the particular candidate model algorithm.
20. The system of claim 11, wherein the training unit trains the classification model using the labeled text set according to the specific candidate model algorithm and a preset initial hyper-parameter corresponding to the specific candidate model algorithm, through a corresponding automatic hyper-parameter tuning scheme.
21. A method of performing a machine learning process for text classification, comprising:
acquiring a labeled text set;
analyzing the labeled texts in the labeled text set to obtain information about at least one of the length of the labeled texts, the number of labeled texts, and the number of label categories;
determining a model algorithm for text classification based on the information; and
training a classification model using the labeled text set according to the determined model algorithm.
22. A system for performing a machine learning process for text classification, comprising:
an acquisition unit configured to acquire a labeled text set; and
a training unit configured to, in a case where a user selects a default model algorithm through an option corresponding to the default model algorithm, analyze the labeled texts in the labeled text set to obtain information about at least one of the length of the labeled texts, the number of labeled texts, and the number of label categories; determine a model algorithm for text classification based on the information; and train a classification model using the labeled text set according to the determined model algorithm.
23. A computer-readable storage medium storing program instructions that, when executed by a processor, cause the processor to perform the method of any of claims 1 to 10 and claim 21.
24. A computing device, comprising:
a processor;
a memory storing program instructions that, when executed by the processor, cause the processor to perform the method of any of claims 1 to 10 and claim 21.
CN201810684574.3A 2018-06-28 2018-06-28 Method of performing machine learning process for text classification and system thereof Active CN108875045B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810684574.3A CN108875045B (en) 2018-06-28 2018-06-28 Method of performing machine learning process for text classification and system thereof


Publications (2)

Publication Number Publication Date
CN108875045A CN108875045A (en) 2018-11-23
CN108875045B true CN108875045B (en) 2021-06-04

Family

ID=64295435

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810684574.3A Active CN108875045B (en) 2018-06-28 2018-06-28 Method of performing machine learning process for text classification and system thereof

Country Status (1)

Country Link
CN (1) CN108875045B (en)

Families Citing this family (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111222833A (en) * 2018-11-27 2020-06-02 中云开源数据技术(上海)有限公司 Algorithm configuration combination platform based on data lake server
CN109948804B (en) * 2019-03-15 2021-11-02 北京清瞳时代科技有限公司 Cross-platform dragging type deep learning modeling and training method and device
CN110084374A (en) * 2019-04-24 2019-08-02 第四范式(北京)技术有限公司 Construct method, apparatus and prediction technique, device based on the PU model learnt
CN110134767B (en) * 2019-05-10 2021-07-23 云知声(上海)智能科技有限公司 Screening method of vocabulary
CN110287481B (en) * 2019-05-29 2022-06-14 西南电子技术研究所(中国电子科技集团公司第十研究所) Named entity corpus labeling training system
CN112529146B (en) * 2019-09-18 2023-10-17 华为技术有限公司 Neural network model training method and device
CN112685558B (en) * 2019-10-18 2024-05-17 普天信息技术有限公司 Training method and device for emotion classification model
CN111291823B (en) * 2020-02-24 2023-08-18 腾讯科技(深圳)有限公司 Fusion method and device of classification model, electronic equipment and storage medium
CN111611240A (en) * 2020-04-17 2020-09-01 第四范式(北京)技术有限公司 Method, apparatus and device for executing automatic machine learning process
CN114443831A (en) * 2020-10-30 2022-05-06 第四范式(北京)技术有限公司 Text classification method and device applying machine learning and electronic equipment

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP5354208B2 (en) * 2008-02-20 2013-11-27 日本電気株式会社 Default value setting system and default value setting method
CN107688583A (en) * 2016-08-05 2018-02-13 株式会社Ntt都科摩 The method and apparatus for creating the training data for natural language processing device

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101446903B (en) * 2008-12-19 2011-06-08 北京大学 Automatic component classification method
CN106610970A (en) * 2015-10-21 2017-05-03 上海文广互动电视有限公司 Collaborative filtering-based content recommendation system and method
CN105426913A (en) * 2015-11-16 2016-03-23 南京工程学院 Method and device for constructing Gaussian process multi-classifier
CN106202274B (en) * 2016-06-30 2019-10-15 云南电网有限责任公司电力科学研究院 A kind of defective data automatic abstract classification method based on Bayesian network


Also Published As

Publication number Publication date
CN108875045A (en) 2018-11-23

Similar Documents

Publication Publication Date Title
CN108875045B (en) Method of performing machine learning process for text classification and system thereof
CN109063163B (en) Music recommendation method, device, terminal equipment and medium
CN111373417B (en) Apparatus and method relating to data classification based on metric learning
US10380236B1 (en) Machine learning system for annotating unstructured text
CN111160569A (en) Application development method and device based on machine learning model and electronic equipment
CN107507016A (en) A kind of information push method and system
CN107527070B (en) Identification method of dimension data and index data, storage medium and server
CN111444334A (en) Data processing method, text recognition device and computer equipment
KR102060719B1 (en) System and method for face detection and emotion recognition based deep-learning
EP3886037A1 (en) Image processing apparatus and method for style transformation
US20220318574A1 (en) Negative sampling algorithm for enhanced image classification
US11205130B2 (en) Mental modeling method and system
CN105243055A (en) Multi-language based word segmentation method and apparatus
US11605002B2 (en) Program, information processing method, and information processing apparatus
CN111625715B (en) Information extraction method and device, electronic equipment and storage medium
US20230162407A1 (en) High resolution conditional face generation
IL256480A (en) System and method for use in training machine learning utilities
Amir et al. Automatic detection of learning styles in learning management system by using literature-based method and support vector machine
CN111078881B (en) Fine-grained sentiment analysis method and system, electronic equipment and storage medium
CN110909768B (en) Method and device for acquiring marked data
JP7287699B2 (en) Information provision method and device using learning model through machine learning
Zhang et al. Simulation-based optimization of user interfaces for quality-assuring machine learning model predictions
CN110414515B (en) Chinese character image recognition method, device and storage medium based on information fusion processing
CN114004358A (en) Deep learning model training method
JP2012174083A (en) Program and information processing system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant