WO2022089613A1 - Text classification method, apparatus, and electronic device applying machine learning

Publication number
WO2022089613A1
WO2022089613A1 PCT/CN2021/127675 CN2021127675W WO2022089613A1 WO 2022089613 A1 WO2022089613 A1 WO 2022089613A1 CN 2021127675 W CN2021127675 W CN 2021127675W WO 2022089613 A1 WO2022089613 A1 WO 2022089613A1
Authority
WO
WIPO (PCT)
Prior art keywords
text classification
model
data
online
text
Prior art date
Application number
PCT/CN2021/127675
Other languages
English (en)
French (fr)
Inventor
陶冶
陈伟
Original Assignee
第四范式(北京)技术有限公司
Application filed by 第四范式(北京)技术有限公司
Publication of WO2022089613A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning

Definitions

  • the embodiments of the present disclosure relate to the technical field of machine learning, and in particular, to a text classification method, apparatus, electronic device, and non-transitory computer-readable storage medium using machine learning.
  • Natural language processing (NLP) text classification is used more and more widely across industries, but it remains a field that requires strong professional skills.
  • NLP text classification models are mainly implemented by professional modelers, either by manually writing code or by using graphical interfaces (such as a directed acyclic graph, DAG).
  • Both implementation methods require high labor and time costs to obtain a satisfactory model, which hinders large-scale exploration and application of such models.
  • Beyond the software cost, for many small scenarios the purchase of a server graphics processing unit (GPU) is also a considerable expense. Therefore, it is necessary to provide a text classification scheme applying machine learning.
  • An object of the embodiments of the present disclosure is to provide a new solution for text classification applying machine learning.
  • an embodiment of the present disclosure proposes a text classification method using machine learning, including:
  • in response to a text classification application creation instruction of a text classification task, obtaining text classification application configuration information and text annotation data, wherein the text annotation data includes a text column and a label column;
  • creating a text classification application based on the text classification application configuration information, wherein the text classification application is a first service program instance configured to perform model solution exploration based on the text classification application configuration information and the text annotation data to obtain a model solution;
  • in response to an instruction to start the text classification application, deploying the text classification application online and generating a text classification service address, so that the text classification application provides an online estimation service for the text classification task based on the text classification service address, wherein the online estimation service is performed based on online related data of the text classification task.
  • each line of text in the text column corresponds to one or more tags.
  • the text classification application is further configured to perform model self-learning based on the online related data and the model solution to obtain an online text classification model.
  • the method further includes:
  • providing a backflow data labeling interface to obtain backflow labeling data, and flowing the backflow labeling data back into the first database, wherein the backflow labeling data is obtained by labeling the data that has flowed back into the first database;
  • the text classification application performs model self-learning based on the online related data, the labeled intermediate data, the backflow labeling data, and the model solution, and obtains an online text classification model.
  • the method further includes: deploying the online text classification model online to provide a batch estimation service for the text classification task.
  • the batch estimation service includes: providing a batch estimation service interface, wherein the online text classification model obtains, through the batch estimation service interface, the data set to be batch estimated for the text classification task, and outputs a batch estimation result based on that data set.
  • the data set to be estimated in batches includes a plurality of data columns, and the plurality of data columns includes a text column;
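  • outputting the batch estimation result based on the data set to be batch estimated includes:
  • performing batch estimation based on the text column in the data set to be batch estimated to obtain an estimated label column, and splicing the estimated label column with the data set to be batch estimated to obtain and output the batch estimation result.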
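The splicing step described above can be sketched as follows. This is a minimal illustration, not the disclosed implementation: the column names `text` and `estimated_label`, the function `batch_estimate`, and the keyword-based `toy_classifier` are all assumptions introduced for the example.

```python
from typing import Callable, Dict, List

Row = Dict[str, str]


def batch_estimate(
    rows: List[Row],
    classify: Callable[[str], str],
    text_column: str = "text",              # assumed column name
    label_column: str = "estimated_label",  # assumed column name
) -> List[Row]:
    """Estimate a label from each row's text column and splice it on as a new column."""
    results = []
    for row in rows:
        spliced = dict(row)  # copy so every original column is kept unchanged
        spliced[label_column] = classify(row[text_column])
        results.append(spliced)
    return results


def toy_classifier(text: str) -> str:
    # Hypothetical stand-in for the deployed online text classification model.
    return "sports" if "match" in text.lower() else "other"


dataset = [
    {"id": "1", "source": "wire", "text": "The match ended in a draw"},
    {"id": "2", "source": "blog", "text": "Quarterly earnings rose"},
]
batch_result = batch_estimate(dataset, toy_classifier)
```

Only the text column is read by the classifier; the remaining columns pass through untouched, so the output can be matched back to the input row by row.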
  • deploying the online text classification model includes: replacing a machine learning model that has already been deployed online with the online text classification model.
  • before responding to the text classification application creation instruction of the text classification task, the method further includes:
  • providing a user interface; receiving, through the user interface, a text classification scenario and a text classification task input by the user; and receiving, through the user interface, a user-triggered text classification application creation instruction, wherein the text classification application creation instruction corresponds to the text classification scenario and the text classification task input by the user.
  • the creating a text classification application based on the text classification application configuration information includes:
  • the second service program instance is configured to perform model solution exploration based on the text classification application configuration information and the text annotation data to obtain a model solution;
  • the third service program instance is configured to perform model self-learning based on the online related data and the model solution to obtain an online text classification model.
  • the text classification application configuration information includes one or more of the following:
  • the text classification application is configured to perform model solution exploration based on the data in the first database, the text classification application configuration information, and the text annotation data to obtain a model solution, wherein the model solution includes the following sub-items: a feature engineering scheme, a model algorithm, and model hyperparameters;
  • the deploying the text classification application online includes: deploying the model solution obtained through exploration online.
  • the online text classification model is obtained by training an offline model, wherein the offline model is a model generated in the process of exploring the model solution, and when the model solution obtained through exploration is deployed online, the offline model is also deployed online.
  • the online text classification model is a model generated based on the model algorithm in the model solution and the hyperparameters of the model; and when the model solution obtained through exploration is deployed online, the offline model is not deployed online.
  • an embodiment of the present disclosure further provides a text classification apparatus applying machine learning, including:
  • a text classification application creation module configured to: in response to a text classification application creation instruction of a text classification task, obtain text classification application configuration information and text annotation data, wherein the text annotation data includes a text column and a label column; and create a text classification application based on the text classification application configuration information, wherein the text classification application is a first service program instance configured to perform model solution exploration based on the text classification application configuration information and the text annotation data to obtain a model solution;
  • a text classification application startup module configured to, in response to an instruction to start the text classification application, deploy the text classification application online, and generate a text classification service address, so that the text classification application is based on the text classification service address , providing an online estimation service for the text classification task; wherein, the online estimation service is performed based on the online related data of the text classification task.
  • each line of text in the text column corresponds to one or more tags.
  • the text classification application is further configured to perform model self-learning based on the online related data and the model solution to obtain an online text classification model.
  • the text classification application is further configured to flow the online related data back into the first database, and to flow the labeled intermediate data generated by the model solution based on the online related data back into the first database;
  • the machine learning-applied text classification device further includes a labeling interface module configured to provide a backflow data labeling interface to obtain backflow labeling data, and return the backflow labeling data to the first database again , wherein the backflow labeling data is obtained by labeling the data backflow into the first database;
  • the text classification application performs model self-learning based on the online related data, the labeled intermediate data, the backflow labeling data, and the model solution, and obtains an online text classification model.
  • the text classification application startup module is further configured to deploy the online text classification model online, so as to provide a batch estimation service for the text classification task.
  • the text classification application startup module is further configured to provide a batch estimation service interface, wherein the online text classification model obtains, through the batch estimation service interface, the data set to be batch estimated for the text classification task, and outputs a batch estimation result based on that data set.
  • the data set to be estimated in batches includes a plurality of data columns, and the plurality of data columns includes a text column;
  • the online text classification model outputting the batch estimation result based on the data set to be batch estimated includes: performing batch estimation based on the text column in the data set to obtain an estimated label column, and splicing the estimated label column with the data set to obtain and output the batch estimation result.
  • deploying the online text classification model by the text classification application startup module includes: replacing a machine learning model that has already been deployed online with the online text classification model.
  • the text classification application creation module is further configured to:
  • before responding to the text classification application creation instruction of the text classification task, provide a user interface, receive through the user interface a text classification scenario and a text classification task input by a user, and receive through the user interface a user-triggered text classification application creation instruction, wherein the text classification application creation instruction corresponds to the text classification scenario and the text classification task input by the user.
  • creating, by the text classification application creation module, a text classification application based on the text classification application configuration information includes:
  • the second service program instance is configured to perform model solution exploration based on the text classification application configuration information and the text annotation data to obtain a model solution;
  • the third service program instance is configured to perform model self-learning based on the online related data and the model solution to obtain an online text classification model.
  • the text classification application configuration information includes one or more of the following:
  • the text classification application is configured to perform model solution exploration based on the data in the first database, the text classification application configuration information, and the text annotation data to obtain a model solution, wherein the model solution includes the following sub-items: a feature engineering scheme, a model algorithm, and model hyperparameters;
  • deploying the text classification application online by the text classification application startup module includes: deploying online the model solution obtained through exploration.
  • the online text classification model is obtained by training an offline model, wherein the offline model is a model generated in the process of exploring the model solution, and when the model solution obtained through exploration is deployed online, the offline model is also deployed online.
  • the online text classification model is a model generated based on the model algorithm in the model solution and the hyperparameters of the model; and when the model solution obtained through exploration is deployed online, the offline model is not deployed online.
  • an embodiment of the present disclosure further provides an electronic device, including a processor and a memory, wherein the processor is configured to execute the method steps of any embodiment of the first aspect by invoking a program or instructions stored in the memory.
  • the embodiments of the present disclosure further provide a non-transitory computer-readable storage medium configured to store programs or instructions, the programs or instructions causing a computer to execute the method steps according to any one of the embodiments of the first aspect.
  • a text classification application can be created by specifying text classification scenarios, text classification tasks, and text classification application configuration information, thereby reducing the cost of putting NLP text classification capabilities into production.
  • model self-learning can be performed based on online related data and model solutions to obtain an online text classification model, so as to realize automatic model construction and reduce model construction costs.
  • reusable data for model building is obtained.
  • a batch estimation service for text classification tasks can be provided.
  • based on the model solution obtained from exploration and the intermediate data generated by the batch estimation service, the model can self-learn and be iteratively updated automatically.
  • FIG. 1 is an exemplary application scenario diagram of text classification applying machine learning according to an embodiment of the present disclosure
  • FIG. 2 is an exemplary block diagram of a scene module provided by an embodiment of the present disclosure
  • FIG. 3 is an exemplary block diagram of a text classification apparatus applying machine learning according to an embodiment of the present disclosure
  • FIG. 4 is an exemplary architecture diagram of a text classification application providing an online estimation service or a batch estimation service according to an embodiment of the present disclosure
  • FIG. 5 is an exemplary block diagram of an electronic device according to an embodiment of the present disclosure.
  • FIG. 6 is an exemplary flowchart of a text classification method applying machine learning according to an embodiment of the present disclosure
  • FIG. 7 to FIG. 15 are schematic interface diagrams related to a text classification process applying machine learning according to an embodiment of the present disclosure.
  • The prediction quality of an NLP text classification model degrades over time, so professional modelers are required to re-model and optimize it. In other words, personnel must be invested again at regular intervals, and the more models there are, the higher the personnel cost. How to update the model automatically and iteratively so as to maintain its prediction quality is therefore also an urgent problem.
  • The embodiments of the present disclosure provide a text classification scheme applying machine learning that carries NLP text classification through problem definition, modeling, online model serving, offline batch prediction, and the subsequent collection of feedback and iterative model updates, forming a closed learning loop.
  • By automatically training NLP text classification models and automatically launching NLP text classification applications, the scheme addresses the difficulty and cost of putting NLP text classification capabilities into production, and enables people without NLP-related experience to complete the entire process of deploying a text classification scenario. It supports both CPU and GPU modes, so text classification can be explored in small scenarios without a GPU. It also supports the classification of both Chinese and English text, which further widens the range of scenarios covered.
  • Non-professional modelers or people without NLP-related experience can specify text classification scenarios, text classification tasks, text classification application configuration information, and text annotation data; the text classification scheme applying machine learning can then automatically create a text classification application based on the text classification application configuration information.
  • the text classification application can explore the model scheme based on the text classification application configuration information and the text annotation data, and obtain the model scheme; then the created text classification application can be deployed online and generate a text classification service address, so that the text classification application can provide online prediction service for text classification tasks based on the text classification service address; wherein, the online prediction service is performed based on online related data of the text classification task.
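As a rough illustration of bringing an application online behind a generated service address, the sketch below uses an in-memory registry. It makes no claim about how the disclosed system serves requests: `ServiceRegistry`, the base URL, and the URL layout are hypothetical, and a real deployment would expose the address over a network.

```python
import uuid
from typing import Callable, Dict

Classifier = Callable[[str], str]


class ServiceRegistry:
    """In-memory stand-in for deploying applications behind service addresses."""

    def __init__(self, base_url: str = "http://localhost:8000") -> None:
        # The base URL is an assumption made for this example only.
        self._base_url = base_url
        self._services: Dict[str, Classifier] = {}

    def deploy(self, classifier: Classifier) -> str:
        # Generate a unique service address and bind the classifier to it.
        address = f"{self._base_url}/text-classification/{uuid.uuid4().hex}"
        self._services[address] = classifier
        return address

    def estimate(self, address: str, text: str) -> str:
        # Online estimation: route the request to the application at this address.
        return self._services[address](text)


registry = ServiceRegistry()
address = registry.deploy(lambda text: "sports" if "goal" in text else "other")
label = registry.estimate(address, "a late goal won the game")
```

Each call to `deploy` yields a distinct address, which mirrors the idea that every text classification task gets its own application and its own service endpoint.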
  • the created text classification application can automatically explore the model solution, realize the full automation of the model building process, and further reduce the modeling cost.
  • the created text classification application can also perform model self-learning based on the acquired online related data of the text classification task and the model solution obtained through exploration, to obtain an online text classification model. It can be seen that, in the embodiment of the present disclosure, the created text classification application can perform model self-learning, realize automatic iterative update of the model, and ensure the effect of model estimation. Models and applications can be built with low thresholds without professional modelers and machine learning knowledge reserves.
  • FIG. 1 is an exemplary application scenario diagram of text classification applying machine learning according to an embodiment of the present disclosure.
  • the scene module 11 and the text classification apparatus 12 applying machine learning can be connected with the text classification scene.
  • the text classification scene can be specified by the user, and further, the user can also specify the text classification task under the text classification scene.
  • the text classification device 12 applying machine learning can create a corresponding text classification application for each task: one text classification application per task, and different text classification applications for different tasks.
  • the text classification application is configured to handle corresponding text classification tasks, such as real-time estimation tasks or batch estimation tasks.
  • Real-time estimation performs estimation (that is, text classification) as soon as an estimation request (that is, a text classification request) is received. Batch estimation is non-real-time: estimation is performed in batches, triggered by a timer or an event. For example, multiple estimation requests are estimated together only when a preset batch estimation condition is met, such as the number of accumulated estimation requests reaching a preset number.
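The difference between the two modes can be sketched as follows, assuming the count-based trigger condition. The class `BatchTrigger`, its method names, and the length-based `toy_classifier` are hypothetical; a timer-based trigger would need a scheduler and is not shown.

```python
from typing import Callable, List


class BatchTrigger:
    """Buffer estimation requests and classify them once a preset count is reached."""

    def __init__(self, classify: Callable[[str], str], batch_size: int) -> None:
        if batch_size < 1:
            raise ValueError("batch_size must be at least 1")
        self._classify = classify
        self._batch_size = batch_size
        self._pending: List[str] = []

    def estimate_now(self, text: str) -> str:
        # Real-time path: classify as soon as the request arrives.
        return self._classify(text)

    def submit(self, text: str) -> List[str]:
        # Batch path: returns an empty list until enough requests accumulate,
        # then returns the labels for the whole batch in arrival order.
        self._pending.append(text)
        if len(self._pending) < self._batch_size:
            return []
        batch, self._pending = self._pending, []
        return [self._classify(t) for t in batch]


def toy_classifier(text: str) -> str:
    # Hypothetical stand-in for a trained model.
    return "long" if len(text) > 10 else "short"


trigger = BatchTrigger(toy_classifier, batch_size=3)
first = trigger.submit("hi")
second = trigger.submit("a much longer request")
third = trigger.submit("ok")
```

The first two calls return nothing because the preset count of three has not been reached; the third call flushes the buffer and returns one label per buffered request.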
  • the scene module 11 is configured to implement text classification scene definition.
  • The text classification scenario definition may be provided by a user; for example, a user may define a news classification scenario. Accordingly, the scene module 11 receives the scenario definition information input by the user.
  • the scenario module 11 may provide a user interface through which the user inputs scenario definition information to specify text classification scenarios and text classification tasks.
  • the scenario definition information may include, but is not limited to, one or more of the following: scenario name, task name, task ID, related data definition of the task, and the like.
  • the relevant data definitions for different tasks are different.
  • the relevant data definitions may be data table schema (Schema) definitions.
  • the schema definition includes, but is not limited to, one or more of the following: the name of one or more data tables, the fields included in each data table, and the data relationships among the plurality of data tables.
  • the scene module 11 is also configured for data access. For example, the scene module 11 acquires relevant data of the text classification task based on the text classification task of the text classification scene.
  • relevant data may include, but is not limited to, request data, exposure data, and feedback data.
  • Scenario definition information may include, but is not limited to:
  • Relevant data includes but is not limited to: request data, exposure data, feedback data and business data.
  • The request data is the information sent to the text classification application. For example, suppose 10,000 news items, combined with other information, are sent to the text classification application for classification; these 10,000 items are the request data. After estimation by the text classification application, a reader will not actually read all 10,000 items: the business party or customer selects only the news of interest, perhaps only 100 items, and these 100 items are the exposure data. Finally, the category to which each news item actually viewed belongs is the feedback data.
  • the scene may also contain business data.
  • Business data is other information that may help improve the estimation quality of the text classification application, such as basic customer information and customer remarks, that is, business object (BO) data. A scenario may have no business data, or more than one set of business data.
  • a) Define the schema of each related data flow (the request, exposure, feedback, and business data flows), that is, which fields each data flow contains. The information to configure for each field includes the field name, the field type, and optional field remarks.
  • Behavior data is constructed by an inner join of the request data and the exposure data, and can be used for subsequent model solution exploration and model self-learning.
  • a) Define the time field of the behavior data. Select a time-type field in the behavior data as the main time field; this field should be the actual time at which the behavior occurred.
  • b) Define the feedback field (label) of the feedback data and its type. Select the label field in the feedback data. In a binary classification scenario, the label is 1 or 0, representing positive and negative samples; in a regression scenario, the label is a continuous value representing the actual outcome, such as a PM2.5 reading. After selecting the label field, also specify the label type: binary classification, regression, or multi-class classification.
  • c) Define the splicing fields of the behavior data and the feedback data, that is, which fields of the two are used as the association keys for splicing. Multiple keys are supported: a behavior record is considered to correspond to a feedback record only when all of the key fields are equal.
  • d) Define the tag type of each field and whether the field is used. For automatic modeling to identify the business meaning of a field correctly and achieve better results, the user needs to specify its tag type. For example, whether an int-type field is continuous or discrete determines which data transformation strategy the automatic modeling algorithm applies to that column. Each column must also be marked as used or not used in model solution exploration, because in a text classification scenario some fields are meaningless and need not be learned from, and some fields are so strongly related to the label that they should not be learned from.
  • e) Define the relationships between the data tables. The data relationship between the behavior data and the business data must be defined so that multi-table modeling can be completed during automatic modeling. The relationship types include, but are not limited to, 1:1 and 1:N. For a 1:N relationship, it is also necessary to specify whether the sub-table is an event table or a slice table (given table A and table B, if the data of table B is spliced into table A, table B is called the sub-table of table A).
  • Data table splicing supports connections between the behavior table and a business data table, as well as direct connections between two business data tables.
  • a text classification scenario can be formally created.
  • the scene module 11 will automatically start a data splicing task that splices the request data and exposure data into behavior data for subsequent model solution exploration and model self-learning.
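The splicing described above (an inner join of request and exposure data into behavior data, followed by association with feedback data on one or more key fields) can be sketched in plain Python. All field names (`request_id`, `item_id`, `shown_at`, `label`) are hypothetical, and using an inner join for the feedback step is an assumption of this sketch rather than something the disclosure specifies.

```python
from typing import Dict, List, Sequence, Tuple

Row = Dict[str, object]


def key_of(row: Row, keys: Sequence[str]) -> Tuple[object, ...]:
    # A composite key: two rows match only if every key field is equal.
    return tuple(row[k] for k in keys)


def inner_join(left: List[Row], right: List[Row], keys: Sequence[str]) -> List[Row]:
    """Return merged rows for every left/right pair that agrees on all keys."""
    index: Dict[Tuple[object, ...], List[Row]] = {}
    for right_row in right:
        index.setdefault(key_of(right_row, keys), []).append(right_row)
    joined = []
    for left_row in left:
        for right_row in index.get(key_of(left_row, keys), []):
            joined.append({**left_row, **right_row})
    return joined


# Hypothetical sample data; the field names are assumptions for illustration.
requests = [
    {"request_id": "r1", "item_id": "n1", "text": "Team wins final"},
    {"request_id": "r1", "item_id": "n2", "text": "Stocks fall"},
    {"request_id": "r2", "item_id": "n3", "text": "New phone released"},
]
exposures = [
    {"request_id": "r1", "item_id": "n1", "shown_at": "2021-10-29T08:00:00"},
    {"request_id": "r1", "item_id": "n2", "shown_at": "2021-10-29T08:00:05"},
]
feedback = [{"request_id": "r1", "item_id": "n1", "label": "sports"}]

join_keys = ["request_id", "item_id"]
behavior = inner_join(requests, exposures, join_keys)  # requested AND exposed
labeled = inner_join(behavior, feedback, join_keys)    # behavior with a label
```

Request `n3` never appears in the exposure data, so the inner join drops it from the behavior data; only the behavior record that matches feedback on both key fields receives a label.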
  • the machine learning-applied text classification device 12 is configured to implement a text classification application configuration.
  • the text classification application configuration can be completed by the user, for example, which business data is used to participate in model solution exploration and model self-learning, and, for example, the data range used for model self-learning.
  • the machine learning-applied text classification device 12 may receive text classification application configuration information and text annotation data input by the user, and the text annotation data is used for model solution exploration, wherein the text annotation data at least includes a text column and a label column.
  • the text column stores text data
  • the label column stores label data
  • each line of text in the text column corresponds to one or more labels.
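A minimal sketch of reading such annotation data is shown below. The disclosure does not specify a file format, so the CSV layout, the column names `text` and `label`, and the `|` separator for multiple labels are all assumptions of this example.

```python
import csv
import io
from typing import List, Tuple


def load_annotations(
    csv_text: str,
    text_column: str = "text",    # assumed column name
    label_column: str = "label",  # assumed column name
    label_separator: str = "|",   # assumed encoding of multiple labels
) -> List[Tuple[str, List[str]]]:
    """Read (text, labels) pairs; every text row must carry at least one label."""
    reader = csv.DictReader(io.StringIO(csv_text))
    examples = []
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        labels = [
            part.strip()
            for part in row[label_column].split(label_separator)
            if part.strip()
        ]
        if not labels:
            raise ValueError(f"line {line_no}: text has no label")
        examples.append((row[text_column], labels))
    return examples


sample = (
    "text,label\n"
    "Team wins the final,sports\n"
    "Club shares jump after win,sports|finance\n"
)
annotations = load_annotations(sample)
```

Rejecting rows without a label enforces the stated constraint that each line of text corresponds to one or more labels before the data reaches model solution exploration.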
  • the machine learning-applied text classification apparatus 12 may provide a user interface through which the user enters text classification application configuration information and uploads text annotation data.
  • the machine learning-applied text classification device 12 may provide a user interface and receive, through the user interface, a user-triggered text classification application creation instruction, wherein the text classification application creation instruction corresponds to the text classification scenario and the text classification task input by the user.
  • the machine learning-applied text classification apparatus 12 may respond to the text classification application creation instruction by displaying a user interface, and acquire through that user interface the text classification application configuration information and text annotation data input by the user.
  • the text classification application configuration information may include, but is not limited to, one or more of the following:
  • the frequency of model self-learning can be configured to trigger learning every time new data arrives or trigger learning every time “N” pieces of data are added.
  • the evaluation indicators of model self-learning: one or more of P (precision), R (recall), F1, and ACC (accuracy).
  • the language of the dataset: Chinese or English. A matching model is automatically selected for training according to the language chosen by the user.
  • GPU acceleration: when this switch is turned on, the GPU is used for model training. If no GPU resource is available, the switch can be turned off and the model is trained using CPU resources.
  • the proportion of evaluation data of the model: the share of the data used for model evaluation in each round of learning.
  • the default value is 8%.
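The evaluation settings above can be illustrated with a small sketch that holds out a share of the data and computes the four indicators. The formulas are the standard definitions for a binary task; the function names, the choice to hold out the last rows, and the rule of keeping at least one evaluation row are assumptions of this example.

```python
from typing import Dict, List, Sequence, Tuple


def split_for_evaluation(rows: Sequence, eval_ratio: float = 0.08) -> Tuple[List, List]:
    """Hold out the last eval_ratio share of rows for evaluation (at least one row)."""
    if not 0.0 < eval_ratio < 1.0:
        raise ValueError("eval_ratio must be between 0 and 1")
    n_eval = max(1, round(len(rows) * eval_ratio))
    return list(rows[:-n_eval]), list(rows[-n_eval:])


def binary_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Precision, recall, F1 and accuracy with label 1 as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    correct = sum(1 for t, p in pairs if t == p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if precision + recall else 0.0
    accuracy = correct / len(pairs) if pairs else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}


train_rows, eval_rows = split_for_evaluation(list(range(50)))
metrics = binary_metrics([1, 1, 0, 0, 1], [1, 0, 0, 1, 1])
```

For multi-class or multi-label tasks the same quantities are usually averaged per class; that averaging is not shown here.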
  • the text classification application configuration information may also include, but is not limited to, one or more of the following:
  • the computing power level can be understood as the complexity of model solution exploration and model self-learning.
  • the higher the computing power level the wider the search space for model solution exploration and model self-learning, and the better the prediction effect of the model obtained by model self-learning.
  • the evaluation data range of the model specifies the data range of the model used to evaluate the self-learning output of the model.
  • whether the model is automatically online specifies whether the model generated by the continuous iterative update of the model self-learning is automatically online. If the model is set to go online automatically, and the model generated by the model self-learning is better than the model that has been deployed online, the model generated by the model self-learning will be automatically online. If the model is not set to go online automatically, you can only manually go online with the model produced by the model's self-learning.
  • whether to use the offline model obtained by model solution exploration: specifies whether to bring the offline model online. If the offline model is not used, only the model solution is brought online; the model solution does not output an estimated result, and a default prediction result (for example, a default prediction value of 0.5) is output to the text classification scene. An estimated result can be output only after model self-learning has produced a model and that model has gone online.
  • if the offline model is used, the offline model goes online at the same time as the model solution, and the offline model can output estimated results. However, since the data used for model solution exploration may differ from the online data, the estimates of the offline model may be less effective.
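  • as a hedged sketch of how the configuration items above might be represented, the following groups them into one structure and expresses the "every N pieces of data" trigger and the automatic go-online rule. The class `AppConfig`, its field names, and the rule that a first model goes online when nothing is deployed yet are assumptions for illustration only.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AppConfig:
    """Hypothetical container for the text classification application configuration."""
    retrain_every_n: int = 1         # 1 means: trigger learning whenever new data arrives
    language: str = "zh"             # "zh" or "en"
    use_gpu: bool = False            # fall back to CPU training when False
    eval_ratio: float = 0.08         # proportion of data used for evaluation
    auto_online: bool = True         # bring better self-learned models online automatically
    use_offline_model: bool = False  # deploy the offline model together with the solution


def should_self_learn(cfg: AppConfig, new_rows_since_last_run: int) -> bool:
    """Trigger self-learning once at least N new pieces of data have been added."""
    return new_rows_since_last_run >= cfg.retrain_every_n


def should_go_online(cfg: AppConfig, candidate_score: float,
                     deployed_score: Optional[float]) -> bool:
    """Decide whether a self-learned model replaces the deployed one automatically.

    Assumption: `deployed_score` is None when no model has been deployed yet.
    """
    if not cfg.auto_online:
        return False  # manual go-online only
    return deployed_score is None or candidate_score > deployed_score
```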
  • the machine learning-applied text classification apparatus 12 is further configured to create a text classification application.
  • the text classification apparatus 12 applying machine learning creates a text classification application based on the text classification application configuration information, wherein the text classification application is a first service program instance configured to perform model scheme exploration based on the text classification application configuration information and the text annotation data uploaded by the user, to obtain a model scheme.
  • a model scheme is a collection of various strategies for modeling, including but not limited to: how to filter data, how to build features, how to tune model hyperparameters, how to choose a model, how to train a model and other strategies.
  • a user interface may be displayed to prompt the user that the creation of the text classification application is completed, and the user may trigger an instruction to start the text classification application, for example by clicking the "Text Classification Application Launch" button on the user interface.
  • the machine learning-applied text classification apparatus 12 is further configured to deploy a text classification application online.
  • the text classification device 12 applying machine learning may, in response to an instruction to start the text classification application, deploy the text classification application online, and generate a text classification service address, so that the text classification application provides services for text classification tasks based on the text classification service address.
  • the online prediction service is based on the online related data of the text classification task.
  • the text classification application is also configured to perform model self-learning based on online related data and model solutions obtained from exploration, and obtain an online text classification model.
  • the model self-learning can use online related data to automatically learn the model on a regular or event-triggered basis, so that the latest data information and business changes can also be learned by the model, ensuring that the effect of the self-learning model continues to be good.
  • the functionality of the scene module 11 may be integrated into the text classification apparatus 12 applying machine learning.
  • FIG. 2 is an exemplary block diagram of a scene module 20 according to an embodiment of the present disclosure.
  • the scene module 20 may be implemented as the scene module 11 in FIG. 1 or a part of the scene module 11 .
  • the scene module 20 can be divided into multiple units, for example, including but not limited to: a data access unit 21 , a scene splicing unit 22 and a data management unit 23 .
  • the data access unit 21 is configured to perform data connection with the text classification scene.
  • the data access unit 21 may acquire relevant data of the text classification task based on the text classification task of the text classification scene.
  • the data access unit 21 may obtain relevant data definitions of the text classification task, and then perform data connection with the text classification scene based on the relevant data definitions to obtain relevant data of the text classification task.
  • the data access unit 21 may create a data interface corresponding to the relevant data definition based on the relevant data definition of the text classification task, and then obtain relevant data of the text classification task through the data interface.
  • the data interface takes the dynamic data table or the data group as the interface, or the data interface is the encapsulation interface, and the encapsulation interface is a unified interface obtained by encapsulating the dynamic data table and the data group.
  • the data interface interfaces with dynamic data tables or data groups.
  • the data access unit 21 uses the dynamic data table or the data group as the data storage carrier.
  • the dynamic data table refers to a data table to which data can still be appended after the table has been created, and the data group refers to a combination of a series of isomorphic data slices (that is, slices whose data fields are the same).
  • data is appended to a data group by adding data slices to it.
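  • the data group described above can be sketched as follows, purely for illustration. The class `DataGroup` and its methods are hypothetical names; the sketch only shows the two stated properties, namely that all slices share the same data fields and that data is appended by adding slices.

```python
from typing import Dict, List


class DataGroup:
    """A series of data slices that all share the same fields (isomorphic slices)."""

    def __init__(self, fields: List[str]) -> None:
        self.fields = tuple(fields)
        self.slices: List[List[Dict[str, object]]] = []

    def append_slice(self, rows: List[Dict[str, object]]) -> None:
        """Append data by adding one more slice; reject rows with different fields."""
        for row in rows:
            if set(row) != set(self.fields):
                raise ValueError(f"row fields {sorted(row)} do not match {sorted(self.fields)}")
        self.slices.append(list(rows))

    def rows(self) -> List[Dict[str, object]]:
        """Return all rows of all slices in insertion order."""
        return [row for data_slice in self.slices for row in data_slice]
```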
  • the corresponding dynamic data table or data group is used as an interface to import data, such as text annotation data.
  • the way of importing data includes but is not limited to one or more of single import, timed import and streaming import.
  • the streaming import is, for example, a Kafka (distributed publish-subscribe messaging system) import. In terms of data sources, local import, database import, FTP (File Transfer Protocol) import, HDFS (Hadoop Distributed File System) import, Hive (Hadoop-based data warehouse tool) import and the like are supported.
  • when the data interface is an encapsulation interface, that is, a unified interface obtained by encapsulating the dynamic data table and the data group, the underlying data storage implementation is not exposed to the user, thereby improving the user experience.
  • only four types of data interfaces need to be exposed to users: request data, exposure (impression) data, feedback data and business data. Users only need to be aware of these four data interfaces, and no longer need to know which specific data set lies behind each of them.
  • the scene splicing unit 22 is configured to splice the request data and the exposure data in the related data to obtain behavior data, such as text classification behavior data.
  • the scene splicing unit 22 constructs behavior data (which may also be referred to as sample data) from the request data and the exposure data by means of an inner join.
  • the scene splicing unit 22 may use filters to process and flatten the request data and the exposure data to construct behavior data.
  • the scene splicing unit 22 may use a filter to filter the request data based on the exposure data to obtain intersection data; and then flatten the intersection data to obtain behavior data.
  • for example, if the exposure data has 10 pieces of data, the request data has 12 pieces of data, and the exposure data and the request data have 10 pieces of data in common, the filter keeps these 10 identical pieces as the intersection data and removes the remaining pieces; the intersection data is then flattened to obtain behavior data.
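  • the filtering and flattening described above can be sketched as an inner join followed by flattening. This is an illustrative sketch only: the join key `request_id`, the dotted field-name prefixes, and the function names are assumptions, since the source does not specify the record layout.

```python
from typing import Dict, List


def flatten(nested: Dict[str, object], prefix: str = "") -> Dict[str, object]:
    """Turn nested dictionaries into a single-level dictionary with dotted keys."""
    flat: Dict[str, object] = {}
    for key, value in nested.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=name + "."))
        else:
            flat[name] = value
    return flat


def build_behavior_data(requests: List[dict], exposures: List[dict],
                        key: str = "request_id") -> List[dict]:
    """Inner-join requests with exposures on `key`, then flatten each joined record."""
    exposed = {e[key]: e for e in exposures}
    behavior = []
    for request in requests:
        exposure = exposed.get(request[key])
        if exposure is None:
            continue  # request without a matching exposure is filtered out
        behavior.append(flatten({"request": request, "exposure": exposure}))
    return behavior
```

  • with 12 request records and 10 exposure records that share 10 keys, the sketch returns 10 flattened behavior records, matching the example above.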
  • the data management unit 23 is configured to manage the data in the first database and the data in the second database.
  • the first database is an offline database.
  • the offline database can be a distributed file storage system (HDFS, Hadoop Distributed File System), or other offline databases.
  • the second database is an online database, such as a real-time feature storage engine (RtiDB), and may also be other online databases.
  • the data management unit 23 may accumulate the online related data of the text classification task acquired by the data access unit 21 into the first database. In some embodiments, the data management unit 23 may accumulate the behavior data obtained by the scene splicing unit 22 into the first database. In some embodiments, the data management unit 23 may store the online related data in the second database.
  • the division of the units in the scene module 20 is only a logical function division, and other division methods are possible in actual implementation. For example, at least two of the data access unit 21, the scene splicing unit 22 and the data management unit 23 may be implemented as one unit; the data access unit 21, the scene splicing unit 22 or the data management unit 23 may also be divided into multiple subunits. It can be understood that each unit or subunit can be implemented by electronic hardware, or by a combination of computer software and electronic hardware. Whether these functions are performed in hardware or software depends on the specific application and design constraints of the technical solution. Those skilled in the art may use different methods to implement the described functionality for each particular application.
  • FIG. 3 is an exemplary block diagram of a text classification apparatus 30 applying machine learning according to an embodiment of the present disclosure.
  • the machine learning-applied text classification apparatus 30 may be implemented as the machine learning-applied text classification apparatus 12 in FIG. 1 or a part thereof.
  • the text classification apparatus 30 applying machine learning can be divided into multiple units, for example, including but not limited to: a text classification application creation module 31 and a text classification application startup module 32 .
  • the text classification application creation module 31 is configured to implement text classification application configuration and create a text classification application.
  • the text classification application creation module 31 acquires text classification application configuration information and text annotation data in response to the text classification application creation instruction of the text classification task.
  • the text classification application creation module 31 provides a user interface to receive a user-triggered text classification application creation instruction. After the user triggers the instruction, the text classification application creation module 31 displays a user interface in response to the text classification application creation instruction of the text classification task, so as to obtain the text classification application configuration information and the text annotation data input by the user through the user interface.
  • the text classification application creation module 31 creates a text classification application based on the text classification application configuration information, wherein the text classification application is a first service program instance configured to perform model scheme exploration based on the text classification application configuration information and the text annotation data, to obtain a model scheme.
  • the text classification application may perform model solution exploration based on data in the first database (e.g., one or more of request data, sample data, feedback data, business data, and exposure data) and the text annotation data to obtain a model solution.
  • a model scheme includes the following scheme subitems: feature engineering scheme, model algorithm, and model hyperparameters.
  • the feature engineering solution has at least a table-joining (table-splicing) function.
  • Feature engineering schemes can also have other capabilities, such as extracting features from data for use by model algorithms or models.
  • the model algorithm can be a commonly used machine learning algorithm, such as a supervised learning algorithm, including but not limited to: LR (Logistic Regression), GBDT (Gradient Boosting Decision Tree), DeepNN (Deep Neural Network), etc.
  • the hyperparameters of the model are parameters that are preset before machine learning to assist model training, such as the number of categories in a clustering algorithm, the step size of the gradient descent method, the number of layers of a neural network, and the learning rate for training a neural network.
  • the text classification application when the text classification application is exploring model solutions, at least two model solutions may be generated, wherein at least one solution sub-item is different between different model solutions.
  • the text classification application uses the at least two model solutions to perform model training based on the data in the first database, and can thereby obtain the parameters of the model itself, for example: the weights in a neural network, the support vectors in a support vector machine, or the coefficients in linear regression or logistic regression.
  • the text classification application may evaluate the models trained with the at least two model solutions based on an evaluation index of the machine learning model, and then select one of the at least two model solutions based on the evaluation results; the selected solution is the model solution obtained by exploration.
  • the evaluation index of the machine learning model is, for example, the AUC (Area Under Curve) value.
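  • as a minimal sketch of selecting among candidate model solutions by AUC, the following computes a pairwise AUC for each candidate's scores on the same evaluation labels and keeps the best one. The function names and the representation of a candidate as a name mapped to its predicted scores are assumptions; how candidates are actually generated and trained is outside this sketch.

```python
from typing import Dict, List, Tuple


def auc(y_true: List[int], scores: List[float]) -> float:
    """Pairwise AUC: share of (positive, negative) pairs ranked correctly, ties count 0.5."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative label")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def select_scheme(candidates: Dict[str, List[float]],
                  y_true: List[int]) -> Tuple[str, float]:
    """Return the name and AUC of the candidate model solution with the highest AUC."""
    scored = {name: auc(y_true, scores) for name, scores in candidates.items()}
    best = max(scored, key=scored.get)
    return best, scored[best]
```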
  • the text classification application may reflow the online related data of the text classification task into the first database, and reflow the labeled intermediate data that the model solution generates from the online related data into the first database.
  • the intermediate data can be the wide-table feature data of the estimated sample (which can be understood as behavior data).
  • the machine learning-applied text classification device 30 provides a reflow data annotation interface to obtain reflow annotation data, and reflows the reflow annotation data into the first database again, wherein the reflow annotation data is obtained by annotating the data that has been reflowed into the first database.
  • the reflow data annotation interface can be provided by a dedicated annotation platform. The annotation platform can retrieve and annotate the data reflowed into the first database according to the dimensions of time and data volume, and the annotated data is then reflowed again, so that the self-learning effect of the model is further improved.
  • the text classification application performs model self-learning based on online related data, labeled intermediate data, backflow labeled data, and model solutions to obtain an online text classification model.
  • the text classification application creation module 31 may package the text classification application configuration information, the second service program instance and the third service program instance into a text classification application.
  • the second service program instance is configured to perform model solution exploration based on text classification application configuration information and text annotation data to obtain a model solution.
  • the third service program instance is configured to perform model self-learning based on the online related data of the text classification task and the model solution obtained through exploration, and obtain an online text classification model.
  • the text classification application launching module 32 is configured to deploy the text classification application to go online.
  • the text classification application launching module 32, in response to the instruction to start the text classification application, deploys the text classification application online and generates a text classification service address, so that the text classification application provides an online estimation service for the text classification task based on the text classification service address, wherein the online estimation service is performed based on online related data of the text classification task.
  • the text classification application launching module 32 may deploy the model solution obtained by the exploration of the second service program instance of the text classification application online.
  • the model solution deployed online may generate labeled intermediate data based on the online related data of the text classification task.
  • the third service program instance of the text classification application may perform model self-learning based on the online related data of the text classification task, the model solution obtained by the second service program instance, and the intermediate data generated by the model solution, and obtain Online text classification model.
  • when the text classification application startup module 32 deploys the model solution online, it may also deploy the offline model obtained during the model solution exploration process. The offline model is obtained by training on the text annotation data and on the related data of the text classification task accumulated in the first database (i.e., the offline database). After the offline model is deployed online, it can provide an estimation service for the relevant data of the text classification scene, so that the offline data and the online data are of the same origin.
  • the third service program instance of the text classification application obtains an online text classification model by training an offline model, wherein the offline model is a model generated while the second service program instance of the text classification application explores the model solution, and when the text classification application launching module 32 deploys the model solution online, it also deploys the offline model online.
  • the third service program instance of the text classification application trains the offline model through the model algorithm in the model scheme and the hyperparameters of the model, updates the parameter values of the offline model itself, and obtains the online text classification model.
  • the text classification application startup module 32 may deploy only the model solution online, without deploying the offline model obtained during the model solution exploration process. This avoids the problem that an offline model deployed directly online predicts poorly because the data obtained by online feature calculation is inconsistent with the data obtained by offline feature calculation.
  • in this case, no estimated result is generated; instead, a default estimated result is output to the text classification scene, and the text classification scene ignores the default estimated result after receiving it.
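  • the default-result behavior can be sketched as a service that returns a flagged default value until a self-learned model has gone online. The class `EstimationService`, the `is_default` flag, and the callable model interface are hypothetical; the value 0.5 follows the example default prediction value given earlier in this description.

```python
from typing import Callable, Dict, Optional

DEFAULT_PREDICTION = 0.5  # example default prediction value from the description


class EstimationService:
    """Hypothetical online estimation service with a default-result fallback."""

    def __init__(self, model: Optional[Callable[[str], float]] = None) -> None:
        self.model = model  # None: only the model solution is online, no model yet

    def go_online(self, model: Callable[[str], float]) -> None:
        """Bring a self-learned model online; later requests use it."""
        self.model = model

    def estimate(self, text: str) -> Dict[str, object]:
        if self.model is None:
            # The caller is expected to ignore results flagged as default.
            return {"score": DEFAULT_PREDICTION, "is_default": True}
        return {"score": self.model(text), "is_default": False}
```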
  • the third service program instance of the text classification application may perform model self-learning based on the online related data of the text classification task, the model algorithm and the model hyperparameters in the model solution explored by the second service program instance, and the intermediate data generated by the model solution, so as to generate an online text classification model; in this case, when the text classification application startup module 32 deploys the model solution online, the offline model is not deployed online.
  • the text classification application launching module 32 can deploy the online text classification model online, so that the online text classification model can provide batch estimation services for text classification tasks.
  • the text classification application startup module 32 may provide a batch estimation service interface, and the batch estimation service interface is configured to obtain the data sets to be batch estimated for the text classification task.
  • the online text classification model deployed online can obtain the datasets to be estimated in batches through the batch estimation service interface, and output batch estimation results based on the datasets to be estimated in batches.
  • the data set to be estimated in batches includes multiple data columns, the multiple data columns include a text column, and may also include other columns, such as an ID column and a remarks column.
  • the online text classification model outputting batch estimation results based on the data set to be batch estimated includes: performing batch estimation based on the text column in the data set to be batch estimated to obtain an estimated label column, and splicing the estimated label column with the data set to be batch estimated to obtain and output the batch estimation results.
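  • the batch estimation step can be sketched as follows: each row's text column is classified and the resulting label is spliced back onto the row, leaving other columns such as an ID column untouched. The column names `text` and `predicted_label` and the toy `keyword_classifier` are assumptions standing in for the real online text classification model.

```python
from typing import Callable, Dict, List


def keyword_classifier(text: str) -> str:
    """Toy stand-in for the online text classification model (assumption)."""
    return "complaint" if "refund" in text.lower() else "other"


def batch_estimate(dataset: List[Dict[str, object]],
                   classify: Callable[[str], str],
                   text_column: str = "text",
                   label_column: str = "predicted_label") -> List[Dict[str, object]]:
    """Estimate a label for every row's text and splice it on as a new column."""
    return [{**row, label_column: classify(str(row[text_column]))} for row in dataset]
```

  • the input rows are not modified; each output row is a copy of the input row plus the estimated label column.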
  • the online text classification model uses the data in the second database and the received request data to perform online real-time feature calculation based on the feature engineering solution in the model solution deployed online, so as to obtain the feature data of the estimated sample.
  • after the online text classification model receives the request data, it joins the data in the second database with the received request data and performs online real-time feature calculation, based on the feature engineering solution in the model solution deployed online, to obtain wide-table feature data; the feature data of the estimated sample obtained in this way is the wide-table feature data.
  • the online text classification model can obtain the feature data (or wide-table feature data) of the estimated sample based on the model solution deployed online, and splice the feature data with the feedback data to generate sample data with features and feedback; the sample data may also include other data, such as timestamp data.
  • before splicing the feature data with the feedback data, the online text classification model may first splice the feature data with the exposure data to obtain feature data with exposure data, and then splice the feature data with exposure data and the feedback data to generate sample data with exposure, features and feedback.
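  • the two-stage splicing can be sketched as two successive joins. The source does not state the join key or the join semantics, so the key `request_id` and the use of inner joins (rows without exposure or feedback are dropped) are assumptions made only for this illustration.

```python
from typing import Dict, List


def splice(left: List[Dict[str, object]], right: List[Dict[str, object]],
           key: str = "request_id") -> List[Dict[str, object]]:
    """Inner-join two record lists on `key`, merging the fields of matching rows."""
    right_by_key = {row[key]: row for row in right}
    return [{**row, **right_by_key[row[key]]} for row in left if row[key] in right_by_key]


def build_samples(features: List[Dict[str, object]],
                  exposures: List[Dict[str, object]],
                  feedback: List[Dict[str, object]],
                  key: str = "request_id") -> List[Dict[str, object]]:
    """Splice features with exposure data first, then with feedback data."""
    features_with_exposure = splice(features, exposures, key)
    return splice(features_with_exposure, feedback, key)
```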
  • the online text classification model reflows the sample data with features and feedback into the first database for model self-learning, and the online text classification model obtained by model self-learning can be deployed online. This ensures that the data and the feature engineering scheme used for model self-learning are respectively consistent with the data and the feature engineering scheme used by the online prediction service, so that the model self-learning effect is consistent with the model prediction effect.
  • the process of model self-learning performed by the third service program instance of the text classification application is: based on the sample data with features and feedback, training is performed using the model algorithm and the model hyperparameters in the model solution to obtain the online text classification model.
  • the text classification application deploying the model solution obtained through exploration includes: replacing the model solution that has already been deployed online with the model solution obtained through exploration.
  • the text classification application startup module 32 can replace the machine learning model that has already been deployed online with the online text classification model; or it can deploy the online text classification model online so that, together with the machine learning model already deployed online, it provides a batch estimation service for text classification tasks.
  • the division of the units in the machine learning-applied text classification apparatus 30 is only a logical function division, and other division methods are possible in actual implementation. For example, the text classification application creation module 31 and the text classification application startup module 32 may be implemented as one unit; the text classification application creation module 31 or the text classification application startup module 32 may also be divided into a plurality of subunits.
  • each unit or sub-unit can be implemented by electronic hardware, or a combination of computer software and electronic hardware. Whether these functions are performed in hardware or software depends on the specific application and design constraints of the technical solution. Those skilled in the art may use different methods for implementing the described functionality for each particular application.
  • FIG. 4 is an exemplary architecture diagram of an online prediction service or a batch prediction service provided by a text classification application according to an embodiment of the present disclosure.
  • text classification applications have at least two functions: model scheme exploration and model self-learning.
  • the text classification application may be a text classification application created by the text classification device 12 applying machine learning in FIG. 1 , and after the deployment of the text classification application goes online, the model solution obtained by the text classification application is also deployed online, The online text classification model obtained by the text classification application through model self-learning is also deployed online.
  • data connection with the text classification scene can be performed to realize data management.
  • the data management is the function of the data management unit 23 shown in FIG. 2 .
  • the second service program instance of the text classification application can explore the model scheme based on the configuration information of the text classification application and the text annotation data to obtain the model scheme, and then the model scheme can be deployed online to provide online estimation services or batches Estimation service (essentially, the estimation result is not output, and the output is the default estimation result, so it is represented by a dotted line in the figure), and the model scheme will return the intermediate data.
  • the third service program instance of the text classification application can perform model self-learning based on the reflowed intermediate data and the model scheme to generate an online text classification model, and the online text classification model can then be deployed online to provide online estimation services or batch estimation services.
  • data management, model self-learning, and the online estimation service can constitute a small closed loop; data management, model solution exploration, and the online estimation service (or batch estimation service) constitute a large closed loop.
  • the small closed-loop ensures that the data and feature engineering solutions used in the model self-learning are the same as those used in the batch prediction service, so that the model self-learning effect and the model prediction effect are consistent.
  • the large closed-loop guarantees that the data used in the exploration of the model scheme (referred to as offline data) and the data used in the batch prediction service (referred to as online data) are of the same origin, realizing the same origin of offline and online data.
  • FIG. 5 is a schematic structural diagram of an electronic device provided by an embodiment of the present disclosure.
  • the text classification apparatus applying machine learning in FIG. 1 may be provided in an electronic device or implemented as an electronic device.
  • the electronic device includes: at least one processor 51 , at least one memory 52 and at least one communication interface 53 .
  • the various components in the electronic device are coupled together by a bus system 54 .
  • the communication interface 53 is configured for information transmission with external devices. Understandably, the bus system 54 is configured to enable connection communication between these components.
  • the bus system 54 also includes a power bus, a control bus and a status signal bus.
  • the various buses are labeled as bus system 54 in FIG. 5 .
  • the memory 52 in this embodiment may be a volatile memory or a non-volatile memory, or may include both volatile and non-volatile memory.
  • memory 52 stores the following elements, executable units or data structures, or subsets thereof, or extended sets of them: operating systems and applications.
  • the operating system including various system programs, such as a framework layer, a core library layer, a driver layer, etc., is configured to implement various basic tasks and process hardware-based tasks.
  • Applications including various applications, such as a media player (Media Player), a browser (Browser), etc., are configured to implement various application tasks.
  • a program for implementing the text classification method using machine learning provided by the embodiments of the present disclosure may be included in the application program.
  • the processor 51 calls the program or instruction stored in the memory 52, specifically the program or instruction stored in the application program, and the processor 51 is thereby configured to execute the steps of the text classification method applying machine learning provided by the embodiments of the present disclosure.
  • the text classification method using machine learning may be applied to the processor 51 or implemented by the processor 51 .
  • the processor 51 may be an integrated circuit chip with signal processing capability. In the implementation process, each step of the above-mentioned method can be completed by a hardware integrated logic circuit in the processor 51 or an instruction in the form of software.
  • the above-mentioned processor 51 can be a general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
  • a general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
  • the steps of the text classification method applying machine learning provided by the embodiments of the present disclosure may be directly embodied as executed by a hardware decoding processor, or executed by a combination of hardware and software units in the decoding processor.
  • the software unit may be located in random access memory, flash memory, read-only memory, programmable read-only memory or electrically erasable programmable memory, registers and other storage media mature in the art.
  • the storage medium is located in the memory 52, and the processor 51 reads the information in the memory 52 and completes the steps of the method in combination with its hardware.
  • FIG. 6 is an exemplary flowchart of a text classification method applying machine learning according to an embodiment of the present disclosure.
  • the execution body of the method is an electronic device.
  • the following embodiments use the electronic device as the execution body to describe the process of the text classification method applying machine learning.
  • in step 601, in response to the text classification application creation instruction of the text classification task, the configuration information of the text classification application and the text annotation data are obtained, wherein the text annotation data includes a text column and a label column.
  • each row of text in the text column corresponds to one or more tags.
  • in step 602, a text classification application is created based on the text classification application configuration information, wherein the text classification application is a first service program instance configured to perform model solution exploration based on the text classification application configuration information and the text annotation data to obtain a model solution.
  • in step 603, in response to the instruction to start the text classification application, the text classification application is deployed online and a text classification service address is generated, so that the text classification application provides an online estimation service for the text classification task based on the text classification service address, wherein the online estimation service is performed based on the online related data of the text classification task.
  • the text classification application is further configured to perform model self-learning based on online related data and model solutions to obtain an online text classification model.
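  • in some embodiments, the method further includes: flowing the online related data back into a first database; flowing labeled intermediate data, generated by the model solution based on the online related data, back into the first database; and providing a returned-data annotation interface to obtain returned annotation data, which is produced by annotating data that has flowed back into the first database and is then flowed back into the first database again.
  • accordingly, the text classification application performs model self-learning based on the online related data, the labeled intermediate data, the returned annotation data, and the model solution to obtain an online text classification model.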
  • the method further includes: deploying an online text classification model online to provide a batch estimation service for text classification tasks.
  • the batch estimation service includes: providing a batch estimation service interface, through which the online text classification model obtains the data set to be batch estimated for the text classification task, and outputting batch estimation results based on that data set.
  • the data set to be batch estimated includes a plurality of data columns, and the plurality of data columns includes a text column;
  • outputting batch estimation results based on the data set to be batch estimated includes: performing batch estimation on the text column of the data set to obtain an estimated label column, and splicing the estimated label column with the data set to obtain and output the batch estimation results.
  • deploying the online text classification model includes: replacing the deployed and online machine learning model with the online text classification model.
  • in some embodiments, before responding to the text classification application creation instruction of the text classification task, the method further comprises:
  • providing a user interface, receiving through the user interface a text classification scenario and a text classification task input by the user, and receiving through the user interface a text classification application creation instruction triggered by the user, wherein the creation instruction corresponds to the text classification scenario and text classification task input by the user.
  • creating a text classification application based on the text classification application configuration information includes: packaging the text classification application configuration information, a second service program instance, and a third service program instance as the text classification application;
  • the second service program instance is configured to perform model solution exploration based on the text classification application configuration information and the text annotation data to obtain a model solution;
  • the third service program instance is configured to perform model self-learning based on the online related data and the model solution to obtain an online text classification model.
  • the text classification application configuration information includes one or more of the following: the frequency of model self-learning; the evaluation metric for model self-learning; the data set language; whether to use GPU acceleration; and the proportion of data used for model evaluation.
  • the text classification application is configured to perform model solution exploration based on data in the first database, the text classification application configuration information, and the text annotation data to obtain a model solution, wherein the model solution includes the following sub-items: a feature engineering solution, a model algorithm, and model hyperparameters; accordingly, deploying the text classification application online includes deploying the model solution obtained through exploration.
  • when the text classification application explores model solutions, it can generate not only a model solution but also the offline model corresponding to that model solution.
  • in some embodiments, if the offline model is deployed online together with the explored model solution, then during model self-learning the online text classification model is obtained by training the offline model, where the offline model is a model generated while the text classification application explores model solutions.
  • in some embodiments, if the offline model is not deployed online together with the explored model solution, the text classification application can perform model self-learning based on the model algorithm and the model hyperparameters in the model solution to generate the online text classification model.
  • embodiments of the present disclosure also provide a non-transitory computer-readable storage medium that stores programs or instructions, the programs or instructions causing a computer to execute the steps of the embodiments of the text classification method applying machine learning; to avoid repetition, they are not described again here.
  • in some embodiments, the text classification application can be regarded as a "learning circle". Deploying the "learning circle" online provides the online estimation service, which can be understood as an online application; deploying the online text classification model obtained by the "learning circle" through self-learning provides the batch estimation service, which can be understood as a batch application. The user only needs to upload the data set to be estimated and indicate the text column to be estimated, and the batch application automatically writes the estimation results into a predict label column appended to the data set.
  • FIGS. 7 to 15 are schematic diagrams of interfaces related to the text classification process applying machine learning, and are not described further.
  • a person without professional knowledge of machine learning can create a text classification application by specifying text classification scenarios, text classification tasks, and text classification application configuration information, thereby reducing the implementation cost of NLP text classification capabilities.
  • model self-learning can be performed based on online related data and model solutions to obtain an online text classification model, so as to realize automatic model construction and reduce model construction costs.
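  • in some embodiments, by managing the data of the text classification scenario (including but not limited to scenario splicing), reusable data for model building is obtained.
  • in some embodiments, by deploying the built model online, a batch estimation service for text classification tasks can be provided.
  • in addition, using the acquired online data, the model solution obtained from exploration, and the intermediate data generated by the batch estimation service, model self-learning can be performed and the model can be updated iteratively and automatically.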

Abstract

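A text classification method, apparatus, electronic device, and storage medium applying machine learning. The method includes: in response to a text classification application creation instruction for a text classification task, obtaining text classification application configuration information and text annotation data (601); creating a text classification application based on the text classification application configuration information, the text classification application being configured at least to perform model solution exploration based on the configuration information and the text annotation data to obtain a model solution (602); and, in response to an instruction to start the text classification application, deploying the text classification application online and generating a text classification service address, so that the text classification application provides an online estimation service for the text classification task based on the text classification service address (603). A person without machine learning expertise can thus have a text classification application created automatically by specifying a text classification scenario, a text classification task, and text classification application configuration information, which lowers the cost of putting text classification capability into practice.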

Description

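Text classification method, apparatus, and electronic device applying machine learning
This disclosure claims priority to Chinese patent application No. 202011196878.9, entitled "Text classification method, apparatus, and electronic device applying machine learning", filed with the Chinese Patent Office on October 30, 2020, the entire contents of which are incorporated into this disclosure by reference.
Technical Field
The embodiments of the present disclosure relate to the technical field of machine learning, and in particular to a text classification method, apparatus, electronic device, and non-transitory computer-readable storage medium applying machine learning.
Background
Natural Language Processing (NLP) text classification is used more and more widely across industries, but it remains a field that demands strong professional skills. In many scenarios there is a need to classify English text as well as Chinese text, so products that can classify both are in strong demand.
At present, NLP text classification models are built mainly by professional modelers, either by writing code by hand or by using a graphical interface (such as a DAG, Directed Acyclic Graph). Both approaches require a large investment of labor and time before a reasonably satisfactory model is obtained, which hinders large-scale exploration and application of models. Beyond software cost, for many small scenarios the purchase of servers with Graphics Processing Units (GPUs) is also a significant expense. It is therefore necessary to provide a text classification solution applying machine learning.
Summary
One object of the embodiments of the present disclosure is to provide a new solution for text classification applying machine learning.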
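In a first aspect, an embodiment of the present disclosure provides a text classification method applying machine learning, including:
in response to a text classification application creation instruction for a text classification task, obtaining text classification application configuration information and text annotation data, the text annotation data including a text column and a label column;
creating a text classification application based on the text classification application configuration information, wherein the text classification application is a first service program instance configured to perform model solution exploration based on the text classification application configuration information and the text annotation data to obtain a model solution;
in response to an instruction to start the text classification application, deploying the text classification application online and generating a text classification service address, so that the text classification application provides an online estimation service for the text classification task based on the text classification service address, wherein the online estimation service is performed based on online related data of the text classification task.
In some embodiments, each row of text in the text column corresponds to one or more labels.
In some embodiments, the text classification application is further configured to perform model self-learning based on the online related data and the model solution to obtain an online text classification model.
In some embodiments, the method further includes:
flowing the online related data back into a first database;
flowing labeled intermediate data, generated by the model solution based on the online related data, back into the first database;
providing a returned-data annotation interface to obtain returned annotation data, and flowing the returned annotation data back into the first database again, wherein the returned annotation data is obtained by annotating data that has flowed back into the first database;
accordingly, the text classification application performs model self-learning based on the online related data, the labeled intermediate data, the returned annotation data, and the model solution to obtain an online text classification model.
In some embodiments, the method further includes: deploying the online text classification model online to provide a batch estimation service for the text classification task.
In some embodiments, the batch estimation service includes: providing a batch estimation service interface, the online text classification model obtaining the data set to be batch estimated for the text classification task through the batch estimation service interface and outputting batch estimation results based on the data set to be batch estimated.
In some embodiments, the data set to be batch estimated includes a plurality of data columns, and the plurality of data columns includes a text column;
outputting batch estimation results based on the data set to be batch estimated includes:
performing batch estimation on the text column of the data set to be batch estimated to obtain an estimated label column;
splicing the estimated label column with the data set to be batch estimated to obtain and output the batch estimation results.
In some embodiments, deploying the online text classification model online includes: replacing the machine learning model already deployed online with the online text classification model.
In some embodiments, before responding to the text classification application creation instruction of the text classification task, the method further includes:
providing a user interface, receiving through the user interface a text classification scenario and a text classification task input by the user, and receiving through the user interface a text classification application creation instruction triggered by the user, the creation instruction corresponding to the text classification scenario and text classification task input by the user.
In some embodiments, creating a text classification application based on the text classification application configuration information includes:
packaging the text classification application configuration information, a second service program instance, and a third service program instance as the text classification application;
wherein the second service program instance is configured to perform model solution exploration based on the text classification application configuration information and the text annotation data to obtain a model solution;
wherein the third service program instance is configured to perform model self-learning based on the online related data and the model solution to obtain an online text classification model.
In some embodiments, the text classification application configuration information includes one or more of the following:
the frequency of model self-learning;
the evaluation metric for model self-learning;
the data set language;
whether to use GPU acceleration;
the proportion of data used for model evaluation.
In some embodiments, the text classification application is configured to perform model solution exploration based on data in the first database, the text classification application configuration information, and the text annotation data to obtain a model solution, wherein the model solution includes the following sub-items: a feature engineering solution, a model algorithm, and model hyperparameters;
accordingly, deploying the text classification application online includes: deploying the model solution obtained through exploration online.
In some embodiments, the online text classification model is obtained by training an offline model, wherein the offline model is a model generated during model solution exploration, and the offline model is deployed online together with the explored model solution.
In some embodiments, the online text classification model is a model generated based on the model algorithm and the model hyperparameters in the model solution, and the offline model is not deployed online when the explored model solution is deployed online.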
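In a second aspect, an embodiment of the present disclosure further provides a text classification apparatus applying machine learning, including:
a text classification application creation module, configured to obtain, in response to a text classification application creation instruction for a text classification task, text classification application configuration information and text annotation data, the text annotation data including a text column and a label column, and to create a text classification application based on the text classification application configuration information, wherein the text classification application is a first service program instance configured to perform model solution exploration based on the text classification application configuration information and the text annotation data to obtain a model solution;
a text classification application starting module, configured to deploy the text classification application online in response to an instruction to start the text classification application and to generate a text classification service address, so that the text classification application provides an online estimation service for the text classification task based on the text classification service address, wherein the online estimation service is performed based on online related data of the text classification task.
In some embodiments, each row of text in the text column corresponds to one or more labels.
In some embodiments, the text classification application is further configured to perform model self-learning based on the online related data and the model solution to obtain an online text classification model.
In some embodiments, the text classification application is further configured to flow the online related data back into a first database, and to flow labeled intermediate data, generated by the model solution based on the online related data, back into the first database;
the text classification apparatus applying machine learning further includes an annotation interface module configured to provide a returned-data annotation interface to obtain returned annotation data, and to flow the returned annotation data back into the first database again, wherein the returned annotation data is obtained by annotating data that has flowed back into the first database;
accordingly, the text classification application performs model self-learning based on the online related data, the labeled intermediate data, the returned annotation data, and the model solution to obtain an online text classification model.
In some embodiments, the text classification application starting module is further configured to deploy the online text classification model online to provide a batch estimation service for the text classification task.
In some embodiments, the text classification application starting module is further configured to provide a batch estimation service interface, the online text classification model obtaining the data set to be batch estimated for the text classification task through the batch estimation service interface and outputting batch estimation results based on the data set to be batch estimated.
In some embodiments, the data set to be batch estimated includes a plurality of data columns, and the plurality of data columns includes a text column;
the online text classification model outputting batch estimation results based on the data set to be batch estimated includes: performing batch estimation on the text column of the data set to obtain an estimated label column, and splicing the estimated label column with the data set to obtain and output the batch estimation results.
In some embodiments, the text classification application starting module deploying the online text classification model online includes: replacing the machine learning model already deployed online with the online text classification model.
In some embodiments, the text classification application creation module is further configured to:
before responding to the text classification application creation instruction of the text classification task, provide a user interface, receive through the user interface a text classification scenario and a text classification task input by the user, and receive through the user interface a text classification application creation instruction triggered by the user, the creation instruction corresponding to the text classification scenario and text classification task input by the user.
In some embodiments, the text classification application creation module creating a text classification application based on the text classification application configuration information includes:
packaging the text classification application configuration information, a second service program instance, and a third service program instance as the text classification application;
wherein the second service program instance is configured to perform model solution exploration based on the text classification application configuration information and the text annotation data to obtain a model solution;
wherein the third service program instance is configured to perform model self-learning based on the online related data and the model solution to obtain an online text classification model.
In some embodiments, the text classification application configuration information includes one or more of the following:
the frequency of model self-learning;
the evaluation metric for model self-learning;
the data set language;
whether to use GPU acceleration;
the proportion of data used for model evaluation.
In some embodiments, the text classification application is configured to perform model solution exploration based on data in the first database, the text classification application configuration information, and the text annotation data to obtain a model solution, wherein the model solution includes the following sub-items: a feature engineering solution, a model algorithm, and model hyperparameters;
accordingly, the text classification application starting module deploying the text classification application online includes: deploying the model solution obtained through exploration online.
In some embodiments, the online text classification model is obtained by training an offline model, wherein the offline model is a model generated during model solution exploration, and the offline model is deployed online together with the explored model solution.
In some embodiments, the online text classification model is a model generated based on the model algorithm and the model hyperparameters in the model solution, and the offline model is not deployed online when the explored model solution is deployed online.
In a third aspect, an embodiment of the present disclosure further provides an electronic device, including a processor and a memory; the processor is configured to execute the method steps of any embodiment of the first aspect by calling programs or instructions stored in the memory.
In a fourth aspect, an embodiment of the present disclosure further provides a non-transitory computer-readable storage medium configured to store programs or instructions, the programs or instructions causing a computer to execute the method steps of any embodiment of the first aspect.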
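It can be seen that, in at least one embodiment of the present disclosure, a person without machine learning expertise can create a text classification application by specifying a text classification scenario, a text classification task, and text classification application configuration information, which lowers the cost of putting NLP text classification capability into practice.
In some embodiments, model self-learning can be performed based on the online related data and the model solution to obtain an online text classification model, so that models are built automatically and the cost of model building is reduced.
In some embodiments, by managing the data of the text classification scenario (including but not limited to scenario splicing), reusable data for model building is obtained.
In some embodiments, by deploying the built model online, a batch estimation service for the text classification task can be provided. In addition, using the acquired online data, the model solution obtained through exploration, and the intermediate data generated by the batch estimation service, model self-learning can be performed so that the model is updated iteratively and automatically.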
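Brief Description of the Drawings
To explain the technical solutions of the embodiments of the present disclosure more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. The drawings described below are merely some embodiments of the present disclosure, and a person of ordinary skill in the art can derive other drawings from them.
FIG. 1 is an exemplary application scenario diagram of text classification applying machine learning according to an embodiment of the present disclosure;
FIG. 2 is an exemplary block diagram of a scenario module according to an embodiment of the present disclosure;
FIG. 3 is an exemplary block diagram of a text classification apparatus applying machine learning according to an embodiment of the present disclosure;
FIG. 4 is an exemplary architecture diagram of a text classification application providing an online estimation service or a batch estimation service according to an embodiment of the present disclosure;
FIG. 5 is an exemplary block diagram of an electronic device according to an embodiment of the present disclosure;
FIG. 6 is an exemplary flowchart of a text classification method applying machine learning according to an embodiment of the present disclosure;
FIGS. 7 to 15 are schematic diagrams of interfaces related to the text classification process applying machine learning according to an embodiment of the present disclosure.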
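Detailed Description
To make the above objects, features, and advantages of the present disclosure clearer, the disclosure is described in further detail below with reference to the drawings and embodiments. The described embodiments are some, not all, of the embodiments of the present disclosure. The specific embodiments described here serve only to explain the disclosure and do not limit it. All other embodiments obtained by a person of ordinary skill in the art based on the described embodiments fall within the scope of protection of the present disclosure.
It should be noted that relational terms such as "first" and "second" are used herein only to distinguish one entity or operation from another, and do not necessarily require or imply any actual relationship or order between those entities or operations.
Because NLP text classification models are currently built mainly by professional modelers, and training such modelers is costly, the shortage of professional modelers cannot be made up quickly, which hinders large-scale exploration and application of NLP text classification models.
In addition, after an NLP text classification model has been online for some time, its estimation quality degrades, so professional modelers must rebuild and retune it. This investment of personnel recurs at regular intervals, so the more models are built, the higher the personnel cost. How to update models iteratively and automatically so as to maintain their estimation quality is therefore also a problem that urgently needs solving.
To this end, the embodiments of the present disclosure provide a text classification solution applying machine learning that forms a closed learning loop for an NLP text classification application: from problem definition, to modeling, to online model serving and offline batch prediction, and on to the collection of feedback and the iterative updating of the model. By training the NLP text classification model automatically and bringing the NLP text classification application online automatically, the solution addresses at its root the difficulty and high cost of putting NLP text classification capability into practice, so that people without NLP experience can complete the whole process of implementing a text classification scenario. Both CPU and GPU modes are supported, so small scenarios without a GPU can still explore text classification. Both Chinese and English text classification are supported, which further widens the range of scenarios covered.
In some embodiments, a non-professional modeler or a person without NLP experience can specify a text classification scenario, a text classification task, text classification application configuration information, and text annotation data, and the text classification solution applying machine learning can then create a text classification application automatically based on the configuration information. The text classification application can perform model solution exploration based on the configuration information and the text annotation data to obtain a model solution. The created application can then be deployed online and a text classification service address generated, so that the application provides an online estimation service for the text classification task based on that address; the online estimation service is performed based on the online related data of the text classification task. In the embodiments of the present disclosure, the created text classification application can thus explore model solutions automatically, fully automating the model building process and reducing modeling cost.
In some embodiments, the created text classification application can also perform model self-learning based on the acquired online related data of the text classification task and the model solution obtained through exploration, to obtain an online text classification model. The created application can thus learn by itself, updating the model iteratively and automatically and maintaining its estimation quality. Models and applications can be built with a low barrier to entry even without professional modelers or a background in machine learning.
FIG. 1 is an exemplary application scenario diagram of text classification applying machine learning according to an embodiment of the present disclosure. As shown in FIG. 1, the scenario module 11 and the text classification apparatus 12 applying machine learning can interface with a text classification scenario. The text classification scenario can be specified by the user, who can further specify the text classification tasks within that scenario. There can be several tasks, and for each one the text classification apparatus 12 can create a corresponding text classification application: one application per task, with different tasks getting different applications. A text classification application is configured to handle its corresponding task, for example a real-time estimation task or a batch estimation task. Real-time estimation performs an estimation (i.e. a text classification) as soon as an estimation request (i.e. a text classification request) is received. Batch estimation is not real-time: it is triggered on a schedule or by an event and processes requests in batches. For example, several estimation requests are estimated together only when a preset batch condition is met, such as when a preset number of requests has accumulated.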
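The difference between real-time and batch estimation described above can be sketched as a small buffer that only calls the classifier once a preset number of requests has accumulated. This is a minimal illustration, not the patent's implementation: `BatchEstimator`, `toy_classify`, and the batch size of 3 are hypothetical names and values chosen for the example.

```python
from typing import Callable, List


class BatchEstimator:
    """Buffers estimation requests and classifies them once a preset count is reached."""

    def __init__(self, classify: Callable[[List[str]], List[str]], batch_size: int):
        self.classify = classify
        self.batch_size = batch_size
        self.pending: List[str] = []

    def submit(self, text: str) -> List[str]:
        """Queue one request; return labels only when a full batch is flushed."""
        self.pending.append(text)
        if len(self.pending) < self.batch_size:
            return []  # batch condition not met yet
        batch, self.pending = self.pending, []
        return self.classify(batch)


def toy_classify(texts: List[str]) -> List[str]:
    # Hypothetical stand-in for the deployed classification model.
    return ["sports" if "match" in text else "other" for text in texts]


estimator = BatchEstimator(toy_classify, batch_size=3)
first = estimator.submit("match report")
second = estimator.submit("stock update")
third = estimator.submit("cup match tonight")
```

A real-time service would correspond to a batch size of 1; a scheduled trigger would call `classify` on whatever is pending when a timer fires.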
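The scenario module 11 is configured to implement text classification scenario definition. In some embodiments, the scenario definition can be completed by the user; for example, the user can define a news classification scenario. Accordingly, the scenario module 11 receives the scenario definition information input by the user. In some embodiments, the scenario module 11 can provide a user interface through which the user inputs the scenario definition information to specify the text classification scenario and the text classification task. In some embodiments, the scenario definition information can include but is not limited to one or more of: the scenario name, the task name, the task ID, and the related data definition of the task. Different tasks have different related data definitions. In some embodiments, the related data definition can be a data table schema definition. In some embodiments, the schema definition includes but is not limited to one or more of: the names of one or more data tables, the fields each data table contains, and the data relationships between the data tables.
In some embodiments, the scenario module 11 is also configured for data access. For example, the scenario module 11 obtains the related data of a text classification task based on the text classification task of the text classification scenario. In some embodiments, the related data can include but is not limited to request data, impression data, and feedback data.
The scenario definition information can include but is not limited to:
1) Basic information such as the name and remarks of the scenario, used to identify and distinguish scenarios.
2) The definition of the related data. Related data includes but is not limited to request data, impression data, feedback data, and business data. Request data is the information sent to the text classification application: for example, if there are ten thousand news items that, combined with other information, are to be classified by the text classification application, those ten thousand items are the request data. After estimation by the application, the business side (or customer) will not read all ten thousand items but only those of interest, perhaps 100 of them; those 100 items are the impression data. Finally, the category that each item actually read belongs to is the feedback data. In addition to request, impression, and feedback data, a scenario may contain business data, i.e. other information that may help improve the estimation quality of the text classification application, such as BO (Business Object) data like a customer's basic information and remarks. A scenario may have no business data or several kinds of it.
a) Define the schema of each related data stream (the request, impression, feedback, and business data streams), for example which fields each stream contains. The information to configure includes the field name, the field type, and a field remark (optional).
b) Note that after the request data and impression data are obtained, behavior data is constructed from them by an inner join; the behavior data can be used for subsequent model solution exploration and model self-learning.
3) The definition of the data description information and of the relationships between the data tables. Specifically:
a) Define the time field of the behavior data. Select a field of a time type in the behavior data as the main time field; this field should be the actual time at which the behavior occurred.
b) Define the feedback field (label) of the feedback data and its type. Select the label field from the feedback data. In a binary classification scenario the label is 1 or 0, indicating positive and negative samples; in a regression scenario the label is a continuous value describing the actual situation, such as a PM2.5 reading. After selecting the label field, its type must also be chosen: binary classification, regression, or multi-class classification.
c) Define the join fields of the behavior data and the feedback data, i.e. which fields each of them uses as the association key for joining. Multiple key groups are supported: a behavior record is considered to correspond to a feedback record only when all of the key fields are equal.
d) Define the mark type of each field and whether the field is used. For certain field types, the user must specify the mark type so that automatic modeling can recognize the business meaning correctly and achieve better results. For an int field, for instance, whether it is continuous or discrete determines which data transformation strategy the automatic modeling algorithm applies to that column. Each column must also be marked as used or unused in model solution exploration, because a text classification scenario may contain meaningless fields that need not be learned, or fields so strongly correlated with the label that they should not be learned.
e) Define the relationships between data tables. The relationship between the behavior data and the business data must be defined so that automatic modeling can handle multiple tables. Relationship types include but are not limited to 1:1 and 1:N. For a 1:N relationship, the table type of the side table must also be specified as either an event table or a slice table (given tables A and B, if the data of table B is joined into table A, table B is called the side table of table A). Table joining supports joining the behavior table with business data tables as well as joining business data tables directly with each other.
After the above definitions are complete, a text classification scenario can be formally created. Once it is created, the scenario module 11 automatically starts a data splicing task that splices the request data and impression data into behavior data for later use in model solution exploration and model self-learning.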
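Item c) above, joining behavior data to feedback data on one or more key fields that must all be equal, can be sketched with plain dictionaries. The field names (`user_id`, `news_id`, `event_time`, `label`) and the function `join_on_keys` are hypothetical and are not taken from the patent.

```python
from typing import Dict, List, Sequence, Tuple

Row = Dict[str, object]


def join_on_keys(behavior: List[Row], feedback: List[Row], keys: Sequence[str]) -> List[Row]:
    """Attach feedback fields to each behavior row whose join keys all match."""
    index: Dict[Tuple[object, ...], Row] = {}
    for fb in feedback:
        # If several feedback rows share the same keys, the last one wins here.
        index[tuple(fb[k] for k in keys)] = fb
    joined: List[Row] = []
    for row in behavior:
        match = index.get(tuple(row[k] for k in keys))
        if match is not None:
            joined.append({**row, **match})
    return joined


behavior_rows = [
    {"user_id": 1, "news_id": "a", "event_time": "2021-10-29T08:00:00"},
    {"user_id": 1, "news_id": "b", "event_time": "2021-10-29T08:05:00"},
]
feedback_rows = [
    {"user_id": 1, "news_id": "a", "label": "sports"},
    {"user_id": 2, "news_id": "b", "label": "finance"},
]
samples = join_on_keys(behavior_rows, feedback_rows, keys=("user_id", "news_id"))
```

The second behavior row is dropped because only one of its two key fields matches a feedback row, which is the "all key fields equal" condition in the text.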
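The text classification apparatus 12 applying machine learning is configured to implement text classification application configuration. In some embodiments, the configuration can be completed by the user, for example which business data takes part in model solution exploration and model self-learning, or the range of data used for model self-learning. Accordingly, the text classification apparatus 12 can receive the text classification application configuration information and the text annotation data input by the user. The text annotation data is used for model solution exploration and includes at least a text column and a label column: the text column stores the text data, the label column stores the annotation data, and each row of text in the text column corresponds to one or more labels. In some embodiments, the text classification apparatus 12 can provide a user interface through which the user inputs the configuration information and uploads the text annotation data.
In some embodiments, from the user's point of view, after defining a scenario the user will want to create the corresponding text classification application. The text classification apparatus 12 can therefore provide a user interface and receive through it a text classification application creation instruction triggered by the user, the creation instruction corresponding to the text classification scenario and text classification task input by the user. After the user triggers the creation instruction, for example by clicking a "Create text classification application" button, the text classification apparatus 12 can respond by displaying a user interface through which it obtains the configuration information and the text annotation data input by the user.
In some embodiments, the text classification application configuration information can include but is not limited to one or more of the following:
the frequency of model self-learning;
the evaluation metric for model self-learning;
the data set language;
whether to use GPU acceleration;
the proportion of data used for model evaluation.
The frequency of model self-learning can be configured to trigger learning each time new data arrives, or to trigger learning after every N newly added data records.
The evaluation metrics for model self-learning include one or more of four metrics: P (precision), R (recall), F1, and ACC (accuracy).
The data set language covers two language models, Chinese and English; a different model is matched automatically for training according to the language selected by the user.
GPU acceleration is a switch: when it is on, the GPU is used for model training; if no GPU resources are available, the switch can be turned off and the model is trained on CPU resources.
The proportion of data used for model evaluation is the fraction of the data used to evaluate the model in each round of learning. In this embodiment, the default value is 8%.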
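The five configuration items above could be carried in a small configuration object. The sketch below is an assumption about shape only: the class name, field names, the `"zh"`/`"en"` codes, and every default except the 8% evaluation proportion are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class TextClassificationAppConfig:
    # None: learn whenever new data arrives; N: learn after N newly added rows.
    self_learning_every_n_rows: Optional[int] = None
    # Any of "P", "R", "F1", "ACC"; the default chosen here is an assumption.
    evaluation_metrics: List[str] = field(default_factory=lambda: ["F1"])
    dataset_language: str = "zh"  # "zh" or "en" (hypothetical codes)
    use_gpu: bool = False         # off: train on CPU resources
    evaluation_fraction: float = 0.08  # the 8% default stated in the embodiment

    def should_trigger_learning(self, new_rows: int) -> bool:
        """Decide whether the amount of newly arrived data triggers self-learning."""
        if new_rows <= 0:
            return False
        if self.self_learning_every_n_rows is None:
            return True
        return new_rows >= self.self_learning_every_n_rows

    def split_sizes(self, total_rows: int) -> Tuple[int, int]:
        """Return (training rows, evaluation rows) for one round of learning."""
        eval_rows = round(total_rows * self.evaluation_fraction)
        return total_rows - eval_rows, eval_rows


default_cfg = TextClassificationAppConfig()
every_100 = TextClassificationAppConfig(self_learning_every_n_rows=100)
train_rows, eval_rows = default_cfg.split_sizes(1000)
```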
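In some embodiments, the text classification application configuration information can further include but is not limited to one or more of the following:
the business data used for model solution exploration and model self-learning;
the range of data used for model solution exploration and model self-learning;
the computing power level for model solution exploration and model self-learning;
the range of data used for model evaluation;
whether models go online automatically;
whether to use the offline model obtained through model solution exploration.
The computing power level can be understood as the complexity of model solution exploration and model self-learning. The higher the level, the wider the search space that exploration and self-learning cover, and the better the estimation quality of the model obtained by self-learning.
The range of data used for model evaluation specifies the data on which models produced by self-learning are evaluated.
Whether models go online automatically specifies whether the models produced by the continuous iterations of self-learning are brought online automatically. If automatic go-live is enabled, a model produced by self-learning is brought online automatically when it performs better than the model already deployed. If it is disabled, models produced by self-learning can only be brought online manually.
Whether to use the offline model obtained through model solution exploration specifies whether the offline model is brought online. If the offline model is not used and only the model solution is brought online, the model solution does not output estimation results; what the text classification scenario receives is a default estimation result (for example a default predicted value such as 0.5), and real estimation results are output only after self-learning has produced a model and that model has gone online. If the offline model is used, it is brought online together with the model solution and can output estimation results; however, because the data used for model solution exploration may differ from the online data, the estimation quality of the offline model may be poor.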
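The automatic go-live rule described above can be sketched as a single decision function. `ModelVersion`, `next_online_model`, and the scores are hypothetical; treating "no model deployed yet" as an automatic go-live is an assumption of this sketch, since the text only covers the case where a deployed model exists.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ModelVersion:
    name: str
    score: float  # value of the configured evaluation metric; higher is better


def next_online_model(
    deployed: Optional[ModelVersion],
    candidate: ModelVersion,
    auto_online: bool,
) -> Optional[ModelVersion]:
    """Return the model that should be serving after a self-learning round."""
    if not auto_online:
        return deployed  # the candidate waits for a manual go-live
    if deployed is None or candidate.score > deployed.score:
        return candidate  # assumption: the first model goes live automatically
    return deployed


current = ModelVersion("v1", score=0.81)
better = ModelVersion("v2", score=0.86)
worse = ModelVersion("v3", score=0.79)
```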
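In some embodiments, the text classification apparatus 12 applying machine learning is also configured to create the text classification application. For example, the text classification apparatus 12 creates a text classification application based on the text classification application configuration information, wherein the text classification application is a first service program instance configured to perform model solution exploration based on the configuration information and the text annotation data uploaded by the user to obtain a model solution. A model solution is a collection of strategies used for modeling, including but not limited to how to filter data, how to construct features, how to tune model hyperparameters, how to select a model, and how to train a model. In some embodiments, after creating the text classification application, the text classification apparatus 12 can display a user interface to inform the user that creation is complete, and the user can trigger an instruction to start the application, for example by clicking a "Start text classification application" button on the user interface.
In some embodiments, the text classification apparatus 12 applying machine learning is also configured to deploy the text classification application online. For example, in response to an instruction to start the text classification application, the text classification apparatus 12 can deploy the application online and generate a text classification service address, so that the application provides an online estimation service for the text classification task based on that address; the online estimation service is performed based on the online related data of the text classification task. The text classification application is further configured to perform model self-learning based on the online related data and the model solution obtained through exploration, to obtain an online text classification model. Model self-learning can run on a schedule or be triggered by events, using the online related data to learn automatically, so that the latest data and business changes are also learned by the model and the quality of the self-learned model remains good.
In some embodiments, the functions of the scenario module 11 can be integrated into the text classification apparatus 12 applying machine learning.
FIG. 2 is an exemplary block diagram of a scenario module 20 according to an embodiment of the present disclosure. In some embodiments, the scenario module 20 can be implemented as the scenario module 11 in FIG. 1 or as part of it.
As shown in FIG. 2, the scenario module 20 can be divided into several units, including but not limited to a data access unit 21, a scenario splicing unit 22, and a data management unit 23.
The data access unit 21 is configured to exchange data with the text classification scenario. In some embodiments, the data access unit 21 can obtain the related data of a text classification task based on the text classification task of the text classification scenario. In some embodiments, the data access unit 21 can obtain the related data definition of the text classification task and then, based on that definition, exchange data with the text classification scenario to obtain the related data of the task.
In some embodiments, the data access unit 21 can create, based on the related data definition of the text classification task, the data interface corresponding to that definition, and then obtain the related data of the task through the data interface. The data interface is either a dynamic data table or a data group, or it is an encapsulated interface, i.e. a unified interface obtained by encapsulating dynamic data tables and data groups.
In some embodiments, the data interface is a dynamic data table or a data group. The data access unit 21 uses the dynamic data table or data group as the data storage carrier. A dynamic data table is a data table to which data can still be appended after the table has been created. A data group is a combination of a series of homogeneous data slices (slices with the same data fields); new data is appended by adding a new data slice to the data group. In this embodiment, when the user needs more data for training or estimation, the data, for example text annotation data, is imported through the corresponding dynamic data table or data group. Import modes include but are not limited to one or more of one-off import, scheduled import, and streaming import, where streaming import is for example import from Kafka (a distributed publish-subscribe messaging system). As for data sources, local import, database import, FTP (File Transfer Protocol) import, HDFS (Hadoop Distributed File System) import, hive (a Hadoop-based data warehouse tool) import, and other methods are supported, to meet the data import needs of different text classification scenarios.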
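The data group described above, a series of homogeneous slices to which new data is appended as a further slice, can be sketched as follows. `DataGroup` and its methods are hypothetical names; the field check is one possible reading of "homogeneous", not a documented behavior.

```python
from typing import Dict, Iterator, List, Sequence

Row = Dict[str, object]


class DataGroup:
    """A series of homogeneous data slices that share the same fields."""

    def __init__(self, fields: Sequence[str]):
        self.fields = frozenset(fields)
        self.slices: List[List[Row]] = []

    def append_slice(self, rows: List[Row]) -> None:
        """Add new data as a whole slice, rejecting rows with different fields."""
        for row in rows:
            if frozenset(row) != self.fields:
                raise ValueError(
                    f"row fields {sorted(row)} do not match {sorted(self.fields)}"
                )
        self.slices.append(list(rows))

    def rows(self) -> Iterator[Row]:
        """Iterate over all rows of all slices in insertion order."""
        for data_slice in self.slices:
            yield from data_slice


group = DataGroup(fields=("text", "label"))
group.append_slice([{"text": "cup match tonight", "label": "sports"}])
group.append_slice([{"text": "rates stay high", "label": "finance"}])
try:
    group.append_slice([{"text": "no label here"}])
    rejected = False
except ValueError:
    rejected = True
```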
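In some embodiments, the data interface is an encapsulated interface, i.e. a unified interface obtained by encapsulating dynamic data tables and data groups. In this embodiment, encapsulating dynamic data tables and data groups into a unified data interface hides the underlying storage implementation from the user and improves the user experience. For example, through the encapsulated interface only four kinds of data interface need to be exposed to the user: request data, impression data, feedback data, and business data. The user only has to be aware of these four interfaces and not of the specific data groups behind them.
The scenario splicing unit 22 is configured to splice the request data and the impression data in the related data to obtain behavior data, for example text classification behavior. In some embodiments, the scenario splicing unit 22 constructs the behavior data (which can also be called sample data) from the request data and impression data by an inner join.
In some embodiments, the scenario splicing unit 22 can construct the behavior data by applying a filter and a flatten operation to the request data and impression data.
For example, the scenario splicing unit 22 can use a filter to filter the request data against the impression data and obtain the intersection data, and then flatten the intersection data to obtain the behavior data. Suppose the impression data has 10 records, the request data has 12 records, and 10 records are common to both. The scenario splicing unit 22 filters out the records that differ, keeping the 10 common records as the intersection data, and then flattens those 10 records to obtain the behavior data.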
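The filter-then-flatten example above (12 requests, 10 impressions, 10 in common) can be sketched as follows. The record layout, the `request_id` key, and the reading of "flatten" as turning nested fields into flat columns are assumptions made for this illustration.

```python
from typing import Dict, List


def flatten(record: Dict[str, object], prefix: str = "") -> Dict[str, object]:
    """Turn nested dictionaries into a single level with dotted column names."""
    flat: Dict[str, object] = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=f"{name}."))
        else:
            flat[name] = value
    return flat


def build_behavior_data(
    requests: List[Dict[str, object]],
    impressions: List[Dict[str, object]],
    key: str = "request_id",
) -> List[Dict[str, object]]:
    """Keep only requests that also appear in the impressions, then flatten them."""
    exposed = {impression[key] for impression in impressions}
    return [flatten(request) for request in requests if request[key] in exposed]


requests = [
    {"request_id": i, "news": {"title": f"title {i}", "source": "wire"}}
    for i in range(12)
]
impressions = [{"request_id": i} for i in range(10)]
behavior = build_behavior_data(requests, impressions)
```

The two requests without a matching impression are filtered out, which has the same effect as the inner join mentioned earlier.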
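The data management unit 23 is configured to manage the data in a first database and the data in a second database. In some embodiments, the first database is an offline database, for example a distributed file storage system such as HDFS (Hadoop Distributed File System) or another offline database. In some embodiments, the second database is an online database, for example a real-time feature storage engine (RtiDB) or another online database.
In some embodiments, the data management unit 23 can accumulate the online related data of the text classification task obtained by the data access unit 21 in the first database. In some embodiments, the data management unit 23 can accumulate the behavior data obtained by the scenario splicing unit 22 in the first database. In some embodiments, the data management unit 23 can store the online related data in the second database.
In some embodiments, the division of the scenario module 20 into units is only a division by logical function, and other divisions are possible in an actual implementation. For example, at least two of the data access unit 21, the scenario splicing unit 22, and the data management unit 23 can be implemented as one unit, and the data access unit 21, the scenario splicing unit 22, or the data management unit 23 can also be divided into several sub-units. Each unit or sub-unit can be implemented in electronic hardware or in a combination of computer software and electronic hardware. Whether these functions are performed in hardware or in software depends on the specific application and the design constraints of the technical solution. A person skilled in the art can use different methods to implement the described functions for each specific application.
FIG. 3 is an exemplary block diagram of a text classification apparatus 30 applying machine learning according to an embodiment of the present disclosure. In some embodiments, the text classification apparatus 30 can be implemented as the text classification apparatus 12 in FIG. 1 or as part of it.
As shown in FIG. 3, the text classification apparatus 30 applying machine learning can be divided into several units, including but not limited to a text classification application creation module 31 and a text classification application starting module 32.
The text classification application creation module 31 is configured to implement text classification application configuration and to create the text classification application. In some embodiments, the creation module 31 obtains text classification application configuration information and text annotation data in response to a text classification application creation instruction for a text classification task. In some embodiments, the creation module 31 provides a user interface through which it receives the creation instruction triggered by the user; once the user has triggered the instruction, the creation module 31 responds by displaying a user interface through which it obtains the configuration information and the text annotation data input by the user.
In some embodiments, the text classification application creation module 31 creates a text classification application based on the text classification application configuration information, wherein the text classification application is a first service program instance configured to perform model solution exploration based on the configuration information and the text annotation data to obtain a model solution.
In some embodiments, the text classification application can perform model solution exploration based on the data in the first database (for example one or more of request data, sample data, feedback data, business data, and impression data) and the text annotation data to obtain a model solution. The model solution includes the following sub-items: a feature engineering solution, a model algorithm, and model hyperparameters. The feature engineering solution has at least a table-joining function and can have other functions as well, for example extracting features from the data for use by the model algorithm or the model. The model algorithm can be a commonly used machine learning algorithm, for example a supervised learning algorithm, including but not limited to LR (Logistic Regression), GBDT (Gradient Boosting Decision Tree), and DeepNN (Deep Neural Network). Model hyperparameters are parameters set before machine learning starts in order to guide model training, for example the number of clusters in a clustering algorithm, the step size of gradient descent, the number of layers of a neural network, and the learning rate used to train a neural network.
In some embodiments, when exploring model solutions, the text classification application can generate at least two model solutions that differ from each other in at least one sub-item. In some embodiments, the application trains a model with each of the at least two model solutions on the data in the first database to obtain the parameters of the model itself, for example the weights of a neural network, the support vectors of a support vector machine, or the coefficients of a linear or logistic regression. In some embodiments, the application can evaluate the models trained by the at least two model solutions against a machine learning model evaluation metric, and then select among the model solutions based on the evaluation results to obtain the explored model solution. The evaluation metric is, for example, the AUC (Area Under Curve) value.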
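The exploration loop described above (generate several candidate model solutions that differ in at least one sub-item, train a model with each, evaluate, and keep the best) can be sketched with a toy learner. Everything here is illustrative: `ModelScheme`, the `token_vote` learner, the `min_count` hyperparameter, the tiny data set, and the use of accuracy (ACC) as the metric are assumptions; the patent names LR, GBDT, and DeepNN as example algorithms and AUC as an example metric.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence, Tuple

Sample = Tuple[str, str]  # (text, label)


@dataclass(frozen=True)
class ModelScheme:
    feature_plan: str  # "unigram" or "bigram" (toy feature engineering)
    algorithm: str     # always the toy "token_vote" learner in this sketch
    hyperparameters: Tuple[Tuple[str, int], ...]


def featurize(text: str, plan: str) -> List[str]:
    tokens = text.lower().split()
    if plan == "bigram":
        return [" ".join(pair) for pair in zip(tokens, tokens[1:])] or tokens
    return tokens


def train(scheme: ModelScheme, samples: Sequence[Sample]) -> Callable[[str], str]:
    """Fit a vote table: each feature votes for the labels it was seen with."""
    min_count = dict(scheme.hyperparameters)["min_count"]
    votes: Dict[str, Counter] = defaultdict(Counter)
    for text, label in samples:
        for feat in featurize(text, scheme.feature_plan):
            votes[feat][label] += 1
    default = Counter(label for _, label in samples).most_common(1)[0][0]
    table = {f: c for f, c in votes.items() if sum(c.values()) >= min_count}

    def predict(text: str) -> str:
        tally: Counter = Counter()
        for feat in featurize(text, scheme.feature_plan):
            tally.update(table.get(feat, {}))
        return tally.most_common(1)[0][0] if tally else default

    return predict


def accuracy(model: Callable[[str], str], samples: Sequence[Sample]) -> float:
    return sum(model(text) == label for text, label in samples) / len(samples)


def explore(schemes, train_set, eval_set) -> Tuple[float, ModelScheme]:
    """Train one model per scheme and return the best (score, scheme) pair."""
    scored = [(accuracy(train(s, train_set), eval_set), s) for s in schemes]
    return max(scored, key=lambda pair: pair[0])


train_set = [
    ("the team won the match", "sports"),
    ("a late goal decided the match", "sports"),
    ("the central bank raised rates", "finance"),
    ("shares fell as rates rose", "finance"),
]
eval_set = [("the match ended in a draw", "sports"), ("rates stay high", "finance")]
schemes = [
    ModelScheme("unigram", "token_vote", (("min_count", 1),)),
    ModelScheme("unigram", "token_vote", (("min_count", 3),)),
    ModelScheme("bigram", "token_vote", (("min_count", 1),)),
]
best_score, best_scheme = explore(schemes, train_set, eval_set)
```

The three candidates differ only in the feature plan or in one hyperparameter, matching the requirement that candidate solutions differ in at least one sub-item.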
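In some embodiments, the text classification application can flow the online related data of the text classification task back into the first database, and flow the labeled intermediate data, generated by the model solution based on the online related data, back into the first database. The intermediate data can be the wide-table feature data of the estimation samples (which can be understood as behavior data). In some embodiments, the text classification apparatus 30 applying machine learning provides a returned-data annotation interface to obtain returned annotation data and flows the returned annotation data back into the first database again, the returned annotation data being obtained by annotating data that has flowed back into the first database. In some embodiments, the returned-data annotation interface can be provided by a dedicated annotation platform, which can fetch data from what has flowed back into the first database along the dimensions of time and data volume, annotate it, and flow the annotated data back into the first database again, further improving the effect of model self-learning.
In some embodiments, the text classification application performs model self-learning based on the online related data, the labeled intermediate data, the returned annotation data, and the model solution to obtain an online text classification model.
In some embodiments, the text classification application creation module 31 can package the text classification application configuration information, a second service program instance, and a third service program instance as the text classification application. The second service program instance is configured to perform model solution exploration based on the configuration information and the text annotation data to obtain a model solution. The third service program instance is configured to perform model self-learning based on the online related data of the text classification task and the model solution obtained through exploration, to obtain an online text classification model.
The text classification application starting module 32 is configured to deploy the text classification application online. In some embodiments, in response to an instruction to start the text classification application, the starting module 32 deploys the application online and generates a text classification service address, so that the application provides an online estimation service for the text classification task based on that address; the online estimation service is performed based on the online related data of the text classification task.
In some embodiments, the text classification application starting module 32 can deploy online the model solution explored by the second service program instance of the text classification application. Accordingly, the deployed model solution can generate labeled intermediate data based on the online related data of the text classification task. In some embodiments, the third service program instance of the text classification application can perform model self-learning based on the online related data of the text classification task, the model solution explored by the second service program instance, and the intermediate data generated by the model solution, to obtain an online text classification model.
In some embodiments, when the text classification application starting module 32 deploys the model solution online, it also deploys online the offline model obtained during model solution exploration. The offline model is trained on the related data of the text classification task accumulated in the first database (i.e. the offline database) and on the text annotation data, and once deployed online it can provide estimation for the related data of the text classification scenario. Therefore, although the data produced by online and offline feature computation may be inconsistent, online and offline data still come from the same source.
In some embodiments, the third service program instance of the text classification application obtains the online text classification model by training the offline model, where the offline model is a model generated while the second service program instance explores model solutions, and the starting module 32 deploys the offline model online together with the model solution. In some embodiments, the third service program instance trains the offline model using the model algorithm and the model hyperparameters in the model solution, updating the parameter values of the offline model itself, to obtain the online text classification model.
In some embodiments, the text classification application starting module 32 deploys only the model solution online and does not deploy the offline model obtained during model solution exploration. This avoids the problem that an offline model deployed directly online estimates poorly because the data produced by online feature computation and by offline feature computation are inconsistent. In addition, because only the model solution is deployed and the offline model is not, no estimation result is generated: when request data is received, a default estimation result is output to the text classification scenario, which ignores it on receipt.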
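The behavior described above, where a service with only the model solution deployed answers with a default estimate until self-learning delivers a model, can be sketched as follows. `OnlineEstimationService`, `returned_rows`, and the lambda model are hypothetical; 0.5 is the example default value mentioned earlier in the text.

```python
from typing import Callable, List, Optional

DEFAULT_ESTIMATE = 0.5  # example default predicted value given in the text


class OnlineEstimationService:
    """Serves estimates; falls back to a default until a model is deployed."""

    def __init__(self, model: Optional[Callable[[str], float]] = None):
        self.model = model
        # Stands in for request data flowing back to the first database.
        self.returned_rows: List[str] = []

    def estimate(self, text: str) -> float:
        self.returned_rows.append(text)  # data flows back either way
        if self.model is None:
            return DEFAULT_ESTIMATE      # only the model solution is online
        return self.model(text)

    def deploy_model(self, model: Callable[[str], float]) -> None:
        """Bring a self-learned model online, replacing the default behavior."""
        self.model = model


service = OnlineEstimationService()
before = service.estimate("cup match tonight")
service.deploy_model(lambda text: 0.9 if "match" in text else 0.1)
after = service.estimate("cup match tonight")
```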
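In some embodiments, the third service program instance of the text classification application can perform model self-learning based on the online related data of the text classification task, the model algorithm and model hyperparameters in the model solution explored by the second service program instance, and the intermediate data generated by the model solution, to generate the online text classification model; in this case the starting module 32 does not deploy the offline model online when it deploys the model solution.
In some embodiments, the text classification application starting module 32 can deploy the online text classification model online so that it provides a batch estimation service for the text classification task. In some embodiments, the starting module 32 can provide a batch estimation service interface configured to obtain the data set to be batch estimated for the text classification task. Accordingly, the deployed online text classification model can obtain the data set through that interface and output batch estimation results based on it.
In some embodiments, the data set to be batch estimated includes a plurality of data columns; these include a text column and can include other columns such as an ID column and a remarks column. The online text classification model outputting batch estimation results based on the data set includes: performing batch estimation on the text column of the data set to obtain an estimated label column, and splicing the estimated label column with the data set to obtain and output the batch estimation results.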
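The batch estimation step above, classifying the text column and splicing the estimated label column onto the original data set, can be sketched with the standard `csv` module. The column names (`id`, `text`, `remark`, `predict_label`) and the keyword classifier are hypothetical stand-ins for the deployed model and the user's data.

```python
import csv
import io
from typing import Callable


def batch_estimate(
    csv_text: str,
    text_column: str,
    classify: Callable[[str], str],
    label_column: str = "predict_label",
) -> str:
    """Classify the text column of a CSV data set and append a label column."""
    reader = csv.DictReader(io.StringIO(csv_text))
    fieldnames = list(reader.fieldnames or []) + [label_column]
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    for row in reader:
        # All original columns are kept; the estimate is spliced on at the end.
        row[label_column] = classify(row[text_column])
        writer.writerow(row)
    return out.getvalue()


dataset = "id,text,remark\n1,cup match tonight,a\n2,rates stay high,b\n"
result = batch_estimate(
    dataset,
    text_column="text",
    classify=lambda text: "sports" if "match" in text else "finance",
)
```

The user only names the text column; the other columns pass through untouched, which mirrors the "upload the data set and indicate the text column" workflow in the text.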
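In some embodiments, taking one request record as an example, when the online text classification model receives the request data, it performs online real-time feature computation on the data in the second database and the received request data, based on the feature engineering solution in the deployed model solution, to obtain the feature data of the estimation sample. In some embodiments, when the online text classification model receives request data, it joins tables and performs online real-time feature computation on the data in the second database and the received request data, based on the feature engineering solution in the deployed model solution, to obtain wide-table feature data; the feature data of the estimation sample is then the wide-table feature data.
In some embodiments, the online text classification model can obtain the feature data (or wide-table feature data) of the estimation sample based on the deployed model solution, and splice the feature data with the feedback data to generate sample data with features and feedback; the sample data can include other data as well, such as timestamp data. In some embodiments, before splicing the feature data with the feedback data, the online text classification model splices the feature data with the impression data to obtain feature data with impression data, and then splices that with the feedback data to generate sample data with impressions, features, and feedback. In some embodiments, the online text classification model flows the sample data with features and feedback back into the first database for model self-learning. The online text classification model obtained by self-learning can be deployed online, which ensures that the data and feature engineering solution used for self-learning are consistent with those used by the online estimation service, so that the self-learning results and the estimation results are consistent.
In some embodiments, the third service program instance of the text classification application performs model self-learning as follows: based on the sample data with features and feedback, it trains with the model algorithm and the model hyperparameters in the model solution to obtain the online text classification model.
In some embodiments, the text classification application deploying the explored model solution online includes: replacing the model solution already deployed online with the explored model solution.
In some embodiments, the text classification application starting module 32 can replace the machine learning model already deployed online with the online text classification model; or it can deploy the online text classification model online to provide the batch estimation service for the text classification task jointly with the machine learning model already deployed.
In some embodiments, the division of the text classification apparatus 30 applying machine learning into units is only a division by logical function, and other divisions are possible in an actual implementation. For example, the text classification application creation module 31 and the text classification application starting module 32 can be implemented as one unit, and either module can also be divided into several sub-units. Each unit or sub-unit can be implemented in electronic hardware or in a combination of computer software and electronic hardware. Whether these functions are performed in hardware or in software depends on the specific application and the design constraints of the technical solution. A person skilled in the art can use different methods to implement the described functions for each specific application.
FIG. 4 is an exemplary architecture diagram of a text classification application providing an online estimation service or a batch estimation service according to an embodiment of the present disclosure. As shown in FIG. 4, the text classification application has at least two functions: model solution exploration and model self-learning. In some embodiments, the text classification application can be the one created by the text classification apparatus 12 in FIG. 1; after the application is deployed online, the model solution it has explored is deployed online as well, and so is the online text classification model it obtains through model self-learning.
With reference to FIG. 4, the process by which the text classification application provides an online estimation service or a batch estimation service is as follows:
After the text classification scenario has been defined, data can be exchanged with the scenario to implement data management, for example the functions of the data management unit 23 shown in FIG. 2. Once the text classification application is online, its second service program instance can perform model solution exploration based on the text classification application configuration information and the text annotation data to obtain a model solution, and the model solution can then be deployed online to provide the online estimation service or the batch estimation service (in fact it does not output estimation results but a default estimation result, which is why it is drawn with a dashed line in the figure); the model solution flows the intermediate data back. The third service program instance of the application can perform model self-learning based on the returned intermediate data and the model solution to produce the online text classification model, which can then be deployed online to provide the online estimation service or the batch estimation service.
It can be seen that, in FIG. 4, data management, model self-learning, and the online estimation service (or batch estimation service) form a small closed loop, while data management, model solution exploration, and the online estimation service (or batch estimation service) form a large closed loop. The small loop ensures that the data and feature engineering solution used for model self-learning are the same as those used by the batch estimation service, so that the self-learning results and the estimation results are consistent. The large loop ensures that the data used for model solution exploration (offline data for short) and the data used by the batch estimation service (online data for short) come from the same source.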
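FIG. 5 is a schematic structural diagram of an electronic device according to an embodiment of the present disclosure. In some embodiments, the text classification apparatus applying machine learning in FIG. 1 can be arranged in the electronic device or implemented as the electronic device.
As shown in FIG. 5, the electronic device includes at least one processor 51, at least one memory 52, and at least one communication interface 53. The components of the electronic device are coupled together by a bus system 54. The communication interface 53 is configured for information transmission with external devices. The bus system 54 is configured to implement connection and communication between these components. In addition to a data bus, the bus system 54 includes a power bus, a control bus, and a status signal bus; for clarity, the various buses are all labeled as the bus system 54 in FIG. 5.
The memory 52 in this embodiment can be volatile memory or non-volatile memory, or can include both.
In some implementations, the memory 52 stores the following elements, executable units or data structures, or a subset or an extended set of them: an operating system and application programs.
The operating system contains various system programs, such as a framework layer, a core library layer, and a driver layer, and is configured to implement various basic tasks and to handle hardware-based tasks. The application programs include various applications, such as a media player and a browser, and are configured to implement various application tasks. A program implementing the text classification method applying machine learning provided by the embodiments of the present disclosure can be included in the application programs.
In the embodiments of the present disclosure, the processor 51 is configured to execute the steps of the embodiments of the text classification method applying machine learning by calling programs or instructions stored in the memory 52, specifically programs or instructions stored in the application programs.
The text classification method applying machine learning provided by the embodiments of the present disclosure can be applied in the processor 51 or implemented by the processor 51. The processor 51 can be an integrated circuit chip with signal processing capability. During implementation, the steps of the above method can be completed by integrated logic circuits of hardware in the processor 51 or by instructions in the form of software. The processor 51 can be a general purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component. A general purpose processor can be a microprocessor or any conventional processor.
The steps of the text classification method applying machine learning provided by the embodiments of the present disclosure can be executed directly by a hardware decoding processor, or by a combination of hardware and software units in a decoding processor. The software units can reside in a storage medium that is mature in the art, such as random access memory, flash memory, read-only memory, programmable read-only memory, electrically erasable programmable memory, or registers. The storage medium is located in the memory 52; the processor 51 reads the information in the memory 52 and completes the steps of the method in combination with its hardware.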
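FIG. 6 is an exemplary flowchart of a text classification method applying machine learning according to an embodiment of the present disclosure. The method is executed by an electronic device; for ease of description, the following embodiments describe the flow of the method with the electronic device as the executing entity.
As shown in FIG. 6, in step 601, in response to a text classification application creation instruction for a text classification task, text classification application configuration information and text annotation data are obtained, wherein the text annotation data includes a text column and a label column. In some embodiments, each row of text in the text column corresponds to one or more labels.
In step 602, a text classification application is created based on the text classification application configuration information, wherein the text classification application is a first service program instance configured to perform model solution exploration based on the configuration information and the text annotation data to obtain a model solution.
In step 603, in response to an instruction to start the text classification application, the text classification application is deployed online and a text classification service address is generated, so that the application provides an online estimation service for the text classification task based on that address; the online estimation service is performed based on the online related data of the text classification task. In some embodiments, the text classification application is further configured to perform model self-learning based on the online related data and the model solution to obtain an online text classification model.
In some embodiments, the method further includes:
flowing the online related data back into a first database;
flowing labeled intermediate data, generated by the model solution based on the online related data, back into the first database;
providing a returned-data annotation interface to obtain returned annotation data, and flowing the returned annotation data back into the first database again, wherein the returned annotation data is obtained by annotating data that has flowed back into the first database;
accordingly, the text classification application performs model self-learning based on the online related data, the labeled intermediate data, the returned annotation data, and the model solution to obtain an online text classification model.
In some embodiments, the method further includes: deploying the online text classification model online to provide a batch estimation service for the text classification task.
In some embodiments, the batch estimation service includes: providing a batch estimation service interface, the online text classification model obtaining the data set to be batch estimated for the text classification task through the batch estimation service interface and outputting batch estimation results based on the data set.
In some embodiments, the data set to be batch estimated includes a plurality of data columns, and the plurality of data columns includes a text column;
the online text classification model outputting batch estimation results based on the data set to be batch estimated includes:
performing batch estimation on the text column of the data set to obtain an estimated label column, and splicing the estimated label column with the data set to obtain and output the batch estimation results.
In some embodiments, deploying the online text classification model online includes: replacing the machine learning model already deployed online with the online text classification model.
In some embodiments, before responding to the text classification application creation instruction of the text classification task, the method further includes:
providing a user interface, receiving through the user interface a text classification scenario and a text classification task input by the user, and receiving through the user interface a text classification application creation instruction triggered by the user, wherein the creation instruction corresponds to the text classification scenario and text classification task input by the user.
In some embodiments, creating a text classification application based on the text classification application configuration information includes:
packaging the text classification application configuration information, a second service program instance, and a third service program instance as the text classification application;
wherein the second service program instance is configured to perform model solution exploration based on the text classification application configuration information and the text annotation data to obtain a model solution;
wherein the third service program instance is configured to perform model self-learning based on the online related data and the model solution to obtain an online text classification model.
In some embodiments, the text classification application configuration information includes one or more of the following:
the frequency of model self-learning;
the evaluation metric for model self-learning;
the data set language;
whether to use GPU acceleration;
the proportion of data used for model evaluation.
In some embodiments, the text classification application is configured to perform model solution exploration based on the data in the first database, the text classification application configuration information, and the text annotation data to obtain a model solution, wherein the model solution includes the following sub-items: a feature engineering solution, a model algorithm, and model hyperparameters; accordingly, deploying the text classification application online includes deploying the model solution obtained through exploration online.
When the text classification application explores model solutions, it can generate not only a model solution but also the offline model corresponding to that model solution.
In some embodiments, if the offline model is deployed online together with the explored model solution, then during model self-learning the online text classification model is obtained by training the offline model, i.e. the text classification application obtains the online text classification model by training the offline model, where the offline model is a model generated while the text classification application explores model solutions.
In some embodiments, if the offline model is not deployed online together with the explored model solution, the text classification application can perform model self-learning based on the model algorithm and the model hyperparameters in the model solution to generate the online text classification model.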
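It should be noted that, for simplicity of description, the foregoing method embodiments are all expressed as a series of combined actions, but a person skilled in the art will understand that the embodiments of the present disclosure are not limited by the described order of actions, because according to the embodiments of the present disclosure some steps can be performed in another order or simultaneously. A person skilled in the art will also understand that the embodiments described in the specification are all optional embodiments.
An embodiment of the present disclosure further provides a non-transitory computer-readable storage medium that stores programs or instructions, the programs or instructions causing a computer to execute the steps of the embodiments of the text classification method applying machine learning; to avoid repetition, they are not described again here.
In some embodiments, the text classification application can be regarded as a "learning circle". Deploying the "learning circle" online provides the online estimation service, which can be understood as an online application; deploying the online text classification model obtained by the "learning circle" through self-learning provides the batch estimation service, which can be understood as a batch application. The user only needs to upload the data set to be estimated and indicate the text column to be estimated, and the batch application automatically writes the estimation results into a predict label column appended to the data set. FIGS. 7 to 15 are schematic diagrams of interfaces related to the text classification process applying machine learning and are not described further.
It should be noted that, herein, the terms "include", "comprise", and any variants of them are intended to cover non-exclusive inclusion, so that a process, method, article, or apparatus that includes a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, method, article, or apparatus. Without further limitation, an element qualified by the phrase "including ..." does not exclude the presence of further identical elements in the process, method, article, or apparatus that includes the element.
A person skilled in the art will understand that although some embodiments described here include certain features that are included in other embodiments and not others, combinations of features of different embodiments are meant to be within the scope of the present disclosure and to form different embodiments.
A person skilled in the art will understand that the description of each embodiment has its own emphasis; for parts not detailed in one embodiment, reference can be made to the related descriptions of other embodiments.
Although the embodiments of the present disclosure have been described with reference to the drawings, a person skilled in the art can make various modifications and variations without departing from the spirit and scope of the present disclosure, and such modifications and variations all fall within the scope defined by the appended claims.
Industrial Applicability
In at least one embodiment of the present disclosure, a person without machine learning expertise can create a text classification application by specifying a text classification scenario, a text classification task, and text classification application configuration information, which lowers the cost of putting NLP text classification capability into practice. In some embodiments, model self-learning can be performed based on the online related data and the model solution to obtain an online text classification model, so that models are built automatically and the cost of model building is reduced. In some embodiments, by managing the data of the text classification scenario (including but not limited to scenario splicing), reusable data for model building is obtained. In some embodiments, by deploying the built model online, a batch estimation service for the text classification task can be provided. In addition, using the acquired online data, the model solution obtained through exploration, and the intermediate data generated by the batch estimation service, model self-learning can be performed so that the model is updated iteratively and automatically.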

Claims (30)

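  1. A text classification method applying machine learning, comprising:
    in response to a text classification application creation instruction for a text classification task, obtaining text classification application configuration information and text annotation data, the text annotation data comprising a text column and a label column;
    creating a text classification application based on the text classification application configuration information, wherein the text classification application is a first service program instance configured at least to perform model solution exploration based on the text classification application configuration information and the text annotation data to obtain a model solution;
    in response to an instruction to start the text classification application, deploying the text classification application online and generating a text classification service address, so that the text classification application provides an online estimation service for the text classification task based on the text classification service address, wherein the online estimation service is performed based on online related data of the text classification task.
  2. The method according to claim 1, wherein each row of text in the text column corresponds to one or more labels.
  3. The method according to claim 1 or 2, wherein the text classification application is further configured to perform model self-learning based on the online related data and the model solution to obtain an online text classification model.
  4. The method according to any one of claims 1 to 3, wherein the method further comprises:
    flowing the online related data back into a first database;
    flowing labeled intermediate data, generated by the model solution based on the online related data, back into the first database;
    providing a returned-data annotation interface to obtain returned annotation data, and flowing the returned annotation data back into the first database again, wherein the returned annotation data is obtained by annotating data that has flowed back into the first database;
    accordingly, the text classification application performs model self-learning based on the online related data, the labeled intermediate data, the returned annotation data, and the model solution to obtain an online text classification model.
  5. The method according to claim 3 or 4, wherein the method further comprises: deploying the online text classification model online to provide a batch estimation service for the text classification task.
  6. The method according to claim 5, wherein the batch estimation service comprises: providing a batch estimation service interface, the online text classification model obtaining the data set to be batch estimated for the text classification task through the batch estimation service interface and outputting batch estimation results based on the data set to be batch estimated.
  7. The method according to claim 6, wherein the data set to be batch estimated comprises a plurality of data columns, the plurality of data columns comprising a text column;
    outputting batch estimation results based on the data set to be batch estimated comprises:
    performing batch estimation on the text column of the data set to be batch estimated to obtain an estimated label column;
    splicing the estimated label column with the data set to be batch estimated to obtain and output the batch estimation results.
  8. The method according to any one of claims 5 to 7, wherein deploying the online text classification model online comprises: replacing the machine learning model already deployed online with the online text classification model.
  9. The method according to any one of claims 1 to 8, wherein, before responding to the text classification application creation instruction of the text classification task, the method further comprises:
    providing a user interface, receiving through the user interface a text classification scenario and a text classification task input by a user, and receiving through the user interface a text classification application creation instruction triggered by the user, the text classification application creation instruction corresponding to the text classification scenario and text classification task input by the user.
  10. The method according to any one of claims 3 to 9, wherein creating a text classification application based on the text classification application configuration information comprises:
    packaging the text classification application configuration information, a second service program instance, and a third service program instance as the text classification application;
    wherein the second service program instance is configured to perform model solution exploration based on the text classification application configuration information and the text annotation data to obtain a model solution;
    wherein the third service program instance is configured to perform model self-learning based on the online related data and the model solution to obtain an online text classification model.
  11. The method according to any one of claims 3 to 10, wherein the text classification application configuration information comprises one or more of the following:
    the frequency of model self-learning;
    the evaluation metric for model self-learning;
    the data set language;
    whether to use GPU acceleration;
    the proportion of data used for model evaluation.
  12. The method according to any one of claims 4 to 11, wherein the text classification application is configured to perform model solution exploration based on data in the first database, the text classification application configuration information, and the text annotation data to obtain a model solution, wherein the model solution comprises the following sub-items: a feature engineering solution, a model algorithm, and model hyperparameters;
    accordingly, deploying the text classification application online comprises: deploying the model solution obtained through exploration online.
  13. The method according to claim 12, wherein the online text classification model is obtained by training an offline model, wherein the offline model is a model generated during model solution exploration, and the offline model is deployed online together with the explored model solution.
  14. The method according to claim 12, wherein the online text classification model is a model generated based on the model algorithm and the model hyperparameters in the model solution, and the offline model is not deployed online when the explored model solution is deployed online.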
  15. 一种应用机器学习的文本分类装置,包括:
    文本分类应用创建模块,被配置为响应于文本分类任务的文本分类应用创建指令,获取文本分类应用配置信息和文本标注数据;所述文本标注数据包括一个文本列和一个标签列;基于所述文本分类应用配置信息,创建文本分类应用;其中,所述文本分类应用为第一服务程序实例,被配置为基于所述文本分类应用配置信息和所述文本标注数据进行模型方案探索,得到模型方案;
    文本分类应用启动模块,被配置为响应于启动所述文本分类应用的指令,将所述文本分类应用部署上线,并生成文本分类服务地址,以使所述文本分类应用基于所述文本分类服务地址,提供针对所述文本分类任务的在线预估服务;其中,所述在线预估服务基于所述文本分类任务的线上相关数据进行。
  16. 根据权利要求15所述的装置,其中,所述文本列中每一行文本对应一个或多个标签。
  17. 根据权利要求15或16所述的装置,其中,所述文本分类应用还被配置为基于所述线上相关数据和所述模型方案,进行模型自学习,得到在线文本分类模型。
  18. 根据权利要求15至17任一项所述的装置,其中,所述文本分类应用还被配置为将所述线上相关数据回流到第一数据库中,将所述模型方案基于所述线上相关数据生成的带标签的中间数据回流到所述第一数据库中;
    所述应用机器学习的文本分类装置还包括标注接口模块,所述标注接口模块被配置为提供回流数据标注接口,以获取回流标注数据,并将所述回流标注数据再次回流到所述第一数据库中,其中,所述回流标注数据为针对回流到所述第一数据库中的数据进行标注得到;
    相应地,所述文本分类应用基于所述线上相关数据、所述带标签的中间数据、所述回流标注数据和所述模型方案,进行模型自学习,得到在线文本分类模型。
  19. 根据权利要求17或18所述的装置,其中,所述文本分类应用启动模块还被配置为将所述在线文本分类模型部署上线,以提供针对所述文本分类任务的批量预估服务。
  20. 根据权利要求19所述的装置,其中,所述文本分类应用启动模块还被配置为提供一个批量预估服务接口,在线文本分类模型基于所述批量预估服务接口获取所述文本分类任务的待批量预估的数据集,并基于所述待批量预估的数据集输出批量预估结果。
  21. 根据权利要求20所述的装置,其中,所述待批量预估的数据集包括多个数据列,所述多个数据列包括一个文本列;
    所述在线文本分类模型基于所述待批量预估的数据集输出批量预估结果包括:基于所述待批量预估的数据集中的文本列进行批量预估,得到一个预估标签列,将所述预估标签列与所述待批量预估的数据集进行拼接,得到批量预估结果并输出。
  22. 根据权利要求19至21任一项所述的装置,其中,所述文本分类应用启动模块将所述在线文本分类模型部署上线包括:将所述在线文本分类模型替换已部署上线的机器学习模型。
  23. 根据权利要求15至22任一项所述的装置,其中,所述文本分类应用创建模块还被配置为:
    响应于文本分类任务的文本分类应用创建指令之前,提供用户界面,基于所述用户界面接收用户输入的文本分类场景和文本分类任务,以及基于所述用户界面接收用户触发的文本分类应用创建指令,所述文本分类应用创建指令与用户输入的文本分类场景和文本分类任务相对应。
  24. 根据权利要求17至23任一项所述的装置,其中,所述文本分类应用创建模块基于所述文本分类应用配置信息,创建文本分类应用包括:
    将所述文本分类应用配置信息、第二服务程序实例和第三服务程序实例打包为文本分类应用;
    其中,所述第二服务程序实例被配置为基于所述文本分类应用配置信息和所述文本标注数据进行模型方案探索,得到模型方案;
    其中,所述第三服务程序实例被配置为基于所述线上相关数据和所述模型方案,进行模型自学习,得到在线文本分类模型。
  25. 根据权利要求17至24任一项所述的装置,其中,所述文本分类应用配置信息包括如下中的一种或多种:
    模型自学习的频率;
    模型自学习的评价指标;
    数据集语言;
    使用GPU加速;
    模型的评估数据占比。
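  26. The apparatus according to any one of claims 18 to 25, wherein the text classification application is configured to perform model scheme exploration based on the data in the first database, the text classification application configuration information, and the text annotation data to obtain the model scheme; wherein the model scheme comprises the following scheme sub-items: a feature engineering scheme, a model algorithm, and model hyperparameters;
    correspondingly, the text classification application launch module deploying the text classification application online comprises: deploying the explored model scheme online.
  27. The apparatus according to claim 26, wherein the online text classification model is obtained by training an offline model; wherein the offline model is a model produced during the model scheme exploration, and when the explored model scheme is deployed online, the offline model is also deployed online.
  28. The apparatus according to claim 26, wherein the online text classification model is a model generated based on the model algorithm and the model hyperparameters in the model scheme; and when the explored model scheme is deployed online, no offline model is deployed online.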
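Claims 26 to 28 describe a model scheme with three sub-items and two alternative starting points for the online model. The following sketch contrasts the two alternatives; the names `ModelScheme`, `DeployedScheme`, and `initial_online_model`, and the dictionary used to represent a model, are assumptions made purely for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ModelScheme:
    feature_engineering: str  # feature engineering scheme
    algorithm: str            # model algorithm
    hyperparameters: dict     # model hyperparameters


@dataclass
class DeployedScheme:
    scheme: ModelScheme
    # Hypothetical weights of the offline model produced during exploration;
    # None means the offline model was not deployed with the scheme.
    offline_model: Optional[dict] = None


def initial_online_model(deployed: DeployedScheme) -> dict:
    """Pick the starting point for self-learning."""
    if deployed.offline_model is not None:
        # Claim 27 variant: continue training from the deployed offline model.
        return {"start": "offline_model", "weights": dict(deployed.offline_model)}
    # Claim 28 variant: build a fresh model from algorithm and hyperparameters only.
    return {
        "start": "scheme_only",
        "algorithm": deployed.scheme.algorithm,
        "hyperparameters": dict(deployed.scheme.hyperparameters),
        "weights": {},
    }
```

The trade-off in this sketch is that the claim 27 variant ships more state at deployment time, while the claim 28 variant ships only the scheme and learns its weights entirely from online data.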
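  29. An electronic device, comprising: a processor and a memory;
    the processor is configured to perform the steps of the method according to any one of claims 1 to 14 by invoking a program or instructions stored in the memory.
  30. A non-transitory computer-readable storage medium, the non-transitory computer-readable storage medium storing a program or instructions, the program or instructions causing a computer to perform the steps of the method according to any one of claims 1 to 14.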
PCT/CN2021/127675 2020-10-30 2021-10-29 Text classification method, apparatus and electronic device applying machine learning WO2022089613A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202011196878.9A CN114443831A (zh) 2020-10-30 2020-10-30 Text classification method, apparatus and electronic device applying machine learning
CN202011196878.9 2020-10-30

Publications (1)

Publication Number Publication Date
WO2022089613A1 (zh) 2022-05-05

Family

ID=81357261

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/127675 WO2022089613A1 (zh) 2020-10-30 2021-10-29 Text classification method, apparatus and electronic device applying machine learning

Country Status (2)

Country Link
CN (1) CN114443831A (zh)
WO (1) WO2022089613A1 (zh)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
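CN108197633A (zh) * 2017-11-24 2018-06-22 百年金海科技有限公司 TensorFlow-based deep learning image classification and application deployment method
CN108875045A (zh) * 2018-06-28 2018-11-23 第四范式(北京)技术有限公司 Method and system for performing a machine learning process for text classification
CN110597958A (zh) * 2019-09-12 2019-12-20 苏州思必驰信息科技有限公司 Method and apparatus for training and using a text classification model
US20200005194A1 (en) * 2018-06-30 2020-01-02 Microsoft Technology Licensing, Llc Machine learning for associating skills with content
CN111210023A (zh) * 2020-01-13 2020-05-29 哈尔滨工业大学 System and method for automatic selection of dataset classification learning algorithms
CN111339304A (zh) * 2020-03-16 2020-06-26 闪捷信息科技有限公司 Machine-learning-based automatic text data classification method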

Also Published As

Publication number Publication date
CN114443831A (zh) 2022-05-06

Similar Documents

Publication Publication Date Title
US20190124020A1 (en) Chatbot Skills Systems And Methods
EP3244312B1 (en) A personal digital assistant
US20190103111A1 (en) Natural Language Processing Systems and Methods
WO2022048648A1 (zh) Method, apparatus, electronic device and storage medium for automatic model construction
US9471213B2 (en) Chaining applications
US10453165B1 (en) Computer vision machine learning model execution service
US11308940B2 (en) Counterfactual annotated dialogues for conversational computing
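CN115617327A (zh) Low-code page building system, method, and computer-readable storage medium
CN108171528B (zh) Attribution method and attribution system
WO2021228264A1 (zh) Method, apparatus, electronic device and storage medium for applying machine learning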
US11960517B2 (en) Dynamic cross-platform ask interface and natural language processing model
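CN117008923B (zh) Code generation, compilation and deployment method, platform and device based on a large AI model
CN110633959A (zh) Graph-structure-based approval task creation method, apparatus, device and medium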
US10776351B2 (en) Automatic core data service view generator
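CN116775183A (zh) Task generation method, system, device and storage medium based on a large language model
WO2023040143A1 (zh) Resource orchestration method, apparatus, device and storage medium for cloud services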
US20190079649A1 (en) Ui rendering based on adaptive label text infrastructure
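WO2022089613A1 (zh) Text classification method, apparatus and electronic device applying machine learning
WO2022135592A1 (zh) Method, apparatus, device and storage medium for generating a model training program image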
US10896161B2 (en) Integrated computing environment for managing and presenting design iterations
US20230195742A1 (en) Time series prediction method for graph structure data
CN109814857B (zh) Method and apparatus for customizable graphic primitive linkage
WO2020006090A1 (en) Skill-generating method, apparatus, and electonic device
CN111124386A (zh) Unity-based animation event processing method, apparatus, device and storage medium
Tian Application and analysis of artificial intelligence graphic element algorithm in digital media art design

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21885339

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21885339

Country of ref document: EP

Kind code of ref document: A1