CN111563163A - Text classification model generation method and device and data standardization method and device - Google Patents

Text classification model generation method and device and data standardization method and device Download PDF

Info

Publication number
CN111563163A
Authority
CN
China
Prior art keywords
text
sample
data
training
cleaned
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010358131.2A
Other languages
Chinese (zh)
Inventor
刘襄雄
鄢小征
林威扬
翟永强
毕永辉
叶阳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xiamen Meiya Pico Information Co Ltd
Original Assignee
Xiamen Meiya Pico Information Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xiamen Meiya Pico Information Co Ltd filed Critical Xiamen Meiya Pico Information Co Ltd
Priority to CN202010358131.2A
Publication of CN111563163A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G06F16/355 Class or cluster creation or modification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21 Design, administration or maintenance of databases
    • G06F16/215 Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Quality & Reliability (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The embodiment of the application discloses a method and a device for generating a text classification model, and a method and a device for standardizing data. One embodiment of the method for generating a text classification model comprises: acquiring a sample text set, wherein sample texts in the sample text set correspond to pre-labeled sample category information; carrying out data cleaning on the sample texts in the sample text set to obtain a cleaned text set; selecting cleaned texts from the cleaned text set as training texts to obtain a training text set; determining feature data of each training text in the training text set; and taking the obtained feature data as input, taking the sample category information corresponding to the input feature data as expected output, and training to obtain a text classification model. The method and the device can effectively improve the accuracy of the trained text classification model, facilitate classifying data with the text classification model, automatically standardize the data, and improve the efficiency of data standardization.

Description

Text classification model generation method and device and data standardization method and device
Technical Field
The embodiment of the application relates to the technical field of computers, in particular to a method and a device for generating a text classification model and a method and a device for standardizing data.
Background
With the mass popularization of computers and smartphones, the amount of information generated in the information era has grown rapidly. Information exists on large websites, blogs, and communities, both on and off the Internet, and permeates every aspect of people's daily lives, so the quantity of data to be acquired, stored, and processed has multiplied. Meanwhile, the large-scale information systems of governments and of public-security, procuratorial, and judicial departments have been built in phases and in a distributed fashion, with inconsistent structures and inconsistent field names, so that a large amount of data resources cannot be accurately and effectively understood and applied during integration.
In big-data engineering projects, data aggregation and standardization are essential, and how to quickly and effectively unify the standards for data resource structures and fields is a challenge currently faced.
Disclosure of Invention
An object of the embodiments of the present application is to provide a method and an apparatus for generating a text classification model, and a method and an apparatus for standardizing data, so as to solve the technical problems mentioned in the above background.
In a first aspect, an embodiment of the present application provides a method for generating a text classification model, where the method includes: acquiring a sample text set, wherein sample texts in the sample text set correspond to pre-labeled sample category information; carrying out data cleaning on the sample texts in the sample text set to obtain a cleaned text set; selecting cleaned texts from the cleaned text set as training texts to obtain a training text set; determining feature data of each training text in the training text set; and taking the obtained feature data as input, taking the sample category information corresponding to the input feature data as expected output, and training to obtain a text classification model.
In some embodiments, performing data cleaning on the sample texts in the sample text set to obtain a cleaned text set includes: for each sample text in the sample text set, performing word segmentation on the sample text to obtain a word sequence corresponding to the sample text; and extracting keywords from the obtained word sequence as the cleaned text corresponding to the sample text.
In some embodiments, extracting keywords from the obtained word sequence as the cleaned text corresponding to the sample text includes: taking the words in the obtained word sequence other than stop words, characters, and numbers as keywords to obtain the cleaned text corresponding to the sample text.
In some embodiments, selecting cleaned texts from the cleaned text set as training texts to obtain a training text set includes: determining the extraction probability corresponding to the cleaned texts of the same sample category in the cleaned text set; and extracting cleaned texts from each sample category as training texts according to the corresponding extraction probability, to obtain a training text set.
In some embodiments, after the text classification model is trained, the method further comprises: acquiring a test text set, wherein the test texts in the test text set correspond to pre-labeled label category information; determining feature data of each test text in the test text set; inputting each obtained feature data into the text classification model to obtain text category information; determining the classification accuracy of the text classification model based on the obtained text category information and the label category information; and in response to determining that the classification accuracy is less than or equal to a preset accuracy threshold, retraining the text classification model.
In a second aspect, an embodiment of the present application provides a data normalization method, including: acquiring data to be classified, wherein the data to be classified comprises a description field; cleaning the description field to obtain a cleaned text; determining feature data of the cleaned text; inputting the characteristic data into a pre-trained text classification model to obtain class information of the data to be classified, wherein the text classification model is obtained by pre-training based on the method described in any embodiment of the first aspect; and based on the category information, carrying out standardization processing on the data to be classified to obtain standardized data.
In a third aspect, an embodiment of the present application provides an apparatus for generating a text classification model, where the apparatus includes: the device comprises a first acquisition module, a second acquisition module and a third acquisition module, wherein the first acquisition module is used for acquiring a sample text set, and sample texts in the sample text set correspond to pre-labeled sample category information; the cleaning module is used for cleaning the data of the sample texts in the sample text set to obtain a cleaned sample set; the selection module is used for selecting the cleaned text from the cleaned text set as a training text to obtain a training text set; the first determination module is used for determining the characteristic data of each training text in the training text set; and the first training module is used for taking the obtained feature data as input, taking the sample class information corresponding to the input feature data as expected output, and training to obtain the text classification model.
In a fourth aspect, an embodiment of the present application provides a data normalization apparatus, including: the device comprises an acquisition module, a classification module and a classification module, wherein the acquisition module is used for acquiring data to be classified, and the data to be classified comprises description fields; the cleaning module is used for cleaning the description field to obtain a cleaned text; the determining module is used for determining the characteristic data of the cleaned text; the generating module is used for inputting the feature data into a pre-trained text classification model to obtain class information of the data to be classified, wherein the text classification model is obtained by pre-training based on the method described in any embodiment of the first aspect; and the processing module is used for carrying out standardization processing on the data to be classified based on the class information to obtain standardized data.
In a fifth aspect, an embodiment of the present application provides an electronic device, including: one or more processors; a storage device for storing one or more programs which, when executed by one or more processors, cause the one or more processors to implement the method as described in any one of the implementations of the first and second aspects.
In a sixth aspect, the present application provides a computer-readable storage medium, on which a computer program is stored, which, when executed by a processor, implements the method as described in any implementation manner of the first and second aspects.
According to the method and the device for generating the text classification model and the method and the device for standardizing the data, provided by the embodiment of the application, the sample text is subjected to data cleaning to obtain the cleaned text set, the cleaned text is selected from the cleaned text set to serve as the training text, and finally the text classification model is obtained by training the training text, so that the accuracy of the text classification model obtained by training can be effectively improved, the text classification model is favorably used for classifying the data, the data is automatically standardized, and the efficiency of standardizing the data is improved.
Drawings
Other features, objects and advantages of the present application will become more apparent upon reading of the following detailed description of non-limiting embodiments thereof, made with reference to the accompanying drawings in which:
FIG. 1 is an exemplary system architecture diagram in which the present application may be applied;
FIG. 2 is a flow diagram of one embodiment of a method of generating a text classification model according to the present application;
FIG. 3 is a flow diagram of yet another embodiment of a method of generating a text classification model according to the present application;
FIG. 4 is a flow diagram of one embodiment of a data normalization method according to the present application;
FIG. 5 is a schematic diagram of an embodiment of an apparatus for generating a text classification model according to the application;
FIG. 6 is a schematic block diagram of one embodiment of a data normalization apparatus according to the present application;
FIG. 7 is a block diagram of a computer system suitable for use in implementing the electronic device of an embodiment of the present application.
Detailed Description
The present application will be described in further detail with reference to the following drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the relevant invention and not restrictive of the invention. It should be noted that, for convenience of description, only the portions related to the related invention are shown in the drawings.
It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict. The present application will be described in detail below with reference to the embodiments with reference to the attached drawings.
FIG. 1 illustrates an exemplary system architecture 100 to which the data normalization method of embodiments of the present application may be applied.
As shown in fig. 1, system architecture 100 may include terminal device 101, network 102, and server 103. Network 102 is the medium used to provide communication links between terminal devices 101 and server 103. Network 102 may include various connection types, such as wired, wireless communication links, or fiber optic cables, to name a few.
A user may use terminal device 101 to interact with server 103 over network 102 to receive or send messages and the like. Various communication client applications, such as a data management application, a search application, a web browser application, a shopping application, an instant messaging tool, etc., may be installed on the terminal device 101.
The terminal device 101 may be various electronic devices including, but not limited to, a mobile terminal such as a mobile phone, a notebook computer, a digital broadcast receiver, a PDA (personal digital assistant), a PAD (tablet computer), a PMP (portable multimedia player), a vehicle-mounted terminal (e.g., a car navigation terminal), etc., and a fixed terminal such as a digital TV, a desktop computer, etc.
The server 103 may be a server that provides various services, such as a background server that processes text, data, and the like uploaded by the terminal apparatus 101. The background server can process various received texts and obtain processing results (such as trained text classification models, standardized data and the like).
It should be noted that the generation method or the data normalization method of the text classification model provided in the embodiment of the present application may be executed by the terminal device 101 or the server 103, and accordingly, the generation device or the data normalization device of the text classification model may be disposed in the terminal device 101 or the server 103.
It should be understood that the numbers of terminal devices, networks, and servers in fig. 1 are merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation. In the case where data does not need to be acquired from a remote location, the system architecture described above may not include a network, and may include only a server or a terminal device.
With continued reference to FIG. 2, a flow 200 of one embodiment of a method of generating a text classification model according to the present application is shown. The method comprises the following steps:
step 201, a sample text set is obtained.
In this embodiment, the execution subject of the method for generating a text classification model (e.g., the terminal device or the server shown in fig. 1) may obtain the sample text set locally or remotely. The sample texts in the sample text set correspond to pre-labeled sample category information, which is used to characterize the category of the sample text. As an example, the sample category information may be represented by a vector, e.g., (1, 0, 0, 0, 0 …) for characterizing the sample text as belonging to the first class and (0, 1, 0, 0, 0 …) for characterizing the sample text as belonging to the second class.
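As a minimal illustration (a sketch, not taken from the patent), such one-hot category vectors can be constructed as follows:

    # Minimal sketch: build the one-hot sample-category vectors described above.
    def one_hot(class_index, num_classes):
        """Return a vector with a 1 at class_index and 0 elsewhere."""
        vector = [0] * num_classes
        vector[class_index] = 1
        return vector

    print(one_hot(0, 5))  # [1, 0, 0, 0, 0] -> sample belongs to the first class
    print(one_hot(1, 5))  # [0, 1, 0, 0, 0] -> sample belongs to the second class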
Step 202, performing data cleaning on the sample texts in the sample text set to obtain a cleaned text set.
In this embodiment, the execution subject may perform data cleaning on the sample texts in the sample text set to obtain a cleaned text set. Data cleaning extracts the most critical information from a sample text, i.e., information that can characterize the category of the sample text.
The execution body may perform data cleaning in various ways, for example, removing various symbols in the sample text.
In some optional implementations of this embodiment, step 202 may be performed as follows:
for each sample text in the sample text set, performing the following steps:
step one, performing word segmentation on the sample text to obtain a word sequence corresponding to the sample text.
The executing body can segment the sample text according to various word segmentation algorithms. By way of example, the word segmentation algorithm may include, but is not limited to, at least one of: dictionary-based segmentation, statistics-based segmentation, rule-based segmentation, and the like.
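A minimal sketch of this segmentation step is given below; the choice of the jieba library is an assumption for illustration, since the patent does not name a specific segmenter:

    # Sketch of step one, assuming the jieba segmenter (an illustrative choice).
    import jieba

    def segment(sample_text):
        """Cut a sample text into its corresponding word sequence."""
        return list(jieba.cut(sample_text))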
And step two, extracting keywords from the obtained word sequence to serve as the cleaned text corresponding to the sample text.
The keywords are the words capable of characterizing the main content of the sample text. By way of example, common words (e.g., pronouns such as "I" and other function words) may be removed from the word sequence, with the remaining words used as the cleaned text.
In this implementation, extracting keywords from the word sequence corresponding to the sample text avoids the influence of meaningless words on the main content of the text, improving the accuracy of the trained text classification model.
In some optional implementations of this embodiment, the step two may be performed as follows:
and taking the words except the stop words, the characters and the numbers in the obtained word sequence as key words to obtain the cleaned text corresponding to the sample text. Specifically, english characters, numeric characters, and other characters may be removed using ascii code rules, regular expressions, and the like. Each word in the word sequence can be matched with a preset stop word set, so that stop words in the word sequence are removed.
By removing stop words, characters, and numbers from the word sequence, this implementation can accurately extract keywords, eliminate the influence of meaningless words on the main content of the text, and improve the accuracy of the trained text classification model.
And 203, selecting the cleaned text from the cleaned text set as a training text to obtain a training text set.
In this embodiment, the execution subject may select cleaned texts from the cleaned text set as training texts to obtain a training text set. The execution body may select the cleaned texts in various ways, for example, randomly selecting cleaned texts from each sample category (i.e., the category indicated by the sample category information) according to a manually set number for that category.
In some optional implementations of this embodiment, step 203 may be performed as follows:
firstly, the extraction probability corresponding to the cleaned text of the same sample type in the cleaned text set is determined. The extraction probability may be determined in various manners, for example, in a manner set manually. It may also be determined according to the PPS sampling method. PPS decimation is a scaled-back, size-scaled probabilistic sampling, referred to as PPS sampling for short. PPS sampling is a sampling approach that uses side information so that each unit has a probability of being decimated that is proportional to its size.
Then, cleaned texts are extracted from each sample category as training texts according to the corresponding extraction probability, obtaining a training text set. Generally, the more cleaned texts a sample category contains, the lower its corresponding extraction probability. In this way, the number of cleaned texts selected from each sample category can be kept balanced, improving the accuracy of the trained text classification model.
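One simple way to realize this balance is sketched below, assuming the extraction probability of a category is inversely proportional to the number of cleaned texts it contains (a rule consistent with, but not mandated by, the description above):

    # Sketch of the two selection steps: per-category extraction probability,
    # then probabilistic extraction of cleaned texts as training texts.
    import random

    def build_training_set(cleaned_by_category):
        """cleaned_by_category maps a sample category to its cleaned texts."""
        min_size = min(len(texts) for texts in cleaned_by_category.values())
        training_set = []
        for category, texts in cleaned_by_category.items():
            p = min_size / len(texts)  # larger category -> lower probability
            training_set.extend(t for t in texts if random.random() < p)
        return training_set

In expectation, each category contributes about min_size training texts, which keeps the categories balanced.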
Step 204, determining the feature data of each training text in the training text set.
In this embodiment, the execution subject may determine feature data of each training text in the training text set. Specifically, the executing agent may determine the feature data of the training text according to various existing text feature extraction methods. For example, the text feature extraction method may include, but is not limited to, at least one of: the TF-IDF (Term Frequency-Inverse Document Frequency) algorithm, word vector algorithms, the SVD (Singular Value Decomposition) algorithm, and the like.
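A sketch of this step using two of the listed options, TF-IDF followed by truncated SVD, is shown below; combining them into one pipeline is an illustrative choice, not a mandate of the description:

    # Sketch: TF-IDF features reduced with truncated SVD (scikit-learn).
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.decomposition import TruncatedSVD

    def extract_features(training_texts, n_components=100):
        """training_texts: cleaned word sequences joined by spaces."""
        tfidf = TfidfVectorizer()
        matrix = tfidf.fit_transform(training_texts)
        svd = TruncatedSVD(n_components=n_components)
        return svd.fit_transform(matrix), tfidf, svd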
And step 205, taking the obtained feature data as input, taking sample category information corresponding to the input feature data as expected output, and training to obtain a text classification model.
In this embodiment, the executing agent may use a machine learning method to train the obtained feature data as an input and the sample type information corresponding to the input feature data as an expected output to obtain the text classification model.
The training process of the model is a process of solving for an optimal solution. The optimal solution is given by the data labels, i.e., the sample category information, and the model is fitted toward it iteratively, mainly through error minimization: a loss function is set for the input feature data, the loss function computes the difference between the model's output and the expected output, and a gradient descent algorithm is then used to update the model's original weights.
The text classification model in this embodiment may be obtained by training based on various basic models, and the basic model may include, but is not limited to, at least one of the following: logistic regression models, neural network models, Support Vector Machines (SVMs), and the like.
Using a machine learning method, the execution agent may train the basic model with the feature data as input and the sample category information corresponding to the input feature data as the desired output. Each training pass over the input feature data yields an actual output, i.e., the data actually output by the basic model, which characterizes the sample category. The executing body may then adopt gradient descent and back propagation to adjust the parameters of the basic model based on the actual output and the desired output, use the model obtained after each parameter adjustment as the basic model for the next round, and end training when a preset end condition is met, thereby obtaining the text classification model.
It should be noted that the preset training end condition may include, but is not limited to, at least one of the following: the training time exceeds a preset duration; the number of training iterations exceeds a preset number; the loss value calculated using a preset loss function (e.g., a logarithmic loss function) converges.
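A sketch of this training step with a logistic-regression base model, one of the options listed above, follows; the iteration limit stands in for the "number of training iterations exceeds a preset number" end condition:

    # Sketch: train a logistic-regression text classifier (scikit-learn).
    from sklearn.linear_model import LogisticRegression

    def train_text_classifier(feature_data, sample_category_info, max_iter=1000):
        """feature_data: 2-D array; sample_category_info: one label per row."""
        model = LogisticRegression(max_iter=max_iter)  # minimizes log loss
        model.fit(feature_data, sample_category_info)
        return model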
In some optional implementations of this embodiment, as shown in fig. 3, after step 205, the following steps may also be included:
step 206, a test text set is obtained.
The test texts in the test text set correspond to pre-labeled label category information. In general, the test texts in the test text set and the sample texts in the sample text set may come from the same initial text set. As an example, the initial text set may be split using the train_test_split function of the model_selection module of the scikit-learn tool, resulting in a sample text set and a test text set.
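A sketch of that split is shown below; the texts, labels, and split ratio are illustrative:

    # Sketch: split an initial text set into sample (training) and test sets.
    from sklearn.model_selection import train_test_split

    initial_texts = ["文本甲", "文本乙", "文本丙", "文本丁"]  # illustrative
    initial_labels = [0, 1, 0, 1]                            # illustrative

    sample_texts, test_texts, sample_labels, test_labels = train_test_split(
        initial_texts, initial_labels, test_size=0.25, random_state=42)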
Step 207, determining the feature data of each test text in the test text set.
This step is substantially the same as step 204 in the corresponding embodiment of fig. 2, and is not described here again.
And step 208, inputting each obtained feature data into the text classification model to obtain text category information.
And step 209, determining the classification accuracy of the text classification model based on the obtained text category information and the label category information.
For example, the classification accuracy is obtained by dividing the number of correct classifications of the text classification model by the total number of test texts in the test text set.
Step 210, in response to determining that the classification accuracy is less than or equal to the preset accuracy threshold, retraining the text classification model.
Wherein, the method adopted by the retraining is consistent with the steps 201 to 205. Correspondingly, if the classification accuracy is greater than the accuracy threshold, the text classification model is determined to be trained.
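A sketch of the evaluation in steps 206 to 210 follows; the accuracy formula matches step 209, and the 0.9 threshold is an assumed value:

    # Sketch: decide whether the text classification model needs retraining.
    def needs_retraining(model, test_features, label_category_info,
                         accuracy_threshold=0.9):
        predictions = model.predict(test_features)
        correct = sum(p == y for p, y in zip(predictions, label_category_info))
        accuracy = correct / len(label_category_info)
        return accuracy <= accuracy_threshold  # True -> retrain (step 210)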
According to the method provided by the embodiment of the application, the sample text is subjected to data cleaning to obtain the cleaned text set, the cleaned text is selected from the cleaned text set to serve as the training text, and the training text is finally used for training to obtain the text classification model, so that the accuracy of the text classification model obtained through training can be effectively improved, the data can be classified by using the text classification model, the data can be automatically subjected to standardized processing, and the efficiency of data standardization is improved.
With further reference to FIG. 4, a flow 400 of one embodiment of a data normalization method according to the present application is shown. The method comprises the following steps:
step 401, obtaining data to be classified.
In the present embodiment, the execution subject of the data normalization method (e.g., the terminal device or the server shown in fig. 1) may acquire the data to be classified locally or remotely. The data to be classified is generally non-standardized data, i.e., its category, structure, and the like have not been standardized. The data to be classified includes a description field, which describes features such as the type of the data to be classified and is usually a passage of text.
And 402, cleaning the description field to obtain a cleaned text.
In this embodiment, the execution body may perform data cleaning on the description field to obtain a cleaned text. The method for data cleaning may be the same as the method described in the embodiment corresponding to fig. 2, and is not described here again.
And step 403, determining feature data of the cleaned text.
In this embodiment, step 403 is substantially the same as step 204 in the corresponding embodiment of fig. 2, and is not described herein again.
Step 404, inputting the feature data into a pre-trained text classification model to obtain the class information of the data to be classified.
In this embodiment, the executing entity may input the feature data into a pre-trained text classification model to obtain category information of the data to be classified. The text classification model is obtained by training in advance based on the method described in the embodiment corresponding to fig. 2.
And 405, standardizing the data to be classified based on the class information to obtain standardized data.
In this embodiment, the executing entity may standardize the data to be classified based on the category information to obtain standardized data. Specifically, the data to be classified is data whose category, structure, and the like have not yet been standardized. The execution main body may generate an identifier of the data to be classified based on the category information, and may standardize other information (e.g., structure, format, etc.) of the data to be classified according to the standardization mode corresponding to the category information.
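Because the patent leaves the concrete standardization mode unspecified, the following sketch is heavily hedged: the category-to-schema mapping and the field names are hypothetical placeholders:

    # Sketch of step 405: map the predicted category to a standard schema.
    STANDARD_SCHEMAS = {
        "person": {"fields": ["name", "id_number"]},   # hypothetical schema
        "vehicle": {"fields": ["plate", "owner"]},     # hypothetical schema
    }

    def standardize(data_to_classify, category):
        schema = STANDARD_SCHEMAS[category]
        return {
            "category": category,         # identifier from the category info
            "fields": schema["fields"],   # standardized structure for the category
            "payload": data_to_classify,  # original data carried along
        }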
In the data standardization method provided by the embodiment corresponding to fig. 4, the description fields of the data to be classified are classified by using the text classification method, and the data to be classified is standardized based on the classification result to obtain standardized data, so that the classification of the data to be classified is automatically judged, and the efficiency of data standardization is improved.
With further reference to fig. 5, as an implementation of the method shown in the above figures, the present application provides an embodiment of an apparatus for generating a text classification model, which corresponds to the method embodiment shown in fig. 2, and which can be applied in various electronic devices.
As shown in fig. 5, the apparatus 500 for generating a text classification model of the present embodiment includes: a first obtaining module 501, configured to obtain a sample text set, where sample texts in the sample text set correspond to pre-labeled sample category information; a cleaning module 502, configured to perform data cleaning on the sample texts in the sample text set to obtain a cleaned text set; a selecting module 503, configured to select cleaned texts from the cleaned text set as training texts, so as to obtain a training text set; a first determining module 504, configured to determine feature data of each training text in the training text set; and a first training module 505, configured to take the obtained feature data as input, take the sample category information corresponding to the input feature data as expected output, and train to obtain a text classification model.
In this embodiment, the first obtaining module 501 may obtain the sample text set locally or remotely. Wherein the sample texts in the sample text set correspond to pre-labeled sample category information. The sample class information is used to characterize the class of the sample text. As an example, the sample category information may be represented by a vector, e.g., (1, 0, 0, 0, 0 …) for characterizing the sample text as belonging to the first class and (0, 1, 0, 0, 0 …) for characterizing the sample text as belonging to the second class.
In this embodiment, the cleaning module 502 may perform data cleaning on the sample texts in the sample text set, so as to obtain a cleaned text set. The data washing is used for extracting the most critical information from the sample text, and the information can represent the category of the sample text.
The washing module 502 may perform data washing in various ways, for example, removing various symbols in the sample text.
In this embodiment, the selecting module 503 may select cleaned texts from the cleaned text set as training texts, so as to obtain a training text set. The selecting module 503 may select the cleaned texts in various ways, for example, randomly selecting cleaned texts from each sample category (i.e., the category indicated by the sample category information) according to a manually set number for that category.
In this embodiment, the first determining module 504 may determine feature data of each training text in the training text set. Specifically, the first determining module 504 may determine the feature data of the training text according to various existing text feature extraction methods. For example, the text feature extraction method may include, but is not limited to, at least one of: the TF-IDF (Term Frequency-Inverse Document Frequency) algorithm, word vector algorithms, the SVD (Singular Value Decomposition) algorithm, and the like.
In this embodiment, the first training module 505 may train the obtained feature data as input and the sample type information corresponding to the input feature data as expected output by using a machine learning method to obtain a text classification model.
The training process of the model is an optimal solution solving process, the optimal solution is given in a data labeling mode, namely sample category information, and the process of fitting the model to the optimal solution is mainly carried out iteratively by a method of error minimization. Setting a loss function for input characteristic data, wherein the loss function can calculate the difference between the output of the model and the expectation, and then updating and modifying the original weight of the model by using a gradient descent algorithm.
The text model in this embodiment may be obtained by training based on various basic models, and the basic model may include, but is not limited to, at least one of the following: logistic regression models, neural network models, Support Vector Machines (SVMs), and the like.
Using a machine learning method, the first training module 505 may train the basic model with the feature data as input and the sample category information corresponding to the input feature data as the desired output; each training pass over the input feature data yields an actual output, i.e., the data actually output by the basic model, which characterizes the sample category. The first training module 505 may then adopt gradient descent and back propagation to adjust the parameters of the basic model based on the actual output and the desired output, use the model obtained after each parameter adjustment as the basic model for the next round, and end training when a preset end condition is met, thereby obtaining the text classification model through training.
It should be noted that the preset training end condition may include, but is not limited to, at least one of the following: the training time exceeds a preset duration; the number of training iterations exceeds a preset number; the loss value calculated using a preset loss function (e.g., a logarithmic loss function) converges.
In some optional implementations of this embodiment, the cleaning module 502 may include: an extracting unit (not shown in the figure), configured to perform word segmentation on each sample text in the sample text set, so as to obtain a word sequence corresponding to the sample text; and extracting key words from the obtained word sequence to be used as cleaned texts corresponding to the sample texts.
In some optional implementations of this embodiment, the extracting unit may be further configured to: and taking the words except the stop words, the characters and the numbers in the obtained word sequence as key words to obtain the cleaned text corresponding to the sample text.
In some optional implementations of this embodiment, the selecting module 503 may include: a determining unit (not shown in the figure), configured to determine the extraction probability corresponding to the cleaned texts of the same sample category in the cleaned text set; and an extracting unit (not shown in the figure), configured to extract cleaned texts from each sample category according to the corresponding extraction probability to serve as training texts, so as to obtain a training text set.
In some optional implementations of this embodiment, the apparatus 500 may further include: a second obtaining module (not shown in the figure) for obtaining a test text set, wherein the test texts in the test text set correspond to pre-labeled label category information; a second determining module (not shown in the figure) for determining feature data of each test text in the test text set; a generating module (not shown in the figure) for inputting each obtained feature data into a text classification model to obtain text category information; a third determining module (not shown in the figure) for determining the classification accuracy of the text classification model based on the obtained text classification information and the labeling classification information; a second training module (not shown) for retraining the text classification model in response to determining that the classification accuracy is less than or equal to the preset accuracy threshold.
According to the device provided by the embodiment of the application, the sample text is subjected to data cleaning to obtain the cleaned text set, the cleaned text is selected from the cleaned text set to serve as the training text, and the training text is finally used for training to obtain the text classification model, so that the accuracy of the text classification model obtained through training can be effectively improved, the data can be classified by using the text classification model, the data can be automatically subjected to standardized processing, and the efficiency of data standardization is improved.
With further reference to fig. 6, as an implementation of the method shown in the above figures, the present application provides an embodiment of a data normalization apparatus, which corresponds to the embodiment of the method shown in fig. 4, and which can be applied in various electronic devices.
As shown in fig. 6, the data normalization apparatus 600 of the present embodiment includes: an obtaining module 601, configured to obtain data to be classified, where the data to be classified includes a description field; a cleaning module 602, configured to perform data cleaning on the description field to obtain a cleaned text; a determining module 603, configured to determine feature data of the cleaned text; a generating module 604, configured to input the feature data into a pre-trained text classification model to obtain category information of the data to be classified, where the text classification model is obtained by training in advance based on the method described in any embodiment of the first aspect; and a processing module 605, configured to standardize the data to be classified based on the category information to obtain standardized data.
In this embodiment, the obtaining module 601 may obtain the data to be classified from a local or remote location. The data to be classified is generally non-standardized data, i.e. the category, structure, etc. of the data to be classified are not standardized. The data to be classified includes a description field. The description field is used for describing features such as types of data to be classified, and the description field is usually a segment of words.
In this embodiment, the cleaning module 602 may perform data cleaning on the description field to obtain a cleaned text. The method for data cleaning may be the same as the method described in the embodiment corresponding to fig. 2, and is not described here again.
In this embodiment, the determining module 603 is substantially the same as the first determining module 504 in the embodiment corresponding to fig. 5, and is not described herein again.
In this embodiment, the generating module 604 may input the feature data into a pre-trained text classification model to obtain category information of the data to be classified. The text classification model is obtained by training in advance based on the method described in the embodiment corresponding to fig. 2.
In this embodiment, the processing module 605 may standardize the data to be classified based on the category information to obtain standardized data. Specifically, the data to be classified is data whose category, structure, and the like have not yet been standardized. The processing module 605 may generate an identifier of the data to be classified based on the category information, and may standardize other information (for example, structure, format, and the like) of the data to be classified according to the standardization mode corresponding to the category information.
According to the device provided by the embodiment of the application, the description fields of the data to be classified are classified by using the text classification method, and the data to be classified is subjected to standardized processing based on the classification result to obtain standardized data, so that the automatic classification judgment of the data to be classified is realized, and the efficiency of the data standardized processing is improved.
Referring now to FIG. 7, shown is a block diagram of a computer system 700 suitable for use in implementing the electronic device of an embodiment of the present application. The electronic device shown in fig. 7 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present application.
As shown in fig. 7, the computer system 700 includes a Central Processing Unit (CPU) 701, which can perform various appropriate actions and processes in accordance with a program stored in a Read-Only Memory (ROM) 702 or a program loaded from a storage section 708 into a Random Access Memory (RAM) 703. In the RAM 703, various programs and data necessary for the operation of the system 700 are also stored. The CPU 701, the ROM 702, and the RAM 703 are connected to each other via a bus 704. An input/output (I/O) interface 705 is also connected to the bus 704.
The following components are connected to the I/O interface 705: an input portion 706 including a keyboard, a mouse, and the like; an output section 707 including a display such as a Liquid Crystal Display (LCD) and a speaker; a storage section 708 including a hard disk and the like; and a communication section 709 including a network interface card such as a LAN card, a modem, or the like. The communication section 709 performs communication processing via a network such as the internet. A drive 710 is also connected to the I/O interface 705 as needed. A removable medium 711 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like is mounted on the drive 710 as necessary, so that a computer program read out therefrom is mounted into the storage section 708 as necessary.
In particular, according to an embodiment of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer-readable medium, the computer program comprising program code for performing the method illustrated in the flow chart. In such an embodiment, the computer program can be downloaded and installed from a network through the communication section 709, and/or installed from the removable medium 711. The computer program, when executed by the Central Processing Unit (CPU) 701, performs the above-described functions defined in the method of the present application.
It should be noted that the computer readable medium described herein can be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present application, a computer readable storage medium may be any tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device. A computer readable signal medium, by contrast, may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing.
Computer program code for carrying out operations for aspects of the present application may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, Smalltalk, or C++, and conventional procedural programming languages such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The modules described in the embodiments of the present application may be implemented by software or hardware. The described modules may also be provided in a processor, which may be described as: a processor includes a first obtaining module, a cleaning module, a selecting module, a first determining module, and a first training module. The names of these modules do not in some cases constitute a limitation on the module itself; for example, the first obtaining module may also be described as a "module for obtaining a sample text set".
As another aspect, the present application also provides a computer-readable storage medium, which may be included in the electronic device described in the above embodiments, or may exist separately without being assembled into the electronic device. The computer-readable storage medium carries one or more programs which, when executed by the electronic device, cause the electronic device to: acquire a sample text set, wherein sample texts in the sample text set correspond to pre-labeled sample category information; carry out data cleaning on the sample texts in the sample text set to obtain a cleaned text set; select cleaned texts from the cleaned text set as training texts to obtain a training text set; determine feature data of each training text in the training text set; and take the obtained feature data as input, take the sample category information corresponding to the input feature data as expected output, and train to obtain a text classification model.
Further, the one or more programs, when executed by the electronic device, may further cause the electronic device to: acquiring data to be classified, wherein the data to be classified comprises a description field; cleaning the description field to obtain a cleaned text; determining feature data of the cleaned text; inputting the characteristic data into a pre-trained text classification model to obtain class information of the data to be classified, wherein the text classification model is obtained by pre-training based on the method described in any embodiment of the first aspect; and based on the category information, carrying out standardization processing on the data to be classified to obtain standardized data.
The above description is only a preferred embodiment of the application and is illustrative of the principles of the technology employed. It will be appreciated by those skilled in the art that the scope of the invention herein disclosed is not limited to the particular combination of features described above, but also encompasses other arrangements formed by any combination of the above features or their equivalents without departing from the spirit of the invention. For example, the above features may be replaced with (but not limited to) features having similar functions disclosed in the present application.

Claims (10)

1. A method for generating a text classification model, the method comprising:
acquiring a sample text set, wherein sample texts in the sample text set correspond to pre-labeled sample category information;
carrying out data cleaning on the sample texts in the sample text set to obtain a cleaned text set;
selecting a cleaned text from the cleaned text set as a training text to obtain a training text set;
determining feature data of each training text in the training text set;
and taking the obtained feature data as input, taking sample category information corresponding to the input feature data as expected output, and training to obtain a text classification model.
2. The method according to claim 1, wherein the performing data cleaning on the sample texts in the sample text set to obtain a cleaned text set comprises:
for each sample text in the sample text set, performing word segmentation on the sample text to obtain a word sequence corresponding to the sample text; and extracting key words from the obtained word sequence to be used as cleaned texts corresponding to the sample texts.
3. The method of claim 2, wherein the extracting key terms from the obtained term sequence as the cleaned text corresponding to the sample text comprises:
and taking the words except the stop words, the characters and the numbers in the obtained word sequence as key words to obtain the cleaned text corresponding to the sample text.
4. The method according to claim 1, wherein selecting the cleaned text from the cleaned text set as a training text, resulting in a training text set, comprises:
determining the extraction probability corresponding to the cleaned text of the same sample type in the cleaned text set;
and extracting the cleaned text as a training text according to the corresponding extraction probability from the cleaned text under each sample category to obtain a training text set.
5. The method of any of claims 1-4, wherein after the training results in a text classification model, the method further comprises:
acquiring a test text set, wherein test texts in the test text set correspond to pre-labeled label category information;
determining feature data of each test text in the test text set;
inputting each obtained feature data into the text classification model to obtain text classification information;
determining the classification accuracy of the text classification model based on the obtained text classification information and the label category information;
in response to determining that the classification accuracy is less than or equal to a preset accuracy threshold, retraining the text classification model.
6. A method of data normalization, the method comprising:
acquiring data to be classified, wherein the data to be classified comprises a description field;
carrying out data cleaning on the description field to obtain a cleaned text;
determining feature data of the cleaned text;
inputting the characteristic data into a pre-trained text classification model to obtain class information of the data to be classified, wherein the text classification model is obtained by pre-training based on the method of one of claims 1 to 5;
and based on the category information, carrying out standardization processing on the data to be classified to obtain standardized data.
7. An apparatus for generating a text classification model, the apparatus comprising:
the device comprises a first obtaining module, a second obtaining module and a third obtaining module, wherein the first obtaining module is used for obtaining a sample text set, and sample texts in the sample text set correspond to pre-labeled sample category information;
the cleaning module is used for cleaning data of the sample texts in the sample text set to obtain a cleaned text set;
the selection module is used for selecting the cleaned text from the cleaned text set as a training text to obtain a training text set;
a first determining module, configured to determine feature data of each training text in the training text set;
and the first training module is used for taking the obtained feature data as input, taking the sample class information corresponding to the input feature data as expected output, and training to obtain the text classification model.
8. An apparatus for data normalization, the apparatus comprising:
the device comprises an acquisition module, a classification module and a classification module, wherein the acquisition module is used for acquiring data to be classified, and the data to be classified comprises description fields;
the cleaning module is used for cleaning the description field to obtain a cleaned text;
the determining module is used for determining the characteristic data of the cleaned text;
a generating module, configured to input the feature data into a pre-trained text classification model to obtain category information of the data to be classified, where the text classification model is obtained by pre-training based on the method according to any one of claims 1 to 5;
and the processing module is used for carrying out standardization processing on the data to be classified based on the category information to obtain standardized data.
9. An electronic device, comprising:
one or more processors;
a storage device, configured to store one or more programs,
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method according to any one of claims 1-6.
10. A computer-readable storage medium, on which a computer program is stored, wherein the computer program, when executed by a processor, implements the method according to any one of claims 1-6.
CN202010358131.2A 2020-04-29 2020-04-29 Text classification model generation method and device and data standardization method and device Pending CN111563163A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010358131.2A CN111563163A (en) 2020-04-29 2020-04-29 Text classification model generation method and device and data standardization method and device


Publications (1)

Publication Number Publication Date
CN111563163A true CN111563163A (en) 2020-08-21

Family

ID=72070638

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010358131.2A Pending CN111563163A (en) 2020-04-29 2020-04-29 Text classification model generation method and device and data standardization method and device

Country Status (1)

Country Link
CN (1) CN111563163A (en)


Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110019782A (en) * 2017-09-26 2019-07-16 北京京东尚科信息技术有限公司 Method and apparatus for exporting text categories
CN108595542A (en) * 2018-04-08 2018-09-28 北京奇艺世纪科技有限公司 A kind of textual classification model generates, file classification method and device
CN110119786A (en) * 2019-05-20 2019-08-13 北京奇艺世纪科技有限公司 Text topic classification method and device
CN110457476A (en) * 2019-08-06 2019-11-15 北京百度网讯科技有限公司 Method and apparatus for generating disaggregated model
CN111078877A (en) * 2019-12-05 2020-04-28 支付宝(杭州)信息技术有限公司 Data processing method, training method of text classification model, and text classification method and device

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
XIONG HAITAO: "Research on Complex Data Analysis Methods and Their Applications" (《复杂数据分析方法及其应用研究》), 30 May 2013 *
XU GUIXIAN: "Research on Text Classification Technology" (《文本分类技术研究》), 30 June 2010 *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112463963A (en) * 2020-11-30 2021-03-09 深圳前海微众银行股份有限公司 Method for identifying target public sentiment, model training method and device
CN113468289A (en) * 2021-07-23 2021-10-01 京东城市(北京)数字科技有限公司 Training method and device of event detection model
CN113688036A (en) * 2021-08-13 2021-11-23 北京灵汐科技有限公司 Data processing method, device, equipment and storage medium
CN114997338A (en) * 2022-07-19 2022-09-02 成都数之联科技股份有限公司 Project classification and classification model training method, device, medium and equipment
CN115617943A (en) * 2022-10-09 2023-01-17 名之梦(上海)科技有限公司 Text cleaning method, device, equipment and computer readable storage medium
CN115617943B (en) * 2022-10-09 2023-06-30 名之梦(上海)科技有限公司 Text cleaning method, apparatus, device and computer readable storage medium

Similar Documents

Publication Publication Date Title
CN109460513B (en) Method and apparatus for generating click rate prediction model
CN111563163A (en) Text classification model generation method and device and data standardization method and device
US20190163742A1 (en) Method and apparatus for generating information
CN108520470B (en) Method and apparatus for generating user attribute information
CN110555451A (en) information identification method and device
CN114385780B (en) Program interface information recommendation method and device, electronic equipment and readable medium
CN112200173B (en) Multi-network model training method, image labeling method and face image recognition method
CN113657113A (en) Text processing method and device and electronic equipment
CN111191677A (en) User characteristic data generation method and device and electronic equipment
CN113111167A (en) Method and device for extracting vehicle model of alarm receiving and processing text based on deep learning model
CN113946648B (en) Structured information generation method and device, electronic equipment and medium
CN110705308A (en) Method and device for recognizing field of voice information, storage medium and electronic equipment
CN116244146A (en) Log abnormality detection method, training method and device of log abnormality detection model
CN115292487A (en) Text classification method, device, equipment and medium based on naive Bayes
CN111079185B (en) Database information processing method and device, storage medium and electronic equipment
CN113111897A (en) Alarm receiving and warning condition type determining method and device based on support vector machine
CN113111165A (en) Deep learning model-based alarm receiving warning condition category determination method and device
CN111274383B (en) Object classifying method and device applied to quotation
CN112947928A (en) Code evaluation method and device, electronic equipment and storage medium
CN111767290A (en) Method and apparatus for updating a user representation
CN113111181B (en) Text data processing method and device, electronic equipment and storage medium
CN113806485B (en) Intention recognition method and device based on small sample cold start and readable medium
CN111131354A (en) Method and apparatus for generating information
US20240152933A1 (en) Automatic mapping of a question or compliance controls associated with a compliance standard to compliance controls associated with another compliance standard
CN117391076B (en) Acquisition method and device of identification model of sensitive data, electronic equipment and medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20200821