CN112148841B

CN112148841B - Object classification and classification model construction method and device

Info

Publication number: CN112148841B
Application number: CN202011064067.3A
Authority: CN
Inventors: 刘阳; 周晗; 黄文瀚; 柳超
Original assignee: Beijing Jindi Credit Service Co ltd
Current assignee: Beijing Jindi Credit Service Co ltd
Priority date: 2020-09-30
Filing date: 2020-09-30
Publication date: 2024-04-19
Anticipated expiration: 2040-09-30
Also published as: CN112148841A

Abstract

The invention discloses a method and a device for object classification and classification model construction, and relates to the field of computer technology. One embodiment of the object classification method comprises: acquiring initial feature data of an object to be classified, wherein the initial feature data comprises identification information data and attribute information data of the object to be classified; performing word segmentation on the identification information data and the attribute information data to obtain a feature word set, wherein the feature word set comprises at least one feature word; and performing vector representation on the feature words in the feature word set, and determining the target category to which the object to be classified belongs based on a trained classification model. With this method, word vectors derived from the feature word set of the initial feature data are input into the trained classification model, so that the category of the object to be classified can be determined automatically, quickly and accurately.

Description

Object classification and classification model construction method and device

Technical Field

The present invention relates to the field of computer technologies, and in particular, to a method and apparatus for object classification and classification model construction.

Background

In real-world applications, many scenarios require information to be classified; for example, an enterprise may be assigned to an industry according to its business scope. In the prior art, classification is typically performed manually. Because the business scope of enterprises, especially small and micro enterprises, is often overly broad and contains redundant business items, classification can be ambiguous or even wrong, and manual industry classification is time-consuming and inefficient.

Disclosure of Invention

In view of this, the embodiment of the invention provides a method and a device for object classification and classification model construction, which can automatically, quickly and accurately determine the target class to which an object to be classified belongs by inputting word vectors into a trained classification model according to a feature word set of initial feature data and vector representation of feature words.

To achieve the above object, according to one aspect of an embodiment of the present invention, there is provided a method of object classification.

The object classification method of the embodiment of the invention comprises the following steps: acquiring initial characteristic data of an object to be classified, wherein the initial characteristic data comprises identification information data and attribute information data of the object to be classified; word segmentation is carried out on the identification information data and the attribute information data to obtain a characteristic word set, wherein the characteristic word set comprises at least one characteristic word; and carrying out vector representation on the feature words in the feature word set, and determining the target category to which the object to be classified belongs based on a trained classification model.

Optionally, the step of performing word segmentation on the identification information data and the attribute information data to obtain a feature word set includes: performing word segmentation processing on the identification information data and the attribute information data in the initial feature data to obtain a plurality of feature words to be cleaned; and cleaning the plurality of feature words to be cleaned to obtain cleaned feature words, wherein the cleaned feature words form a feature word set.

Optionally, the step of performing cleaning treatment on the feature words to be cleaned to obtain cleaned feature words includes: removing useless words in the plurality of feature words to be cleaned to obtain feature words after first processing; wherein, the useless word at least comprises one of the following: prepositions, adverbs, repeated words, pre-specified words; and/or removing words with feature information content lower than a threshold value in the plurality of feature words to be cleaned according to the historical word frequencies of the plurality of feature words to be cleaned to obtain second processed feature words; the cleaned feature words comprise the first processed feature words and/or the second processed feature words.

Optionally, the step of performing word segmentation on the identification information data and the attribute information data in the initial feature data to obtain a plurality of feature words to be cleaned includes: word segmentation is carried out on the identification information data and the attribute information data in the initial feature data to obtain a plurality of initial feature words and word sequences of each initial feature word in the corresponding identification information data or attribute information data; combining the plurality of initial feature words according to the word order to obtain combined feature words; the initial feature words and the combined feature words form the feature words to be cleaned.

Optionally, the step of acquiring initial feature data of the object to be classified, where the initial feature data includes identification information data and attribute information data of the object to be classified, includes: acquiring initial feature data of an object to be classified, wherein the object to be classified is an enterprise to be classified, and the initial feature data comprises information data indicating the industry to which the enterprise to be classified belongs. The information data indicating the industry may be words in the company name that indicate an industry, such as "science and technology limited company" or "intelligent information", or items in the business scope, such as "technical development" or "sales"; the industry to which the enterprise belongs may then be determined from the company name and the business scope.

In order to achieve the above object, according to another aspect of the embodiments of the present invention, there is provided a classification model construction method.

The method for constructing the classification model comprises the following steps: respectively obtaining sample feature data of a plurality of sample objects marked with belonging categories to obtain a text corpus; the sample characteristic data comprises identification information data and attribute information data of the sample object; performing word segmentation on sample feature data in the text corpus to obtain feature dictionaries corresponding to all categories; and training to obtain a classification model based on the feature dictionary corresponding to each category and a classification algorithm.

Optionally, the step of performing word segmentation on the sample feature data in the text corpus to obtain feature dictionaries corresponding to each category includes: performing word segmentation processing on the sample feature data in the text corpus to obtain a plurality of sample feature words to be cleaned corresponding to each category; and removing useless words in the plurality of sample feature words to be cleaned, wherein the useless words at least comprise one of the following: prepositions, adverbs, repeated words, pre-specified words; and/or removing words with feature information content lower than a threshold value from the sample feature words to be cleaned according to the word frequencies of the sample feature words to be cleaned; and obtaining cleaned sample feature words corresponding to each category, wherein the cleaned sample feature words form a feature dictionary.

Optionally, the step of performing word segmentation on the sample feature data in the text corpus to obtain feature dictionaries corresponding to each category includes: performing word segmentation on sample feature data corresponding to each category in the text corpus to obtain a plurality of initial sample feature words and word sequences of each initial sample feature word in the corresponding sample feature data; combining the plurality of initial sample feature words according to the word sequence to obtain combined sample feature words; the plurality of initial sample feature words and the combined sample feature word form the feature dictionary.

Optionally, the step of combining the plurality of initial sample feature words according to the word order to obtain a combined sample feature word includes: combining initial sample feature words of a plurality of sample objects under each category according to the word sequence to obtain a word space of each sample object, wherein the word space comprises a plurality of to-be-screened combined feature words of the sample object under multiple granularities; and calculating the similarity between word spaces of a plurality of sample objects according to each category, and screening out combined sample feature words from a plurality of combined feature words to be screened in the word spaces according to the similarity.

To achieve the above object, according to still another aspect of an embodiment of the present invention, there is provided an apparatus for classifying objects.

The apparatus for object classification of the embodiment of the invention comprises:

an initial feature data acquisition module, configured to acquire initial feature data of an object to be classified, wherein the initial feature data comprises identification information data and attribute information data of the object to be classified;

a feature word set acquisition module, configured to perform word segmentation on the identification information data and the attribute information data to obtain a feature word set, wherein the feature word set comprises at least one feature word;

and a target category determining module, configured to perform vector representation on the feature words in the feature word set and determine the target category to which the object to be classified belongs based on a trained classification model.

In order to achieve the above object, according to still another aspect of an embodiment of the present invention, there is provided a classification model construction apparatus.

The classification model construction apparatus of the embodiment of the invention comprises:

The text corpus acquisition module is used for respectively acquiring sample characteristic data of a plurality of sample objects marked with belonging categories to obtain a text corpus; the sample characteristic data comprises identification information data and attribute information data of the sample object;

the feature dictionary determining module is used for carrying out word segmentation processing on the sample feature data in the text corpus to obtain feature dictionaries corresponding to all the categories;

And the training module is used for training to obtain a classification model based on the feature dictionary corresponding to each category and the classification algorithm.

To achieve the above object, according to still another aspect of an embodiment of the present invention, there is provided an electronic apparatus.

The electronic equipment of the embodiment of the invention comprises: one or more processors; and a storage means for storing one or more programs which, when executed by the one or more processors, cause the one or more processors to implement the object classification and classification model construction method of any of the above.

To achieve the above object, according to still another aspect of the embodiments of the present invention, there is provided a computer-readable medium having stored thereon a computer program, characterized in that the program, when executed by a processor, implements the object classification and classification model construction method of any one of the above.

One embodiment of the above invention has the following advantages or benefits: according to the initial feature data of the object to be classified, the feature word set is obtained after word segmentation, the feature words in the feature word set are subjected to vector representation, and word vectors are input into a trained classification model, so that the target category of the object to be classified can be automatically, quickly and accurately determined, and the problems of high time cost, low efficiency and the like caused by manual classification in the prior art are solved.

Further effects of the above-described non-conventional alternatives are described below in connection with the embodiments.

Drawings

The drawings are included to provide a better understanding of the invention and are not to be construed as unduly limiting the invention. Wherein:

FIG. 1 is a schematic diagram of the main flow of a method of object classification according to an embodiment of the invention;

FIG. 2 is a schematic diagram of a method of object classification according to an embodiment of the invention;

FIG. 3 is a schematic diagram of a method of determining an industry class to which an enterprise belongs according to an embodiment of the invention;

FIG. 4 is a schematic diagram of the main flow of a classification model construction method according to an embodiment of the invention;

FIG. 5 is a schematic diagram of a classification model construction method according to an embodiment of the invention;

FIG. 6 is a schematic diagram of the main modules of an apparatus for object classification according to an embodiment of the invention;

FIG. 7 is a schematic diagram of main modules of a classification model construction apparatus according to an embodiment of the invention;

FIG. 8 is an exemplary system architecture diagram in which embodiments of the present invention may be applied;

FIG. 9 is a schematic diagram of a computer system suitable for use in implementing an embodiment of the invention.

Detailed Description

Exemplary embodiments of the present invention will now be described with reference to the accompanying drawings, in which various details of the embodiments of the present invention are included to facilitate understanding, and are to be considered merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the invention. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.

Fig. 1 is a schematic diagram of main flow of a method for classifying objects according to an embodiment of the present invention, and as shown in fig. 1, the method for classifying objects according to an embodiment of the present invention mainly includes:

Step S101: acquire initial feature data of the object to be classified, wherein the initial feature data comprises identification information data and attribute information data of the object to be classified.

Step S102: perform word segmentation on the identification information data and the attribute information data to obtain a feature word set, wherein the feature word set comprises at least one feature word.

Step S103: perform vector representation on the feature words in the feature word set, and determine the target category to which the object to be classified belongs based on the trained classification model.

In the embodiment of the invention, the initial feature data of the object to be classified comprises at least information data that can indicate the category to which the object belongs. The identification information data included in the initial feature data refers to information data used to identify the object, such as an object name, an object address, or an object code; in general, the identification information data is also indicative of the category to which the object belongs. The attribute information data is information data that characterizes features of the category to which the object belongs. Performing vector representation on the feature words in the feature word set means mapping each feature word in the set to a vector, yielding a vector space for the feature word set. The vectors in this vector space are input into the trained classification model, which automatically outputs the target category to which the object to be classified belongs. Optionally, the trained classification model is based on fastText, a simple text classification model proposed by Facebook AI Research. Experiments show that, in general, fastText achieves accuracy comparable to that of deep learning models while requiring far less computation time. fastText can be used as a baseline model for text classification.
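The vector representation and classification step can be illustrated with a minimal sketch in the style of fastText, which averages the vectors of the input words and feeds the result to a linear classifier with a softmax output. The two-dimensional word vectors, the category names and the weight matrix below are hypothetical values chosen for illustration only; a trained model would learn them from data, and the embodiment does not prescribe these values.

```python
import math
from typing import Dict, List, Sequence, Tuple

DIM = 2  # hypothetical embedding size, kept tiny for readability

# Hypothetical word vectors; a trained model would learn these.
WORD_VECTORS: Dict[str, List[float]] = {
    "software": [1.0, 0.0],
    "network": [0.9, 0.1],
    "retail": [0.0, 1.0],
    "wholesale": [0.1, 0.9],
}

# Hypothetical linear classifier: one weight row per category.
CATEGORIES = ["software_development", "retail"]
WEIGHTS = [[2.0, -2.0], [-2.0, 2.0]]


def average_vector(words: Sequence[str]) -> List[float]:
    """Map each known feature word to its vector and average the vectors."""
    known = [WORD_VECTORS[w] for w in words if w in WORD_VECTORS]
    if not known:
        return [0.0] * DIM
    return [sum(v[i] for v in known) / len(known) for i in range(DIM)]


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over the category scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def classify(words: Sequence[str]) -> Tuple[str, float]:
    """Return the most probable category and its probability."""
    hidden = average_vector(words)
    scores = [sum(w * h for w, h in zip(row, hidden)) for row in WEIGHTS]
    probs = softmax(scores)
    best = max(range(len(probs)), key=probs.__getitem__)
    return CATEGORIES[best], probs[best]


label, prob = classify(["software", "network", "unknown"])
print(label, round(prob, 3))
```

Words without a vector (here "unknown") are simply skipped when averaging, which is one possible policy and not something the embodiment specifies.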

According to the embodiment of the invention, according to the initial feature data of the object to be classified, the feature word set is obtained after word segmentation, the feature words in the feature word set are subjected to vector representation, and the word vectors are input into the trained classification model, so that the target category of the object to be classified can be automatically, quickly and accurately determined, and the problems of high time cost, low efficiency and the like caused by manual classification in the prior art are solved.

In an embodiment of the present invention, in the process of performing word segmentation on the identification information data and the attribute information data to obtain the feature word set, word segmentation is first performed on the identification information data and the attribute information data in the initial feature data to obtain a plurality of feature words to be cleaned. The feature words to be cleaned are then cleaned, and the cleaned feature words form the feature word set. During classification, some words may be useless or carry little feature information; removing them improves classification accuracy and speeds up classification.

Preferably, in the cleaning process, useless words among the plurality of feature words to be cleaned are removed to obtain first processed feature words. The useless words comprise at least one of the following: prepositions, adverbs, repeated words, and pre-specified words. Pre-specified words are words designated in advance: in some classification scenarios certain words are known to be unhelpful, so they may be designated as "pre-specified words" and removed during cleaning. Alternatively or additionally, words whose feature information content is lower than a threshold are removed from the plurality of feature words to be cleaned according to their historical word frequencies, to obtain second processed feature words. The cleaned feature words comprise the first processed feature words and/or the second processed feature words.

Words whose feature information content is lower than the threshold are words that carry little classification information. For example, a word that occurs relatively rarely among the historical words of a given category, that is, a word with a low historical word frequency, is determined to be such a word. The threshold may be adjusted dynamically or set manually, and may be determined according to service requirements, historical data and the like. Further, the feature information content may be determined by an algorithm such as TF-IDF, based in particular on how often the word occurs across the historical words of each category; a word that occurs in all of them provides no distinction between categories.
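The two cleaning passes described above can be sketched as a single function. This is a minimal illustration under stated assumptions: the useless-word list and the per-word information scores are supplied by the caller (the name `info_score` and the policy of treating unseen words as score 0.0 are hypothetical choices, not taken from the embodiment).

```python
from typing import Dict, Iterable, List, Set


def clean_feature_words(
    words: Iterable[str],
    useless_words: Set[str],
    info_score: Dict[str, float],
    threshold: float,
) -> List[str]:
    """Remove useless, repeated and low-information words, keeping word order."""
    cleaned: List[str] = []
    seen: Set[str] = set()
    for word in words:
        # First pass: prepositions, adverbs and pre-specified words.
        if word in useless_words:
            continue
        # First pass: repeated words are kept only once.
        if word in seen:
            continue
        seen.add(word)
        # Second pass: drop words whose information score is below the
        # threshold. Assumption: words with no historical score count as 0.0.
        if info_score.get(word, 0.0) < threshold:
            continue
        cleaned.append(word)
    return cleaned


words = ["Beijing", "computer", "network", "of", "computer", "sales"]
result = clean_feature_words(
    words,
    useless_words={"of", "Beijing"},
    info_score={"computer": 0.8, "network": 0.9, "sales": 0.1},
    threshold=0.5,
)
print(result)  # ['computer', 'network']
```

Running the two passes in one loop corresponds to the "and" branch of the "and/or" wording; either pass can be disabled by passing an empty useless-word set or a threshold of 0.0.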

Preferably, in an embodiment of the present invention, in the process of performing word segmentation on the identification information data and the attribute information data in the initial feature data to obtain a plurality of feature words to be cleaned, word segmentation is performed on the identification information data and the attribute information data to obtain a plurality of initial feature words, together with the word order of each initial feature word in the corresponding identification information data or attribute information data. The initial feature words are then combined according to the word order to obtain combined feature words; the initial feature words and the combined feature words together form the plurality of feature words to be cleaned. For example, the software development industry may include the feature word "computer network", while the retail industry may include the feature word "computer retailer". If these are segmented, both categories share the word "computer" and therefore appear similar; in combined form, however, "computer network" represents the software development industry and "computer retailer" represents the retail industry. Within the same category, combined words can also increase intra-category similarity. This mainly addresses cases such as "computer network" and "computer network development", which after segmentation become "computer/network" and "computer/network/development".
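One way to combine initial feature words according to word order is to join every contiguous run of two or more words. The sketch below assumes that only adjacent words are combined and that runs are capped at a hypothetical `max_len`; the embodiment states only that combination follows word order, so both choices are illustrative.

```python
from typing import List, Sequence


def combine_by_word_order(
    initial_words: Sequence[str], max_len: int = 3, joiner: str = " "
) -> List[str]:
    """Join every contiguous run of 2..max_len words, preserving word order."""
    combined: List[str] = []
    count = len(initial_words)
    for size in range(2, min(max_len, count) + 1):
        for start in range(count - size + 1):
            combined.append(joiner.join(initial_words[start:start + size]))
    return combined


def words_to_clean(initial_words: Sequence[str], max_len: int = 3) -> List[str]:
    """Initial feature words plus combined feature words, as one list."""
    return list(initial_words) + combine_by_word_order(initial_words, max_len)


print(combine_by_word_order(["computer", "network", "development"]))
# ['computer network', 'network development', 'computer network development']
```

For Chinese text the joiner would typically be the empty string, so that the combined word reproduces the original character sequence.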

In an embodiment of the present invention, initial feature data of an object to be classified is acquired, where the initial feature data includes identification information data and attribute information data of the object to be classified, the object to be classified is an enterprise, and the initial feature data includes information data indicating the industry to which the enterprise belongs. The identification information data and the attribute information data are then extracted from the initial feature data, the identification information data being the enterprise name and the attribute information data being the business scope. The initial feature data of an enterprise includes many items, such as the enterprise name, business scope, enterprise address, and legal representative, but some of these have no bearing on industry classification. Among the enterprise-related data, the name and the business scope best reflect the industry characteristics of the enterprise, so after the feature data of the enterprise is acquired, the enterprise name and business scope are extracted. Because the business scope of enterprises, especially small and micro enterprises, is often overly broad and contains redundant business items, classification can be ambiguous or even wrong, and manual industry classification in the prior art is time-consuming and inefficient. According to the embodiment of the invention, the industry of an enterprise can be determined from its name and business scope.

FIG. 2 is a schematic diagram of a method of object classification according to an embodiment of the invention; as shown in fig. 2, the method for classifying objects according to the embodiment of the present invention includes:

Step S201: acquire initial feature data of the object to be classified.

Step S202: perform word segmentation on the identification information data and the attribute information data in the initial feature data to obtain a plurality of feature words to be cleaned.

Step S203: clean the plurality of feature words to be cleaned to obtain cleaned feature words.

Step S204: combine the cleaned feature words according to word order to obtain combined feature words.

Step S205: perform vector representation on the feature words in the feature word set, and determine the target category to which the object to be classified belongs based on the trained classification model.

FIG. 3 is a schematic diagram of a method of determining an industry class to which an enterprise belongs according to an embodiment of the invention; as shown in fig. 3, the method for determining the industry category to which the enterprise belongs according to the embodiment of the present invention includes:

Step S301: acquire initial feature data of the enterprise to be classified.

Step S302: extract the enterprise name and the business scope from the initial feature data.

Step S303: preprocess the enterprise name and the business scope, and obtain word vectors from the preprocessing result. The preprocessing comprises one of the following: word segmentation, removing useless words, and filtering out words whose feature information content is lower than a threshold.

Step S304: determine the industry category of the enterprise to be classified according to the word vectors and the trained industry classification model.
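Steps S301 to S304 can be sketched as a small pipeline. Everything named here is an assumption made for illustration: the record keys `name` and `business_scope`, the regular-expression tokenizer (a stand-in that only suits space-delimited text, whereas Chinese text would need a dedicated word segmenter), and the `keyword_model` callable that stands in for the trained industry classification model.

```python
import re
from typing import Callable, Dict, List, Set


def extract_name_and_scope(record: Dict[str, str]) -> str:
    """S302: keep only the enterprise name and business scope fields."""
    return " ".join(record.get(key, "") for key in ("name", "business_scope"))


def segment(text: str) -> List[str]:
    """S303 stand-in: lowercase and split on non-word characters."""
    return [token for token in re.split(r"\W+", text.lower()) if token]


def classify_enterprise(
    record: Dict[str, str],
    useless_words: Set[str],
    model: Callable[[List[str]], str],
) -> str:
    """S301-S304: extract, preprocess, then hand the tokens to the model."""
    tokens = segment(extract_name_and_scope(record))
    tokens = [token for token in tokens if token not in useless_words]
    return model(tokens)


def keyword_model(tokens: List[str]) -> str:
    """Hypothetical stand-in for a trained industry classification model."""
    return "software" if "software" in tokens else "other"


record = {
    "name": "Acme Software Co., Ltd.",
    "business_scope": "Software development; technical services",
    "address": "Beijing",  # ignored: not used for industry classification
}
print(classify_enterprise(record, {"co", "ltd"}, keyword_model))  # software
```

Passing the model as a callable keeps the preprocessing independent of whichever classifier is actually trained.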

According to the embodiment of the invention, according to the initial feature data of the object to be classified, the feature word set is obtained after word segmentation, the feature words in the feature word set are subjected to vector representation, and the word vectors are input into the trained classification model, so that the target category of the object to be classified can be automatically, quickly and accurately determined, the accuracy of industry classification is greatly improved, and the cost of manual classification is reduced.

FIG. 4 is a schematic diagram of the main flow of a classification model construction method according to an embodiment of the invention; as shown in fig. 4, the method for constructing a classification model according to the embodiment of the present invention mainly includes:

Step S401: acquire sample feature data of a plurality of sample objects, each labeled with the category to which it belongs, to obtain a text corpus; the sample feature data comprises identification information data and attribute information data of the sample object.

Step S402: perform word segmentation on the sample feature data in the text corpus to obtain a feature dictionary corresponding to each category.

Step S403: train a classification model based on the feature dictionary corresponding to each category and a classification algorithm.

According to the embodiment of the invention, a text corpus is obtained by acquiring sample feature data containing identification information data and attribute information data. Word segmentation is performed on the sample feature data in the text corpus to obtain a feature dictionary corresponding to each category. A classification model is then trained based on the feature dictionary corresponding to each category and a classification algorithm, yielding a classification model with high accuracy. Preferably, preprocessing the corpus by cleaning out useless words and words whose feature information content is lower than a threshold, and building combined keyword phrases, improves the accuracy of the trained classification model.
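If fastText is chosen as the classification algorithm, the labeled and segmented samples need to be written in its supervised training format, where each line starts with a label carrying the default `__label__` prefix followed by the space-separated words. The sketch below only prepares such a file; the helper names and the sample data are hypothetical, and the actual training call is shown as a comment because it requires the `fasttext` package.

```python
from typing import Iterable, List, Sequence, Tuple

Sample = Tuple[str, Sequence[str]]  # (category label, cleaned feature words)


def to_fasttext_line(label: str, words: Iterable[str]) -> str:
    """Format one labeled sample as a fastText supervised training line."""
    safe_label = "_".join(label.split())  # labels must not contain whitespace
    return "__label__" + safe_label + " " + " ".join(words)


def build_training_lines(samples: Iterable[Sample]) -> List[str]:
    """Convert all labeled samples to training lines."""
    return [to_fasttext_line(label, words) for label, words in samples]


def write_training_file(samples: Iterable[Sample], path: str) -> None:
    """Write the training lines to a UTF-8 text file, one sample per line."""
    with open(path, "w", encoding="utf-8") as handle:
        for line in build_training_lines(samples):
            handle.write(line + "\n")


samples = [
    ("software development", ["computer", "network", "computer network"]),
    ("retail", ["computer", "retailer", "computer retailer"]),
]
for line in build_training_lines(samples):
    print(line)

# With the fasttext package installed, training could then look like:
#   import fasttext
#   model = fasttext.train_supervised(input="train.txt")
```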

In an embodiment of the present invention, in the process of performing word segmentation on sample feature data in a text corpus to obtain feature dictionaries corresponding to each category, the word segmentation is performed on the sample feature data in the text corpus to obtain a plurality of sample feature words to be cleaned corresponding to each category. And then, removing useless words in the plurality of sample feature words to be cleaned, wherein the useless words at least comprise one of the following: prepositions, adverbs, repeated words, pre-specified words; and/or removing words with feature information content lower than a threshold value from the sample feature words to be cleaned according to the word frequencies of the sample feature words to be cleaned. And obtaining cleaned sample feature words corresponding to each category, wherein the cleaned sample feature words form a feature dictionary.

In an embodiment of the present invention, in the process of performing word segmentation on the sample feature data in the text corpus to obtain the feature dictionary corresponding to each category, word segmentation is performed on the sample feature data corresponding to each category to obtain a plurality of initial sample feature words, together with the word order of each initial sample feature word in the corresponding sample feature data. The initial sample feature words are then combined according to the word order to obtain combined sample feature words; the initial sample feature words and the combined sample feature words together form the feature dictionary. For example, suppose industry category A contains companies A1, A2 and A3. Processing the data of the three companies yields their word segmentation results, denoted as word sets B1, B2 and B3 respectively. The words in B1 are then combined according to word order to obtain word space C1, the words in B2 to obtain word space C2, and the words in B3 to obtain word space C3. The similarity of the phrases in C1, C2 and C3 is calculated (the similarity of any two phrases), and phrases with high similarity are determined to be keywords corresponding to industry category A.

In an embodiment of the present invention, in a process of combining a plurality of initial sample feature words according to word order to obtain a combined sample feature word, initial sample feature words of a plurality of sample objects under the category are combined according to word order for each category, so as to obtain a word space of each sample object, where the word space includes a plurality of to-be-screened combined feature words of the sample object under multiple granularities. And calculating the similarity between the word spaces of the plurality of sample objects for each category, and screening the combined sample feature words from the plurality of combined feature words to be screened in the word space according to the similarity.
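The word-space construction and similarity screening can be sketched as follows. The embodiment does not name a similarity measure, so this example assumes a character-level ratio from Python's standard `difflib.SequenceMatcher`, and it assumes that a combined phrase is kept when every other sample's word space contains a phrase at least `threshold` similar to it. Both the measure and the selection rule are illustrative choices.

```python
from difflib import SequenceMatcher
from typing import List, Sequence, Set


def word_space(words: Sequence[str], max_len: int = 3) -> Set[str]:
    """All contiguous runs of 1..max_len words: one sample at several granularities."""
    space: Set[str] = set()
    for size in range(1, min(max_len, len(words)) + 1):
        for start in range(len(words) - size + 1):
            space.add(" ".join(words[start:start + size]))
    return space


def similarity(left: str, right: str) -> float:
    """Assumed similarity measure: difflib's character-level ratio in [0, 1]."""
    return SequenceMatcher(None, left, right).ratio()


def screen_combined_words(spaces: List[Set[str]], threshold: float = 0.8) -> Set[str]:
    """Keep combined phrases that have a similar phrase in every other word space."""
    if len(spaces) < 2:
        return set()  # nothing to compare against
    kept: Set[str] = set()
    for index, space in enumerate(spaces):
        others = [s for j, s in enumerate(spaces) if j != index]
        for phrase in space:
            if " " not in phrase:
                continue  # single words are not combined feature words
            if all(any(similarity(phrase, q) >= threshold for q in other)
                   for other in others):
                kept.add(phrase)
    return kept


# Word sets B1..B3 of three companies in one category, as in the example above.
b1 = ["computer", "network", "development"]
b2 = ["computer", "network", "service"]
b3 = ["computer", "network", "technology"]
spaces = [word_space(b) for b in (b1, b2, b3)]  # word spaces C1..C3
print(screen_combined_words(spaces, threshold=1.0))  # {'computer network'}
```

With `threshold=1.0` only phrases shared verbatim by all samples survive; lowering the threshold admits near matches as well.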

FIG. 5 is a schematic diagram of a classification model construction method according to an embodiment of the invention. As shown in FIG. 5, the classification model construction method according to the embodiment of the present invention includes:

Step S501: collect sample feature data of sample enterprises, including enterprise names and enterprise business scopes. Among enterprise-related data, the enterprise name and the business scope can reflect the industry characteristics of an enterprise; the sample feature data of a plurality of sample enterprises are collected to form a text corpus. In the embodiment of the invention, more than 5 thousand enterprise names and business scopes can be collected from information disclosed by enterprises to serve as the data source for model training.

Step S502: perform data preprocessing on the sample feature data to remove useless words. The collected enterprise sample feature data (enterprise name and business scope) contain many irrelevant words and words with low feature information content, such as "limited liability company" and "group", as well as words representing regional information in the enterprise name, such as "Beijing". These words carry no industry feature information about the enterprise and can be filtered out as irrelevant words, which improves training accuracy.
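A minimal sketch of this kind of filtering is shown below. The word list and the function name are hypothetical and chosen only to mirror the examples in the text; the sketch also drops repeated words, which the embodiments list as one kind of useless word.

```python
# Hypothetical stop list built from the examples in the description.
USELESS_WORDS = {"limited liability company", "limited company", "group", "Beijing"}


def remove_useless_words(words, useless=USELESS_WORDS):
    """Drop pre-specified useless words and repeated words, keeping the original order."""
    seen = set()
    cleaned = []
    for word in words:
        if word in useless or word in seen:
            continue
        seen.add(word)
        cleaned.append(word)
    return cleaned


segmented_name = ["Beijing", "certain", "technology", "limited company"]
cleaned_name = remove_useless_words(segmented_name)
```

Keeping the stop list as a plain set makes it easy to extend with further company suffixes or region names without touching the filtering logic.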

Step S503: perform data preprocessing on the sample feature data to filter out words whose feature information content is below a threshold. The business scope of many enterprises covers many business services. Some of these services may belong to several different industry fields, such as "sales"; some may belong to similar industries, such as "retail and wholesale"; and some words appear in the same form in different industries, such as "services". Such words cause considerable interference when training the classification model and can make the classification prediction ambiguous. They are words with low feature information content and are removed during cleaning.
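The text does not define how feature information content is measured. One plausible reading, sketched below purely as an assumption, is to treat a word as low-information when it occurs in a large share of the categories. The function name, the `max_category_ratio` parameter and the toy corpus are hypothetical.

```python
from collections import defaultdict


def low_information_words(corpus, max_category_ratio=0.5):
    """Return words that occur in more than max_category_ratio of all categories.

    corpus maps a category name to a list of segmented documents (lists of words).
    """
    categories_per_word = defaultdict(set)
    for category, documents in corpus.items():
        for words in documents:
            for word in words:
                categories_per_word[word].add(category)
    total = len(corpus)
    return {
        word
        for word, categories in categories_per_word.items()
        if len(categories) / total > max_category_ratio
    }


# Toy corpus (illustrative only): category -> segmented business scopes.
toy_corpus = {
    "software": [["software", "development", "services"], ["software", "sales"]],
    "catering": [["catering", "services"], ["food", "sales"]],
    "construction": [["construction", "engineering", "services"]],
}
to_remove = low_information_words(toy_corpus)
```

Here "services" and "sales" are flagged because they spread across most categories, which matches the kind of words the step describes as interfering with training.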

Step S504: perform Chinese word segmentation on the preprocessed data. A text classification algorithm requires the text to be segmented into words. The jieba word segmentation tool is used to segment the filtered data in different modes and at different precisions to form a new corpus. For example, the enterprise name "Beijing certain technology limited company" can be segmented into words such as "Beijing", "certain", "technology" and "limited company".

Step S505: establish a word vector model from the result of the data processing. In the embodiment of the invention, the word vector training mode of the fastText text classification algorithm is used: features of the corpus are extracted by training word vectors on the corpus, a word vector model of the corpus is established with the fastText algorithm, texts and words are mapped to vector form, and the industry characteristics of the corpus information are mined.

Step S506: extract feature information of the training samples according to the established word vector model. In the embodiment of the invention, analysis of the word vectors shows that word segmentations at different precisions (that is, the same sentence segmented into words of different granularities) present different degrees of similarity association within the same industry. For example, "computer network" is often split into "computer" and "network"; the word vector model shows that, after the split, the feature information carried by the word vectors of the separate words is reduced compared with that of the original combined word, so the similarity decreases.
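The description does not name the similarity measure applied to the word vectors. Cosine similarity is a common choice for comparing word vectors and is assumed in the following sketch; the function name is hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)
```

Under this assumption, comparing the vector of a combined word such as "computer network" with the vectors of other words in the same industry, and doing the same for the separately split words, would expose the drop in similarity that the step describes.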

Step S507: establish a keyword dictionary library according to the extracted feature information of the training samples. Based on the above results, single keywords are combined using the similarity measure of the word vector model to establish a dictionary of combined keywords (rather than single keywords) that distinguishes industry feature information more strongly, thereby providing more accurate industry feature information for model training.

Step S508: train the classification model based on the established keyword dictionary library and a classification algorithm. The classification algorithm can be the text classification algorithm fastText, which is fast and accurate. The industry classification model is trained in the supervised mode of the fastText algorithm based on the obtained feature information and the corresponding industry label information.
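In its supervised mode, fastText reads one training example per line, with each label marked by a prefix (`__label__` by default) followed by the whitespace-separated tokens. The sketch below only prepares such lines; the helper names and the sample data are hypothetical, and joining the words of a combined keyword with an underscore is an assumption made so that each combined keyword stays a single token.

```python
def to_fasttext_line(label, feature_words, prefix="__label__"):
    """Format one labelled sample as a line of fastText supervised training input."""
    safe_label = label.strip().replace(" ", "_")
    # Assumption: a multi-word combined keyword is kept as one token.
    tokens = [word.strip().replace(" ", "_") for word in feature_words]
    return f"{prefix}{safe_label} {' '.join(tokens)}"


def build_training_lines(samples):
    """samples: iterable of (industry_label, feature_words) pairs."""
    return [to_fasttext_line(label, words) for label, words in samples]


# Hypothetical labelled samples.
training_lines = build_training_lines([
    ("software", ["computer network", "development"]),
    ("catering", ["food", "catering"]),
])

# The lines could then be written to a text file and passed to the fastText
# Python module, for example fasttext.train_supervised(input=path); that call
# is not executed in this sketch.
```

Separating the formatting step from training keeps the data preparation testable without installing fastText.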

FIG. 6 is a schematic diagram of the main modules of an apparatus for object classification according to an embodiment of the invention. As shown in FIG. 6, the apparatus 600 for object classification according to the embodiment of the present invention includes an initial feature data acquisition module 601, a feature word set acquisition module 602, and a target category determination module 603.

The initial feature data obtaining module 601 is configured to obtain initial feature data of an object to be classified, where the initial feature data includes identification information data and attribute information data of the object to be classified.

The feature word set obtaining module 602 is configured to perform word segmentation on the identification information data and the attribute information data to obtain a feature word set, where the feature word set includes at least one feature word.

The target category determination module 603 is configured to perform vector representation on feature words in the feature word set, and determine a target category to which the object to be classified belongs based on the trained classification model.
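How the three modules hand data to one another can be sketched as follows. This is an illustrative outline only: the record keys, the function names, the whitespace segmenter and the toy classifier are hypothetical stand-ins, and the classifier stand-in also takes the place of the vector representation and the trained classification model.

```python
from collections import namedtuple

InitialFeatureData = namedtuple("InitialFeatureData", ["identification", "attributes"])


def acquire_initial_feature_data(record):
    """Module 601: extract identification and attribute information from a raw record."""
    return InitialFeatureData(record["name"], record["scope"])


def acquire_feature_word_set(data, segment):
    """Module 602: segment both fields and merge the results into one feature word set."""
    return segment(data.identification) + segment(data.attributes)


def determine_target_category(feature_words, classify):
    """Module 603: pass the feature words to a classifier and return the target category."""
    return classify(feature_words)


def whitespace_segment(text):
    # Stand-in for a real word segmentation tool.
    return text.split()


def toy_classifier(words):
    # Stand-in for vectorisation plus a trained classification model.
    return "software" if "software" in words else "other"


record = {"name": "Example Software", "scope": "software development"}
data = acquire_initial_feature_data(record)
words = acquire_feature_word_set(data, whitespace_segment)
category = determine_target_category(words, toy_classifier)
```

Passing the segmenter and the classifier in as arguments keeps the module boundaries of FIG. 6 visible and lets each part be replaced independently.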

In an embodiment of the present invention, the feature word set acquisition module is further configured to: perform word segmentation on the identification information data and the attribute information data in the initial feature data to obtain a plurality of feature words to be cleaned; and clean the plurality of feature words to be cleaned to obtain cleaned feature words, wherein the cleaned feature words form the feature word set.

More preferably, in an embodiment of the present invention, the feature word set acquisition module is further configured to: remove useless words from the plurality of feature words to be cleaned to obtain first processed feature words, wherein the useless words comprise at least one of the following: prepositions, adverbs, repeated words, and pre-specified words; and/or remove, according to the historical word frequency of the plurality of feature words to be cleaned, words whose feature information content is below a threshold from the plurality of feature words to be cleaned to obtain second processed feature words. The cleaned feature words comprise the first processed feature words and/or the second processed feature words.

More preferably, in an embodiment of the present invention, the feature word set acquisition module is further configured to: perform word segmentation on the identification information data and the attribute information data in the initial feature data to obtain a plurality of initial feature words and the word order of each initial feature word in the corresponding identification information data or attribute information data; and combine the plurality of initial feature words according to the word order to obtain combined feature words, wherein the plurality of initial feature words and the combined feature words form the plurality of feature words to be cleaned.

In an embodiment of the present invention, the initial feature data acquisition module is further configured to: obtain initial feature data of an object to be classified, wherein the object to be classified is an enterprise to be classified, and the initial feature data includes information data indicating the industry to which the enterprise to be classified belongs; and extract the identification information data and the attribute information data from the initial feature data, wherein the identification information data is the enterprise name and the attribute information data is the business scope.

FIG. 7 is a schematic diagram of the main modules of a classification model construction apparatus according to an embodiment of the invention. As shown in FIG. 7, the classification model construction apparatus 700 according to the embodiment of the present invention includes a text corpus acquisition module 701, a feature dictionary determination module 702, and a training module 703.

The text corpus obtaining module 701 is configured to obtain sample feature data of a plurality of sample objects marked to belong to a category, so as to obtain a text corpus; the sample characteristic data comprises identification information data and attribute information data of the sample object.

The feature dictionary determining module 702 is configured to perform word segmentation on sample feature data in the text corpus to obtain feature dictionaries corresponding to each category.

The training module 703 is configured to train to obtain a classification model based on the feature dictionary and the classification algorithm corresponding to each category.

In an embodiment of the present invention, the feature dictionary determination module is further configured to: perform word segmentation on the sample feature data in the text corpus to obtain a plurality of sample feature words to be cleaned corresponding to each category; remove useless words from the plurality of sample feature words to be cleaned, wherein the useless words comprise at least one of the following: prepositions, adverbs, repeated words, and pre-specified words; and/or remove, according to the word frequencies of the sample feature words to be cleaned, words whose feature information content is below a threshold from the sample feature words to be cleaned; and obtain cleaned sample feature words corresponding to each category, wherein the cleaned sample feature words form the feature dictionary.

More preferably, in an embodiment of the present invention, the feature dictionary determination module is further configured to: perform word segmentation on the sample feature data corresponding to each category in the text corpus to obtain a plurality of initial sample feature words and the word order of each initial sample feature word in the corresponding sample feature data; and combine the plurality of initial sample feature words according to the word order to obtain combined sample feature words, wherein the plurality of initial sample feature words and the combined sample feature words form the feature dictionary.

In an embodiment of the present invention, the feature dictionary determination module is further configured to: for each category, combine the initial sample feature words of a plurality of sample objects under the category according to the word order to obtain a word space of each sample object, wherein the word space includes a plurality of combined feature words to be screened of the sample object at multiple granularities; and, for each category, calculate the similarity between the word spaces of the plurality of sample objects and screen the combined sample feature words out of the plurality of combined feature words to be screened in the word spaces according to the similarity.

Fig. 8 illustrates an exemplary system architecture 800 to which an object classification and classification model construction method or object classification and classification model construction apparatus of an embodiment of the invention may be applied.

As shown in fig. 8, a system architecture 800 may include terminal devices 801, 802, 803, a network 804, and a server 805. The network 804 serves as a medium for providing communication links between the terminal devices 801, 802, 803 and the server 805. The network 804 may include various connection types, such as wired, wireless communication links, or fiber optic cables, among others.

A user may interact with the server 805 through the network 804 using the terminal devices 801, 802, 803 to receive or send messages or the like. Various communication client applications such as shopping class applications, web browser applications, search class applications, instant messaging tools, mailbox clients, social platform software, etc. (by way of example only) may be installed on the terminal devices 801, 802, 803.

The terminal devices 801, 802, 803 may be a variety of electronic devices having a display screen and supporting web browsing, including but not limited to smartphones, tablets, laptop and desktop computers, and the like.

The server 805 may be a server providing various services, such as a background management server (by way of example only) that provides support for shopping websites browsed by users using the terminal devices 801, 802, 803. The background management server can analyze and otherwise process data such as a received product information query request, and feed back the processing result to the terminal device.

It should be noted that, the object classification and classification model construction method provided in the embodiment of the present invention is generally executed by the server 805, and accordingly, the object classification and classification model construction apparatus is generally disposed in the server 805.

It should be understood that the number of terminal devices, networks and servers in fig. 8 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.

Referring now to FIG. 9, there is illustrated a schematic diagram of a computer system 900 suitable for use in implementing an embodiment of the present invention. The terminal device shown in fig. 9 is only an example, and should not impose any limitation on the functions and the scope of use of the embodiment of the present invention.

As shown in fig. 9, the computer system 900 includes a Central Processing Unit (CPU) 901, which can execute various appropriate actions and processes according to a program stored in a Read Only Memory (ROM) 902 or a program loaded from a storage section 908 into a Random Access Memory (RAM) 903. In the RAM 903, various programs and data necessary for the operation of the system 900 are also stored. The CPU 901, ROM 902, and RAM 903 are connected to each other through a bus 904. An input/output (I/O) interface 905 is also connected to the bus 904.

The following components are connected to the I/O interface 905: an input section 906 including a keyboard, a mouse, and the like; an output section 907 including a display such as a Cathode Ray Tube (CRT) or a Liquid Crystal Display (LCD), a speaker, and the like; a storage section 908 including a hard disk or the like; and a communication section 909 including a network interface card such as a LAN card, a modem, or the like. The communication section 909 performs communication processing via a network such as the Internet. A drive 910 is also connected to the I/O interface 905 as needed. A removable medium 911, such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory, is mounted on the drive 910 as needed, so that a computer program read from it can be installed into the storage section 908 as needed.

In particular, according to embodiments of the present disclosure, the processes described above with reference to flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method shown in the flow chart. In such an embodiment, the computer program may be downloaded and installed from the network via the communication portion 909 and/or installed from the removable medium 911. The above-described functions defined in the system of the present invention are performed when the computer program is executed by a Central Processing Unit (CPU) 901.

The computer readable medium shown in the present invention may be a computer readable signal medium or a computer readable storage medium, or any combination of the two. The computer readable storage medium can be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples of the computer-readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In the present invention, however, the computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, with the computer-readable program code embodied therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing.

The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.

The modules involved in the embodiments of the present invention may be implemented in software or in hardware. The described modules may also be provided in a processor, which may be described, for example, as: a processor comprising an initial feature data acquisition module, a feature word set acquisition module and a target category determination module. In some cases the names of these modules do not limit the modules themselves; for example, the initial feature data acquisition module may also be described as "a module that acquires initial feature data of an object to be classified".

As another aspect, the present invention also provides a computer-readable medium, which may be contained in the apparatus described in the above embodiments, or may exist alone without being assembled into the apparatus. The computer-readable medium carries one or more programs which, when executed by a device, cause the device to perform the following: acquiring initial feature data of an object to be classified, wherein the initial feature data comprises identification information data and attribute information data of the object to be classified; performing word segmentation on the identification information data and the attribute information data to obtain a feature word set, wherein the feature word set comprises at least one feature word; and performing vector representation on the feature words in the feature word set, and determining the target category to which the object to be classified belongs based on the trained classification model.

The above embodiments do not limit the scope of the present invention. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives can occur depending upon design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present invention should be included in the scope of the present invention.

Claims

1. A method of object classification, comprising:

acquiring initial feature data of an object to be classified, wherein the initial feature data comprises identification information data and attribute information data of the object to be classified;

performing word segmentation on the identification information data and the attribute information data to obtain a feature word set, wherein the feature word set comprises at least one feature word;

performing vector representation on the feature words in the feature word set, and determining the target category to which the object to be classified belongs based on a trained classification model;

wherein the step of performing word segmentation on the identification information data and the attribute information data to obtain a feature word set comprises:

performing word segmentation on the identification information data and the attribute information data in the initial feature data to obtain a plurality of feature words to be cleaned;

cleaning the plurality of feature words to be cleaned to obtain cleaned feature words, wherein the cleaned feature words form the feature word set;

wherein the step of performing word segmentation on the identification information data and the attribute information data in the initial feature data to obtain a plurality of feature words to be cleaned comprises:

performing word segmentation on the identification information data and the attribute information data in the initial feature data to obtain a plurality of initial feature words and the word order of each initial feature word in the corresponding identification information data or attribute information data;

combining the plurality of initial feature words according to the word order to obtain combined feature words, wherein the plurality of initial feature words and the combined feature words form the plurality of feature words to be cleaned.

2. The method of claim 1, wherein the step of performing a cleaning process on the plurality of feature words to be cleaned to obtain cleaned feature words comprises:

removing useless words from the plurality of feature words to be cleaned to obtain first processed feature words, wherein the useless words comprise at least one of the following: prepositions, adverbs, repeated words, and pre-specified words; and/or

removing, according to the historical word frequency of the plurality of feature words to be cleaned, words whose feature information content is below a threshold from the plurality of feature words to be cleaned to obtain second processed feature words;

wherein the cleaned feature words comprise the first processed feature words and/or the second processed feature words.

3. The method according to claim 1, wherein the step of acquiring initial feature data of the object to be classified, the initial feature data including identification information data and attribute information data of the object to be classified, includes:

acquiring initial feature data of an object to be classified, wherein the object to be classified is an enterprise to be classified, and the initial feature data comprises information data indicating the industry to which the enterprise to be classified belongs;

extracting the identification information data and the attribute information data from the initial feature data, wherein the identification information data is the enterprise name and the attribute information data is the business scope.

4. A classification model construction method, comprising:

obtaining sample feature data of a plurality of sample objects each marked with the category to which it belongs, to obtain a text corpus, wherein the sample feature data comprises identification information data and attribute information data of the sample object;

performing word segmentation on the sample feature data in the text corpus to obtain a feature dictionary corresponding to each category;

training a classification model based on the feature dictionary corresponding to each category and a classification algorithm;

wherein the step of performing word segmentation on the sample feature data in the text corpus to obtain a feature dictionary corresponding to each category comprises:

performing word segmentation on the sample feature data corresponding to each category in the text corpus to obtain a plurality of initial sample feature words and the word order of each initial sample feature word in the corresponding sample feature data;

combining the plurality of initial sample feature words according to the word order to obtain combined sample feature words, wherein the plurality of initial sample feature words and the combined sample feature words form the feature dictionary;

wherein the step of combining the plurality of initial sample feature words according to the word order to obtain combined sample feature words comprises:

for each category, combining the initial sample feature words of a plurality of sample objects under the category according to the word order to obtain a word space of each sample object, wherein the word space comprises a plurality of combined feature words to be screened of the sample object at multiple granularities;

for each category, calculating the similarity between the word spaces of the plurality of sample objects, and screening the combined sample feature words out of the plurality of combined feature words to be screened in the word spaces according to the similarity.

5. The method of claim 4, wherein the step of word segmentation of the sample feature data in the text corpus to obtain feature dictionaries corresponding to respective categories comprises:

performing word segmentation on the sample feature data in the text corpus to obtain a plurality of sample feature words to be cleaned corresponding to each category;

removing useless words from the plurality of sample feature words to be cleaned, wherein the useless words comprise at least one of the following: prepositions, adverbs, repeated words, and pre-specified words; and/or removing, according to the word frequencies of the sample feature words to be cleaned, words whose feature information content is below a threshold from the sample feature words to be cleaned;

obtaining cleaned sample feature words corresponding to each category, wherein the cleaned sample feature words form the feature dictionary.

6. An apparatus for classifying objects, comprising:

the target category determining module is used for carrying out vector representation on the feature words in the feature word set and determining the target category to which the object to be classified belongs based on a trained classification model;

The feature word set acquisition module is further configured to:

7. A classification model construction apparatus, comprising:

The training module is used for training to obtain a classification model based on the feature dictionary corresponding to each category and the classification algorithm;

wherein, the feature dictionary determining module is further configured to:

8. An electronic device, comprising:

one or more processors;

storage means for storing one or more programs,

wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of any one of claims 1-3 or 4-5.

9. A computer readable medium, on which a computer program is stored, characterized in that the program, when being executed by a processor, implements the method according to any of claims 1-3 or 4-5.