CN110659655A

CN110659655A - Index classification method and device and computer readable storage medium

Info

Publication number: CN110659655A
Application number: CN201810691299.8A
Authority: CN
Inventors: 姚冬阳
Original assignee: Beijing Sankuai Online Technology Co Ltd
Current assignee: Beijing Sankuai Online Technology Co Ltd
Priority date: 2018-06-28
Filing date: 2018-06-28
Publication date: 2020-01-07
Anticipated expiration: 2038-06-28
Also published as: CN110659655B

Abstract

The application provides an index classification method and device, a computer readable storage medium and an electronic device. The method comprises the following steps: acquiring a current index classification request; and classifying the current index carried in the current index classification request by using an index classification model, so as to classify the current index into a corresponding classification subject. The index classification model is determined based on the number of occurrences of each participle of the labeled indexes of the pre-established themes under different themes, the number of indexes under each theme, and the number of themes in which each participle appears. Because the current index carried in the acquired request is classified by the index classification model into a corresponding classification subject, indexes can be classified automatically, professionals no longer need to maintain the index classification over the long term, labor cost is saved, and the use threshold for users is lowered.

Description

Index classification method and device and computer readable storage medium

Technical Field

The present disclosure relates to artificial intelligence technologies, and in particular, to an index classification method and apparatus, a computer-readable storage medium, and an electronic device.

Background

Business Intelligence (BI for short) is a complete solution for effectively integrating existing data in an enterprise, rapidly and accurately providing reports and providing decision basis to help the enterprise make intelligent Business operation decisions.

Currently, business index classification in BI relies mainly on manual processing by experienced personnel who are familiar with the business. Relying on manual processing has two problems. On one hand, as the number of business indexes keeps growing, a large number of unclassified indexes accumulate in the data warehouse, and business personnel must process and maintain them one by one, which consumes labor. On the other hand, when using a BI analysis tool, a data analyst often introduces a new data source, that is, adds new business indexes, and having to classify these new indexes raises the use threshold of the tool. Therefore, an automatic index classification method is urgently needed.

Disclosure of Invention

In view of the above, the present application provides an index classification method and apparatus, a computer-readable storage medium, and an electronic device.

Specifically, the method is realized through the following technical scheme:

according to a first aspect of the embodiments of the present disclosure, there is provided an index classification method, the method including:

acquiring a current index classification request;

classifying the current indexes carried in the current index classification request by using an index classification model so as to classify the current indexes into corresponding classification subjects;

the index classification model is determined based on the number of occurrences of each participle of the labeled indexes of the pre-established themes under different themes, the number of indexes under each theme, and the number of themes in which each participle appears.

In an embodiment, the method further comprises:

acquiring the marked indexes of the pre-established topics from a data warehouse through a plurality of computing nodes;

determining the index classification model through the plurality of computing nodes based on the obtained occurrence frequency of each participle of the marked index under different subjects, the obtained index number of the different subjects and the obtained number of subjects in which each participle appears, and caching the index classification model.

In an embodiment, the determining the index classification model based on the obtained number of occurrences of each participle of the labeled index under different topics, the index number of the different topics, and the number of topics of each participle, includes:

performing word segmentation on each obtained labeled index;

counting the occurrence times of each participle under different topics;

calculating the score of each participle under different topics based on the occurrence frequency of each participle under different topics, the index number of the different topics and the number of topics of each participle;

and determining the score of each participle under different topics as the index classification model.

In an embodiment, the classifying, by using an index classification model, a current index carried in the current index classification request to classify the current index into a corresponding classification subject includes:

obtaining the voting result of each computing node on the classification subject of the current index and the total score of the current index under the corresponding classification subject according to the classification request of the current index;

if the number of the classification subjects with the largest number of votes is one, taking the classification subject with the largest number of votes as the classification subject of the current index;

and if the number of the classification subjects with the largest number of votes is more than one, taking the classification subject with the largest number of votes and the largest sum of the total scores as the classification subject of the current index.

In an embodiment, the obtaining, according to the current index categorization request, a voting result of each computing node for the categorization topic of the current index and a total score of the current index under the corresponding categorization topic includes:

performing word segmentation on the current index;

inquiring the index classification model of each computing node cache according to the current participle corresponding to the current index to obtain the score of each current participle cached by each computing node under each theme;

performing optimization operation on the score of each current participle cached by each computing node under each theme to obtain the total score of the current index calculated by each computing node under each theme;

and taking the topic with the maximum total score calculated by each computing node as the classification topic of the current index obtained by the corresponding computing node.

In an embodiment, the performing an optimization operation on the score of each current participle cached by each computing node under each topic to obtain a total score of the current index calculated by each computing node under each topic includes:

taking logarithm operation on the fraction of each current participle cached by each computing node under each theme to obtain an operation result;

and summing the operation results to obtain the total score of the current index calculated by each computing node under each theme.

In one embodiment, the obtaining, by the plurality of computing nodes, the labeled indicator of the pre-established topic from the data warehouse includes:

obtaining the annotated indices equally from under each pre-established topic of the data warehouse through a plurality of computing nodes.

In an embodiment, the current index includes a real-time newly added index or an offline import index.

According to a second aspect of the embodiments of the present disclosure, there is provided an index classification apparatus including:

the acquisition module is used for acquiring a current index classification request;

the classification module is used for classifying the current index carried in the current index classification request acquired by the acquisition module by using an index classification model so as to classify the current index into a corresponding classification subject;

According to a third aspect of the embodiments of the present disclosure, there is provided a computer-readable storage medium storing a computer program for executing the above-described index classification method.

According to a fourth aspect of the embodiments of the present disclosure, there is provided an electronic device, including a processor, a memory, and a computer program stored on the memory and executable on the processor, where the processor implements the above index classification method when executing the computer program.

According to the method and the device, the current indexes carried in the obtained current index classification request are classified by using the index classification model, the current indexes can be classified into corresponding classification subjects, automatic classification of the indexes can be achieved, long-term maintenance of index classification by professionals is not needed, labor cost is saved, and the use threshold of a user is reduced.

It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.

Drawings

The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the invention and together with the description, serve to explain the principles of the invention.

FIG. 1 is a flow chart illustrating a method of index categorization according to an exemplary embodiment of the present application;

FIG. 2 is a flow chart illustrating a process for determining an index classification model according to an exemplary embodiment of the present application;

FIG. 3 is a flow diagram illustrating the determination of an index classification model by each compute node according to an exemplary embodiment of the present application;

FIG. 4 is a flow chart illustrating another method of index categorization, according to an exemplary embodiment of the present application;

FIG. 5 is a flowchart illustrating an exemplary embodiment of the present application for obtaining the voting result of each computing node on the classification subject of the current index and the total score of the current index under the corresponding classification subject;

FIG. 6 is a hardware configuration diagram of an electronic device in which the index classifying apparatus of the present application is installed;

FIG. 7 is a block diagram of an index classifying apparatus according to an exemplary embodiment of the present application.

Detailed Description

Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The embodiments described in the following exemplary embodiments do not represent all embodiments consistent with the present application. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present application, as detailed in the appended claims.

The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application. As used in this application and the appended claims, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any and all possible combinations of one or more of the associated listed items.

It is to be understood that although the terms first, second, third, etc. may be used herein to describe various information, such information should not be limited to these terms. These terms are only used to distinguish one type of information from another. For example, first information may also be referred to as second information, and similarly, second information may also be referred to as first information, without departing from the scope of the present application. The word "if" as used herein may be interpreted as "upon", "when", or "in response to a determination", depending on the context.

Fig. 1 is a flowchart illustrating an index classification method according to an exemplary embodiment of the present application, where the method includes:

Step S101, acquiring a current index classification request.

When a user triggers a classification request for the current index at the front end of the service platform, the service platform can acquire the classification request for the current index.

Because the current index may include the real-time newly added index or the offline import index, the current index classification request may include the current real-time newly added index classification request or the current offline import index classification request.

Step S102, classifying the current index carried in the current index classification request by using the index classification model, so as to classify the current index into a corresponding classification subject.

The index classification model is determined based on the number of occurrences of each participle of the labeled indexes of the pre-established themes under different themes, the number of indexes under each theme, and the number of themes in which each participle appears.

In this embodiment, business personnel may pre-establish topics and label a small number of business indexes under each topic. A business platform, such as a BI report platform, can then establish an index classification model by using the labeled indexes as training data. Thereafter, the index classification model may be updated periodically by using all labeled indexes as training data.

Specifically, the service platform may establish or update the index classification model based on the number of occurrences of each participle of the labeled indexes under different topics, the number of indexes under each topic, and the number of topics in which each participle appears.

Since the index classification model is already determined in this embodiment, the index classification model may be used to classify the current index carried in the current index classification request, so as to classify the current index into the corresponding classification subject. For example, the current real-time newly added indexes are classified into a certain theme, and a large number of offline imported indexes are classified into corresponding themes in batch.

Therefore, according to the embodiment, the index classification is not required to be maintained for a long time by professionals, and only a small number of historical indexes are required to be accurately marked by the professionals, so that the automatic classification can be realized for the newly added indexes and the historical indexes in real time.

According to the embodiment, the current indexes carried in the obtained current index classification request are classified by using the index classification model, the current indexes can be classified into the corresponding classification subjects, the automatic classification of the indexes can be realized, the long-term maintenance of index classification by professionals is not needed, the labor cost is saved, and the use threshold of a user is reduced.

FIG. 2 is a flowchart illustrating a process of determining an index classification model according to an exemplary embodiment of the present application, as shown in FIG. 2, the process of determining includes:

Step S201, obtaining the labeled indexes of the pre-established subjects from the data warehouse through a plurality of computing nodes.

In order to ensure the online updating frequency of the index classification model under the condition of mass labeled indexes, the embodiment obtains the labeled indexes of the pre-established subjects from the data warehouse through a plurality of computing nodes.

Wherein the annotated index may be obtained equally from under each pre-established topic of the data warehouse by a plurality of computing nodes.

To maximize parallelism, a Single Instruction Multiple Data (SIMD) mode may be used. For example, each computing node may pull an equal share of the indexes under each topic (load balancing) by selecting indexes according to its own node identification (ID) under a modulo (mod) operation; if the number of indexes under a topic is less than the number of nodes, each node pulls all of the indexes under that topic.

For example, if the plurality of compute nodes are nodes 1-3, subject 1 includes indices 1-6, and subject 2 includes index 7, then compute node 1 may pull index 1 and index 4 for subject 1 and index 7 for subject 2 from the data warehouse, compute node 2 may pull index 2 and index 5 for subject 1 and index 7 for subject 2 from the data warehouse, and compute node 3 may pull index 3 and index 6 for subject 1 and index 7 for subject 2 from the data warehouse.
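
The pulling rule in the example above can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function name `pull_for_node`, the dictionary layout of the warehouse, and the choice to compare the index position (rather than some index identifier) against the node ID modulo the node count are all hypothetical.

```python
from typing import Dict, List


def pull_for_node(node_id: int, num_nodes: int,
                  topics: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Return the labeled indexes one computing node pulls from each topic.

    node_id is 1-based, matching the node numbering in the example above.
    """
    shard: Dict[str, List[str]] = {}
    for topic, indexes in topics.items():
        if len(indexes) < num_nodes:
            # Fewer indexes than nodes: every node pulls all of them.
            shard[topic] = list(indexes)
        else:
            # Assumption: an index belongs to the node whose ID matches
            # the index position modulo the number of nodes.
            shard[topic] = [idx for pos, idx in enumerate(indexes)
                            if pos % num_nodes == (node_id - 1) % num_nodes]
    return shard


# Hypothetical in-memory stand-in for the data warehouse of the example.
warehouse = {
    "topic 1": ["index 1", "index 2", "index 3",
                "index 4", "index 5", "index 6"],
    "topic 2": ["index 7"],
}

for node in (1, 2, 3):
    print(node, pull_for_node(node, 3, warehouse))
```

With three nodes, node 1 receives indexes 1 and 4 of topic 1, node 2 receives indexes 2 and 5, node 3 receives indexes 3 and 6, and every node receives index 7 of topic 2.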

Step S202, determining an index classification model based on the obtained occurrence frequency of each participle of the labeled index under different subjects, the index number of different subjects and the number of subjects in which each participle appears through a plurality of computing nodes, and caching the index classification model.

As shown in FIG. 3, the process of determining the index classification model by each compute node may include:

Step S2021, performing word segmentation on each of the obtained labeled indexes.

Step S2022, count the number of occurrences of each participle under different topics.

To clearly describe the process of determining the index classification model for each compute node, the following description is made in conjunction with specific examples.

For example, suppose the topics acquired by the current computing node are topic 1 and topic 2, where topic 1 includes indexes 1-3 and topic 2 includes index 4. Index 1 includes participle 1 and participle 2, index 2 includes participle 1 and participle 3, index 3 includes participle 4, and index 4 includes participle 1 and participle 2. Then the number of occurrences of participle 1 is 2 under topic 1 and 1 under topic 2, the number of occurrences of participle 2 is 1 under topic 1 and 1 under topic 2, the number of occurrences of participle 3 under topic 1 is 1, and the number of occurrences of participle 4 under topic 1 is 1.
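
The counting in step S2022 can be sketched as below for the example just given. The function name `count_participles` and the nested-list data layout are hypothetical, word segmentation is assumed to have already been performed, and a participle repeated inside a single index would be counted once per repetition, which is an assumption the text does not address.

```python
from collections import Counter
from typing import Dict, List


def count_participles(labeled: Dict[str, List[List[str]]]) -> Dict[str, Counter]:
    """Count how often each participle occurs under each topic.

    `labeled` maps a topic to its indexes, each index being the list of
    participles produced by word segmentation.
    """
    counts: Dict[str, Counter] = {}
    for topic, indexes in labeled.items():
        counter: Counter = Counter()
        for participles in indexes:
            counter.update(participles)
        counts[topic] = counter
    return counts


# The example above: topic 1 holds indexes 1-3, topic 2 holds index 4.
labeled = {
    "topic 1": [["participle 1", "participle 2"],
                ["participle 1", "participle 3"],
                ["participle 4"]],
    "topic 2": [["participle 1", "participle 2"]],
}

counts = count_participles(labeled)
print(counts["topic 1"]["participle 1"])  # occurrences of participle 1 under topic 1
```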

Step S2023, calculating the score of each participle under different themes based on the occurrence frequency of each participle under different themes, the index number of different themes and the number of themes in which each participle appears.

Wherein, the score of each participle under different topics can be: the occurrence frequency of each participle under different topics is divided by the index number of different topics, and then divided by the number of topics in which the corresponding participle appears.
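
Written as a formula, the score described above may be expressed as follows. The symbols are notation introduced here for illustration only and do not appear in the original text: c(w, t) is the number of occurrences of participle w under topic t, N_t is the number of indexes under topic t, and T_w is the number of topics in which participle w appears.

```latex
\mathrm{score}(w, t) = \frac{c(w, t)}{N_t \cdot T_w}
```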

Continuing the description of the above example, the index number of the topic 1 is 3, the index number of the topic 2 is 1, the number of topics appearing in the participle 1 is 2, the number of topics appearing in the participle 2 is 2, the number of topics appearing in the participle 3 is 1, and the number of topics appearing in the participle 4 is 1.

Thus, the score of participle 1 under topic 1 is 2/3/2 = 0.33, the score of participle 2 under topic 1 is 1/3/2 = 0.17, the score of participle 3 under topic 1 is 1/3/1 = 0.33, the score of participle 4 under topic 1 is 1/3/1 = 0.33, the score of participle 1 under topic 2 is 1/1/2 = 0.5, and the score of participle 2 under topic 2 is 1/1/2 = 0.5.
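
The score calculation of step S2023 can be checked with a short sketch that reproduces the figures above. The function name `build_model` and the plain-dictionary model layout are hypothetical; the arithmetic follows the rule stated in the text (occurrences divided by the number of indexes under the topic, then divided by the number of topics in which the participle appears).

```python
from typing import Dict


def build_model(counts: Dict[str, Dict[str, int]],
                index_numbers: Dict[str, int]) -> Dict[str, Dict[str, float]]:
    """Compute the score of each participle under each topic."""
    # Number of topics in which each participle appears at least once.
    topics_per_participle: Dict[str, int] = {}
    for per_topic in counts.values():
        for participle, n in per_topic.items():
            if n > 0:
                topics_per_participle[participle] = (
                    topics_per_participle.get(participle, 0) + 1)

    model: Dict[str, Dict[str, float]] = {}
    for topic, per_topic in counts.items():
        model[topic] = {
            participle: n / index_numbers[topic] / topics_per_participle[participle]
            for participle, n in per_topic.items() if n > 0
        }
    return model


# Occurrence counts and index numbers taken from the example above.
counts = {
    "topic 1": {"participle 1": 2, "participle 2": 1,
                "participle 3": 1, "participle 4": 1},
    "topic 2": {"participle 1": 1, "participle 2": 1},
}
index_numbers = {"topic 1": 3, "topic 2": 1}

model = build_model(counts, index_numbers)
for topic, scores in model.items():
    print(topic, {p: round(s, 2) for p, s in scores.items()})
```

The resulting dictionary is what a computing node would cache as its index classification model in this sketch.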

Step S2024, determining the scores of each participle under different topics as an index classification model.

In this embodiment, the score of each participle under different topics is determined as an index classification model.

After the index classification model is determined, the index classification model is cached, so that the classification time complexity is irrelevant to the index number and is only relevant to the theme number, and the theme number is far smaller than the index number, so that the classification time of the index can be effectively reduced, namely the classification efficiency of the index is improved.

In the embodiment, the marked indexes of the pre-established topics are acquired from the data warehouse through the plurality of computing nodes, the index classification model is determined through the plurality of computing nodes based on the occurrence frequency of each participle of the marked indexes under different topics, the index number of different topics and the number of topics in which each participle appears, and the index classification model is cached, so that the determination efficiency of the index classification model can be improved, and conditions are provided for the subsequent improvement of the automatic index classification efficiency.

Fig. 4 is a flowchart illustrating another index classification method according to an exemplary embodiment of the present application, where as shown in fig. 4, the index classification method includes:

Step S401, obtaining the labeled indexes of the pre-established subjects from the data warehouse through a plurality of computing nodes.

Step S402, determining an index classification model through the plurality of computing nodes based on the number of occurrences of each participle of the obtained labeled indexes under different subjects, the number of indexes under each subject, and the number of subjects in which each participle appears, and caching the index classification model.

Step S403, acquiring a current index classification request.

Step S404, according to the current index classification request, obtaining the voting result of each computing node on the classification subject of the current index and the total score of the current index under the corresponding classification subject.

As shown in fig. 5, obtaining the voting result of each computing node on the classification topic of the current index and the total score of the current index under the corresponding classification topic may include:

Step S4041, performing word segmentation on the current index.

Assume that the participles corresponding to the current index are participle 1 and participle 2.

Step S4042, the index classification model of each computing node cache is inquired according to the current participle corresponding to the current index, and the score of each current participle cached by each computing node under each theme is obtained.

Assume that the index classification model cached by computing node 1 is as follows: the score of participle 1 under topic 1 is 2/3/2 = 0.33, the score of participle 2 under topic 1 is 1/3/2 = 0.17, the score of participle 3 under topic 1 is 1/3/1 = 0.33, the score of participle 4 under topic 1 is 1/3/1 = 0.33, the score of participle 1 under topic 2 is 1/1/2 = 0.5, and the score of participle 2 under topic 2 is 1/1/2 = 0.5.

Step S4043, performing optimization operation on the score of each current participle under each topic to obtain the total score of the current index calculated by each computing node under each topic.

In this embodiment, the scores of each current participle cached by each computing node under each topic are subjected to logarithmic operation to obtain operation results, and the operation results are summed to obtain the total score of the current index calculated by each computing node under each topic.

Continuing with the above example, since the score of participle 1 under topic 1 is 0.33 and the score of participle 2 under topic 1 is 0.17, the total score of the current index under topic 1 calculated by computing node 1 is log(0.33) + log(0.17). Since the score of participle 1 under topic 2 is 0.5 and the score of participle 2 under topic 2 is 0.5, the total score of the current index under topic 2 calculated by computing node 1 is log(0.5) + log(0.5).

In this embodiment, taking the logarithm of the score of each current participle under each topic before summing reflects that every participle matters in determining the classification topic; that is, the combined effect of multiple participles is strengthened, which in turn improves the accuracy of index classification.

Step S4044, the topic with the maximum total score calculated by each computing node is used as the classification topic of the current index obtained by the corresponding computing node.

Continuing the description of the above example, since the total score of the current index under the topic 2 calculated by the computing node 1 is greater than the total score of the current index under the topic 1, the computing node 1 obtains the classification topic of the current index as the topic 2.
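
Steps S4042 to S4044 for a single computing node can be sketched as below, using the model assumed for computing node 1. The function name `node_vote` is hypothetical, the natural logarithm is assumed (the text only says "log", and the base does not change which topic scores highest), and the `missing_score` fallback for a participle that has no cached score under a topic is an assumption, since the text does not say how that case is handled.

```python
import math
from typing import Dict, List, Tuple


def node_vote(participles: List[str],
              model: Dict[str, Dict[str, float]],
              missing_score: float = 1e-6) -> Tuple[str, Dict[str, float]]:
    """Return the topic one node votes for and its total score per topic."""
    totals: Dict[str, float] = {}
    for topic, scores in model.items():
        # Sum of logarithms of the cached scores of the current participles.
        # missing_score is a hypothetical floor for uncached participles.
        totals[topic] = sum(math.log(scores.get(p, missing_score))
                            for p in participles)
    best = max(totals, key=lambda t: totals[t])
    return best, totals


# Model cached by computing node 1, with the rounded scores from the text.
model = {
    "topic 1": {"participle 1": 0.33, "participle 2": 0.17,
                "participle 3": 0.33, "participle 4": 0.33},
    "topic 2": {"participle 1": 0.5, "participle 2": 0.5},
}

vote, totals = node_vote(["participle 1", "participle 2"], model)
print(vote, totals)
```

Because each lookup is a dictionary access into the cached model, the work per request in this sketch grows with the number of topics and current participles rather than with the number of labeled indexes.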

For the other computing nodes, the process of determining the classification subject of the current index is the same as that of computing node 1, and details are omitted here.

Step S405, if the number of the classification subjects with the largest number of votes is one, taking the classification subject with the largest number of votes as the classification subject of the current index.

For example, if the classification subject of the current index obtained by computing node 1, computing node 3 and computing node 4 is subject 2, and the classification subject of the current index obtained by computing node 2 is subject 1, then subject 2 is finally determined to be the classification subject of the current index.

Step S406, if the number of the classification subjects with the largest number of votes is more than one, taking the classification subject that has the largest number of votes and the largest sum of total scores as the classification subject of the current index.

For example, suppose the classification theme of the current index obtained by computing node 1 and computing node 3 is theme 2, and the classification theme obtained by computing node 2 and computing node 4 is theme 1. Let X be the sum of the total scores of the current index under theme 2 obtained by computing node 1 and computing node 3, and let Y be the sum of the total scores of the current index under theme 1 obtained by computing node 2 and computing node 4. If X is greater than Y, the classification theme of the current index is determined to be theme 2.
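
The decision rule of steps S405 and S406 can be sketched as follows. The function name `decide` and the list-of-pairs input are hypothetical, and the numeric total scores in the usage lines are made-up illustrative values, not figures from the text.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def decide(votes: List[Tuple[str, float]]) -> str:
    """Pick the final classification topic from per-node votes.

    Each entry is (topic a node voted for, that node's total score
    of the current index under that topic).
    """
    ballot: Dict[str, int] = defaultdict(int)
    score_sum: Dict[str, float] = defaultdict(float)
    for topic, total in votes:
        ballot[topic] += 1
        score_sum[topic] += total

    top = max(ballot.values())
    leaders = [t for t, n in ballot.items() if n == top]
    if len(leaders) == 1:
        # Step S405: a single topic has the most votes.
        return leaders[0]
    # Step S406: break the tie by the largest sum of total scores.
    return max(leaders, key=lambda t: score_sum[t])


# Nodes 1, 3 and 4 vote for topic 2 and node 2 votes for topic 1.
majority = decide([("topic 2", -1.4), ("topic 1", -2.0),
                   ("topic 2", -1.5), ("topic 2", -1.3)])

# Two votes each: X = -1.0 + -1.2 for topic 2, Y = -2.0 + -1.9 for topic 1.
tie_break = decide([("topic 2", -1.0), ("topic 1", -2.0),
                    ("topic 2", -1.2), ("topic 1", -1.9)])
print(majority, tie_break)
```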

According to the embodiment, the classification theme is determined for the current index according to the voting results of the plurality of computing nodes and the total score of the current index computed by the plurality of computing nodes under each theme, so that the classification bias is reduced, and the accuracy of index classification is improved.

Corresponding to the embodiment of the index classification method, the application also provides an embodiment of the index classification device.

The embodiment of the index classification device can be applied to electronic equipment. The device embodiments may be implemented by software, or by hardware, or by a combination of hardware and software. As shown in fig. 6, a hardware structure of an electronic device in which the index classifying apparatus 600 of the present application is located includes a processor 610, a memory 620, and a computer program stored in the memory 620 and capable of running on the processor 610, and when the processor 610 executes the computer program, the above-mentioned index classifying method is implemented. In addition to the processor 610 and the memory 620 shown in fig. 6, the electronic device in which the apparatus is located may also include other hardware according to the actual function of index classification, which is not described herein again.

Fig. 7 is a block diagram of an index classifying apparatus according to an exemplary embodiment of the present application, as shown in fig. 7, the apparatus including: an acquisition module 71 and a classification module 72.

The obtaining module 71 is configured to obtain a current index classification request.

When the user triggers a classification request for the current index at the front end of the service platform, the obtaining module 71 of the service platform may obtain the classification request for the current index.

The classifying module 72 is configured to perform a classifying process on the current index carried in the current index classifying request acquired by the acquiring module 71 by using an index classifying model, so as to classify the current index into a corresponding classifying topic.

The implementation process of the functions and actions of each unit in the above device is specifically described in the implementation process of the corresponding step in the above method, and is not described herein again.

In an exemplary embodiment, there is also provided a computer-readable storage medium storing a computer program for executing the index categorization method described above, wherein the index categorization method includes:

acquiring a current index classification request;

The computer readable storage medium may be a Read Only Memory (ROM), a Random Access Memory (RAM), a compact disc read only memory (CD-ROM), a magnetic tape, a floppy disk, an optical data storage device, and the like.

For the device embodiments, since they substantially correspond to the method embodiments, reference may be made to the partial description of the method embodiments for relevant points. The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and the parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules can be selected according to actual needs to achieve the purpose of the scheme of the application. One of ordinary skill in the art can understand and implement it without inventive effort.

Other embodiments of the present application will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This application is intended to cover any variations, uses, or adaptations of the invention following, in general, the principles of the application and including such departures from the present disclosure as come within known or customary practice within the art to which the invention pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the application being indicated by the following claims.

It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.

The present invention is not intended to be limited to the particular embodiments shown and described, but is to be accorded the widest scope consistent with the principles and novel features herein disclosed.

Claims

1. An index classification method, characterized in that the method comprises:

acquiring a current index classification request;

2. The method of claim 1, further comprising:

3. The method according to claim 2, wherein the determining the index classification model based on the obtained number of occurrences of each participle of the labeled index under different topics, the index number of the different topics, and the number of topics of each participle, comprises:

performing word segmentation on each obtained labeled index;

counting the occurrence times of each participle under different topics;

4. The method according to claim 2 or 3, wherein the classifying the current index carried in the current index classification request by using an index classification model to classify the current index into the corresponding classification subject comprises:

5. The method as claimed in claim 4, wherein the obtaining of the voting result of each computing node on the classification subject of the current index and the total score of the current index under the corresponding classification subject according to the classification request of the current index comprises:

performing word segmentation on the current index;

6. The method according to claim 5, wherein the performing an optimization operation on the score of each current participle cached by each computing node under each topic to obtain the total score of the current index computed by each computing node under each topic comprises:

7. The method of claim 2, wherein obtaining the labeled indicators of the pre-established topics from a data repository through a plurality of computing nodes comprises:

8. The method of claim 1, wherein the current metrics comprise real-time added metrics or offline imported metrics.

9. An index classifying apparatus, characterized in that the apparatus comprises:

10. A computer-readable storage medium, characterized in that the storage medium stores a computer program for executing the index classifying method according to any one of claims 1 to 8.

11. An electronic device comprising a processor, a memory, and a computer program stored on the memory and executable on the processor, wherein the processor implements the index classification method according to any one of claims 1 to 8 when executing the computer program.