CN113920381A - Repeated derivative index identification method, electronic device and readable storage medium - Google Patents

Info

Publication number
CN113920381A
Authority
CN
China
Prior art keywords
index
derived
derivative
preset
derived index
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202111527387.2A
Other languages
Chinese (zh)
Other versions
CN113920381B (en)
Inventor
温桂龙
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Mingyuan Cloud Technology Co Ltd
Original Assignee
Shenzhen Mingyuan Cloud Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Mingyuan Cloud Technology Co Ltd filed Critical Shenzhen Mingyuan Cloud Technology Co Ltd
Priority to CN202111527387.2A priority Critical patent/CN113920381B/en
Publication of CN113920381A publication Critical patent/CN113920381A/en
Application granted granted Critical
Publication of CN113920381B publication Critical patent/CN113920381B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G06F18/24155 Bayesian classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application discloses a repeated derived index identification method, an electronic device and a readable storage medium. The repeated derived index identification method comprises the following steps: when a request to create a new derived index is detected, receiving the first derived index currently being entered, and acquiring a second derived index from a preset data warehouse; inputting the first derived index and the second derived index into a preset Bayes classifier to obtain the repetition probability of the first derived index and the second derived index; and if the repetition probability is greater than or equal to a preset repetition probability threshold, determining that the first derived index and the second derived index are duplicates. The application solves the technical problem that identifying duplicate derived indexes in a data warehouse is inefficient.

Description

Repeated derivative index identification method, electronic device and readable storage medium
Technical Field
The present invention relates to the field of data processing technologies, and in particular to a repeated derived index identification method, an electronic device, and a computer-readable storage medium.
Background
In the big data field, a data warehouse is a subject-oriented, integrated, time-variant yet relatively stable collection of data that is usually used to support management decision-making. In an enterprise data warehouse, hundreds of derived indexes may accumulate over time. Because these indexes are created by several modelers, or by the same modeler at different times, some of them may have the same meaning. When the enterprise wants to consolidate its derived index library, derived indexes with the same meaning are difficult to tell apart, and a client does not know which one to use. Identifying such indexes manually involves a heavy workload and cumbersome operations, and is time-consuming and laborious.
Disclosure of Invention
The application mainly aims to provide a repeated derived index identification method, electronic equipment and a readable storage medium, and aims to solve the technical problem of low efficiency of identifying repeated derived indexes from a data warehouse.
In order to achieve the above object, the present application provides a duplicate derived index identification method, including:
when a request to create a new derived index is detected, receiving the first derived index currently being entered, and acquiring a second derived index from a preset data warehouse;
inputting the first derived index and the second derived index into a preset Bayes classifier to obtain the repetition probability of the first derived index and the second derived index;
and if the repetition probability is greater than or equal to a preset repetition probability threshold, determining that the first derived index and the second derived index are duplicates.
Optionally, after the step of inputting the first derived index and the second derived index into a preset Bayes classifier to obtain the repetition probability of the first derived index and the second derived index, the method further includes:
if the repetition probability is smaller than the preset repetition probability threshold, returning to the step of acquiring a second derived index from the preset data warehouse;
and if no further second derived index can be acquired from the data warehouse, determining that no derived index in the data warehouse duplicates the first derived index.
Optionally, the step of inputting the first derived index and the second derived index into a preset bayesian classifier to obtain a repetition probability of the first derived index and the second derived index includes:
acquiring a manually marked derived index sample as a training set;
training a preset Bayes classifier according to the training set;
inputting the first derived index and the second derived index into a trained Bayes classifier to obtain a repetition probability of the first derived index and the second derived index.
Optionally, the step of training a preset bayesian classifier according to the training set comprises:
determining feature values corresponding to the first derived index and the second derived index according to a preset rule, wherein the feature values comprise a name comparison value, an atomic index comparison value, a statistical granularity comparison value, a statistical period comparison value and a business limitation comparison value;
and training the preset Bayes classifier according to the name comparison value, the atomic index comparison value, the statistical granularity comparison value, the statistical period comparison value and the business limitation comparison value in combination with the training set.
Optionally, the step of training the preset Bayes classifier according to the name comparison value, the atomic index comparison value, the statistical granularity comparison value, the statistical period comparison value and the business limitation comparison value in combination with the training set comprises:
acquiring the number of words that the names of the first derived index and the second derived index have in common, and the total number of words in the names of the first derived index and the second derived index;
calculating the ratio of the number of shared words to the total number of words to obtain the name comparison value;
determining the atomic index comparison value according to whether the atomic indexes of the first derived index and the second derived index are the same;
determining the statistical granularity comparison value according to whether the statistical granularities of the first derived index and the second derived index are the same;
acquiring the statistical period components that the first derived index and the second derived index have in common, and all statistical period components of the first derived index and the second derived index;
calculating the ratio of the shared statistical period components to all statistical period components to obtain the statistical period comparison value;
acquiring the business limitation components that the first derived index and the second derived index have in common, and all business limitation components of the first derived index and the second derived index;
and calculating the ratio of the shared business limitation components to all business limitation components to obtain the business limitation comparison value.
Optionally, after the step of determining that the first derived index and the second derived index are repeated, the method further includes:
outputting prompt information indicating that a derived index duplicating the first derived index has been identified.
Optionally, after the step of outputting the prompt information indicating that a derived index duplicating the first derived index has been identified, the method further includes:
and deleting or merging the duplicated first derived index or second derived index based on a user operation received in response to the prompt information.
Optionally, the repeated derivative indicator identification method further includes:
when a derived index duplicate checking request is detected, comparing the derived indexes in the preset data warehouse with one another through the preset Bayes classifier;
and when it is determined that at least one group of duplicated target derived indexes exists in the preset data warehouse, deleting or merging the target derived indexes.
The present application further provides an electronic device, which is a physical device and includes: a memory, a processor, and a program of the repeated derived index identification method that is stored in the memory and executable on the processor, wherein the program, when executed by the processor, implements the steps of the repeated derived index identification method described above.
The present application also provides a readable storage medium, which is a computer-readable storage medium storing a program for implementing the repeated derived index identification method, wherein the program, when executed by a processor, implements the steps of the repeated derived index identification method described above.
The present application also provides a computer program product comprising a computer program which, when executed by a processor, implements the steps of the repeated derived index identification method described above.
The application provides a repeated derived index identification method, an electronic device and a readable storage medium. When a request to create a new derived index is detected, the first derived index currently being entered is received and a second derived index is acquired from a preset data warehouse, which determines the pair of derived indexes to be compared. The first derived index and the second derived index are then input into a preset Bayes classifier to obtain their repetition probability, so that the repetition probability between the derived indexes to be compared is determined automatically by the Bayes classifier. If the repetition probability is greater than or equal to a preset repetition probability threshold, the first derived index and the second derived index are determined to be duplicates. Duplicate derived indexes are thus identified from the repetition probability rather than by manual inspection, which greatly improves identification efficiency and solves the technical problem that identifying duplicate derived indexes in a data warehouse is inefficient.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present application and together with the description, serve to explain the principles of the application.
In order to illustrate the embodiments of the present application or the technical solutions in the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Those skilled in the art can obtain other drawings from these drawings without inventive effort.
FIG. 1 is a schematic flow chart diagram illustrating an embodiment of a repeated derivative indicator identification method according to the present application;
FIG. 2 is a schematic flow chart illustrating another embodiment of a duplicate derivative indicator identification method according to the present application;
FIG. 3 is a schematic structural diagram of the device in the hardware operating environment related to the repeated derived index identification method in an embodiment of the present application.
The objectives, features, and advantages of the present application will be further described with reference to the accompanying drawings.
Detailed Description
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in detail below. It is to be understood that the described embodiments are merely exemplary of the invention, and not restrictive of the full scope of the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
In an embodiment of the repeated derivative index identification method, referring to fig. 1, the repeated derivative index identification method includes:
Step S10, when a request to create a new derived index is detected, receiving the first derived index currently being entered, and acquiring a second derived index from a preset data warehouse;
In this embodiment, it should be noted that the data warehouse is a structured data environment that serves as the data source for decision support systems and online analytical applications. A derived index is a statistical index commonly used in dimensional modeling of a data warehouse; it is generated from an atomic index by delimiting the business scope of the statistics, so that statistical indexes are standardized, well-formed and unambiguous. A derived index includes at least one characteristic attribute, and the characteristic attributes are the components of the derived index. The characteristic attributes may be unified through a standardized data system or through a preset derived index entry rule, and include an atomic index, a business limitation, a statistical period, a statistical granularity and/or a subject domain, among others. The atomic index is an abstraction of the statistical caliber and the specific algorithm of an index; it is a standardized definition of calculation logic, cannot be decomposed further in business terms, and carries a definite business name, such as payment amount. The business limitation represents the business scope of the statistics: it filters out the records that satisfy a business rule and is a standardized definition of a conditional restriction. The statistical period, which may also be called the time period, represents the time range of the statistics and defines the time range or time point over which data are counted. The statistical granularity is the object or perspective of the statistical analysis; it defines the level to which data are summarized and can be understood as the grouping condition of an aggregation operation. The granularity is a combination of dimensions and indicates the scope of the statistics; for example, if an index is the transaction amount of a certain seller in a certain province, the granularity is the combination of the seller dimension and the province dimension. As a further example, if the derived index is the travel expense amount of department A in the first quarter of 2021, the atomic index is the expense amount, the business limitation is business travel, the statistical period is 2021/1/1 to 2021/3/31, and the statistical granularity is department A. The subject domain groups indexes of different concepts within the same business segment; for example, a commodity domain, a transaction domain and a member domain may be defined to hold indexes with different meanings.
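The characteristic attributes listed above can be pictured as a small record type. The following is a minimal sketch only: the class name `DerivedIndex`, its field names and the choice of Python types are illustrative assumptions and are not taken from the application.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class DerivedIndex:
    """Hypothetical record holding the characteristic attributes of one derived index."""
    name: str
    atomic_index: str            # indivisible business measure, e.g. "expense amount"
    business_limits: frozenset   # conditions that restrict the business scope
    period: tuple                # (start, end) of the statistical period
    granularity: frozenset       # combination of dimensions used for grouping
    subject_domain: str = ""     # optional grouping within a business segment


# The travel-expense example from the paragraph above.
q1_travel = DerivedIndex(
    name="Department A travel expense amount, Q1 2021",
    atomic_index="expense amount",
    business_limits=frozenset({"business travel"}),
    period=(date(2021, 1, 1), date(2021, 3, 31)),
    granularity=frozenset({"department A"}),
)
```

Storing the business limitation and the granularity as sets makes the later component-by-component comparisons straightforward; this is a design choice of the sketch, not a requirement of the method.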
Specifically, when a request to create a new derived index in the data warehouse is detected, the first derived index entered for that request is received, and a derived index already stored in the preset data warehouse is acquired as the second derived index, where the second derived index is a stored derived index that has not yet been compared with the first derived index for duplicate identification. The second derived index may be any stored derived index taken from the data warehouse, or the next derived index taken from the data warehouse in a preset order (for example, a preset time order or a preset arrangement order); this embodiment is not limited in this respect.
Step S20, inputting the first derived index and the second derived index into a preset Bayes classifier to obtain the repetition probability of the first derived index and the second derived index;
In this embodiment, it should be noted that the Bayes classifier includes a naive Bayes classifier and the like. Naive Bayes classifiers are a family of simple probabilistic classifiers based on Bayes' theorem under a strong (naive) assumption of independence between the feature attributes. Here the classifier analyzes the feature attributes of the two input derived indexes and computes their repetition probability and non-repetition probability. The feature attributes are fixed when the Bayes classifier is constructed; when duplicate identification is performed with the Bayes classifier, the feature attributes are extracted from the derived indexes, feature values are calculated from them, and the repetition probability and the non-repetition probability are calculated from the feature values. The feature attributes include an atomic index, a business limitation, a statistical period, a statistical granularity and/or a subject domain, and the feature values may be similarities, repetition rates, scores or the like.
Specifically, the first derived index and the second derived index are input into a preset Bayes classifier to obtain a repetition probability and a non-repetition probability of the first derived index and the second derived index, and the repetition probability is obtained from an analysis result of the Bayes classifier.
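To make the role of the classifier concrete, the sketch below computes a repetition probability from discrete feature values under the naive independence assumption. It is an illustration only: the function name, the class labels `"dup"` and `"not_dup"`, the feature names and all probabilities in the example are hypothetical, and the application does not specify which naive Bayes variant is used.

```python
import math


def repetition_probability(features, prior_dup, likelihoods):
    """Posterior P(duplicate | features) for a two-class naive Bayes model.

    features:    dict mapping feature name -> observed discrete value
    prior_dup:   prior probability that a pair is a duplicate
    likelihoods: dict label -> feature name -> value -> P(value | label)
    """
    # Work in log space so that many small factors do not underflow.
    log_score = {
        "dup": math.log(prior_dup),
        "not_dup": math.log(1.0 - prior_dup),
    }
    for label in log_score:
        for feat, value in features.items():
            log_score[label] += math.log(likelihoods[label][feat][value])
    # Normalise the two class scores into a probability.
    top = max(log_score.values())
    score = {label: math.exp(v - top) for label, v in log_score.items()}
    return score["dup"] / (score["dup"] + score["not_dup"])


# Hypothetical model parameters for two binary features.
example_likelihoods = {
    "dup": {"same_atomic": {True: 0.9, False: 0.1},
            "same_granularity": {True: 0.8, False: 0.2}},
    "not_dup": {"same_atomic": {True: 0.2, False: 0.8},
                "same_granularity": {True: 0.4, False: 0.6}},
}
p_dup = repetition_probability(
    {"same_atomic": True, "same_granularity": True}, 0.5, example_likelihoods
)
```

With these made-up numbers the duplicate class scores 0.5 × 0.9 × 0.8 = 0.36 and the non-duplicate class 0.5 × 0.2 × 0.4 = 0.04, so the repetition probability is 0.36 / 0.40 = 0.9 and the non-repetition probability is its complement.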
Step S30, if the repetition probability is greater than or equal to a preset repetition probability threshold, determining that the first derived index and the second derived index are repeated.
In this embodiment, specifically, the repetition probability is compared with a preset repetition probability threshold. If the repetition probability is greater than or equal to the preset repetition probability threshold, the first derived index and the second derived index are determined to be duplicates; if the repetition probability is smaller than the preset repetition probability threshold, they are determined not to be duplicates. The repetition probability threshold may be set from prior experience and analysis, or may be adjusted during duplicate identification: as the model is trained and samples accumulate, an error is calculated and the threshold is adjusted according to that error.
Optionally, the step of inputting the first derived index and the second derived index into a preset bayesian classifier to obtain a repetition probability of the first derived index and the second derived index further includes:
Step A10, if the repetition probability is smaller than the preset repetition probability threshold, returning to the step of acquiring a second derived index from the preset data warehouse;
In this embodiment, specifically, if the repetition probability is smaller than the preset repetition probability threshold, the first derived index and the second derived index are determined not to be duplicates, and the method returns to the step of acquiring a second derived index from the preset data warehouse, so that a new second derived index is acquired and compared with the first derived index for duplicate identification.
Step A20, if no further second derived index can be acquired from the data warehouse, determining that no derived index in the data warehouse duplicates the first derived index.
In this embodiment, specifically, if no further second derived index can be acquired from the data warehouse, the first derived index has been compared with every second derived index in the preset data warehouse and no comparison yielded a repetition probability greater than or equal to the preset repetition probability threshold. Every repetition probability was therefore smaller than the threshold, and it is determined that no derived index in the data warehouse duplicates the first derived index.
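Steps S10 to S30 together with steps A10 and A20 amount to a linear scan over the stored derived indexes. A minimal sketch follows, assuming the classifier is available as a callable that returns the repetition probability of a pair; the names `find_duplicate` and `toy_score` are hypothetical, and the toy scoring function is a stand-in, not the Bayes classifier of the application.

```python
from typing import Callable, Iterable, Optional, TypeVar

T = TypeVar("T")


def find_duplicate(
    first: T,
    stored: Iterable[T],
    repetition_probability: Callable[[T, T], float],
    threshold: float,
) -> Optional[T]:
    """Return the first stored index judged a duplicate of `first`, else None."""
    for second in stored:                      # S10 / A10: fetch the next second index
        if repetition_probability(first, second) >= threshold:   # S20 + S30
            return second
    return None                                # A20: warehouse exhausted, no duplicate


def toy_score(a: str, b: str) -> float:
    """Stand-in scorer for demonstration: 1.0 for equal names ignoring case."""
    return 1.0 if a.lower() == b.lower() else 0.0


warehouse = ["Payment amount", "Order count", "Refund amount"]
hit = find_duplicate("order COUNT", warehouse, toy_score, threshold=0.8)
miss = find_duplicate("Visitor count", warehouse, toy_score, threshold=0.8)
```

The scan stops at the first match, which mirrors the flow described above; collecting every match instead would only require replacing the early return with a list.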
Optionally, after the step of determining that the first derived index and the second derived index are repeated, the method further includes:
outputting prompt information indicating that a derived index duplicating the first derived index has been identified.
In this embodiment, specifically, when it is determined that the first derived index duplicates the second derived index, prompt information indicating that a derived index duplicating the first derived index has been identified is output, to inform the user that a derived index duplicating the newly created first derived index already exists in the data warehouse.
Optionally, after the step of outputting the prompt information identifying the derived index that is repeated with the first derived index, the method further includes:
and deleting or merging the duplicated first derived index or second derived index based on a user operation received in response to the prompt information.
In this embodiment, specifically, when the prompt information indicating that a derived index duplicating the first derived index has been identified is output, a user operation generated in response to the prompt information is detected, and the duplicated first derived index or second derived index is deleted or merged based on the received user operation. The user operation includes an operation of deleting the first derived index, an operation of continuing to enter the first derived index and deleting the second derived index, an operation of continuing to enter the first derived index and retaining the second derived index, or an operation of merging the first derived index and the second derived index. In one implementation, the second derived index may also be displayed together with the prompt information; specifically, the characteristic attributes of the second derived index may be displayed and the attributes shared with the first derived index may be highlighted, so that the user knows the details of the second derived index and can perform the subsequent operation more accurately.
In this embodiment, by deleting or merging the repeated derived indexes, it is possible to effectively avoid increasing the data volume of unnecessary derived indexes in the data warehouse, which is beneficial to data management of the data warehouse, and makes the selection of the derived indexes more explicit and simple when the client uses the data warehouse.
Optionally, the repeated derivative indicator identification method further includes:
Step B10, when a derived index duplicate checking request is detected, comparing the derived indexes in the preset data warehouse with one another through the preset Bayes classifier;
In this embodiment, specifically, when a derived index duplicate checking request is detected, any two derived indexes to be compared are acquired from the preset data warehouse and input into the preset Bayes classifier to obtain their repetition probability. If the repetition probability is greater than or equal to the preset repetition probability threshold, the two derived indexes are determined to be duplicates; if the repetition probability is smaller than the preset repetition probability threshold, they are determined not to be duplicates. This is repeated until every pair of derived indexes in the data warehouse has been compared through the preset Bayes classifier.
Step B20, when it is determined that at least one group of duplicated target derived indexes exists in the preset data warehouse, deleting or merging the target derived indexes.
In this embodiment, specifically, when it is determined that at least one group of duplicated target derived indexes exists in the preset data warehouse, the target derived indexes are deleted or merged based on user operation information, where each group of target derived indexes includes two or more derived indexes that duplicate one another. For example, if duplicate checking determines that derived index A and derived index B in the data warehouse duplicate each other, derived indexes A and B form one group of target derived indexes, and this group is deleted or merged. If duplicate checking determines that derived indexes A and B duplicate each other and that derived indexes C, D and E duplicate one another, derived indexes A and B form one group of target derived indexes, derived indexes C, D and E form another group, and the two groups are deleted or merged separately.
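The pairwise comparison of step B10 and the grouping of step B20 can be sketched as follows. The sketch assumes that duplication is treated as transitive (if C duplicates D and D duplicates E, then C, D and E form one group) and uses a union-find structure to merge pairs into groups; that assumption, the function name `duplicate_groups` and the stand-in scoring function are illustrative and are not stated in the application.

```python
from itertools import combinations


def duplicate_groups(indexes, repetition_probability, threshold):
    """Compare every pair once and return groups of mutually duplicated indexes."""
    parent = list(range(len(indexes)))

    def find(i):
        # Follow parent links to the group representative, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(indexes)), 2):            # step B10
        if repetition_probability(indexes[i], indexes[j]) >= threshold:
            parent[find(i)] = find(j)                            # merge the two groups

    groups = {}
    for i, item in enumerate(indexes):
        groups.setdefault(find(i), []).append(item)
    # Only groups with two or more members are duplicated groups (step B20).
    return [members for members in groups.values() if len(members) > 1]


# Stand-in scorer: the pairs below are declared duplicates for demonstration.
known_pairs = {frozenset("AB"), frozenset("CD"), frozenset("DE")}
toy = lambda a, b: 1.0 if frozenset((a, b)) in known_pairs else 0.0
found = duplicate_groups(list("ABCDEF"), toy, threshold=0.5)
```

Comparing every pair takes a number of classifier calls that grows quadratically with the number of stored indexes, which is acceptable for the hundreds of derived indexes mentioned in the background but would need pre-filtering for much larger libraries.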
In this embodiment, the derived indexes stored in the data warehouse can be cleaned and adjusted by sending a duplicate checking request, so as to reduce the number of repeated derived indexes in the data warehouse, effectively avoid increasing the data volume of unnecessary derived indexes in the data warehouse, facilitate data management of the data warehouse, and enable selection of the derived indexes by a client during use to be more clear and simple.
In this embodiment, when a request to create a new derived index is detected, the first derived index currently being entered is received and a second derived index is acquired from the preset data warehouse, which determines the pair of derived indexes to be compared. The first derived index and the second derived index are then input into the preset Bayes classifier to obtain their repetition probability, so that the repetition probability between the derived indexes to be compared is determined automatically by the Bayes classifier. If the repetition probability is greater than or equal to the preset repetition probability threshold, the first derived index and the second derived index are determined to be duplicates. Duplicate derived indexes are thus identified from the repetition probability, which greatly improves identification efficiency and solves the technical problem that identifying duplicate derived indexes in a data warehouse is inefficient.
Further, referring to fig. 2, based on the foregoing embodiment of the present application, in another embodiment of the present application, the same or similar contents to the foregoing embodiment may be referred to the foregoing description, and are not repeated herein. On this basis, the step of inputting the first derived index and the second derived index into a preset bayesian classifier to obtain the repetition probability of the first derived index and the second derived index comprises:
step S21, acquiring a manually labeled derived index sample as a training set;
step S22, training a preset Bayes classifier according to the training set;
In this embodiment, specifically, derived index samples are acquired and manually labeled as duplicated or not duplicated, the manually labeled samples are used as the training set, and the training set is input into the preset Bayes classifier for training. It should be noted that the process of training the Bayes classifier in this embodiment is essentially the same as in the prior art and differs only in the input data.
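As an illustration of training from manually labeled pairs, the sketch below estimates the parameters of a discrete naive Bayes model by counting, with Laplace smoothing. It assumes that every sample carries the same discrete features; the continuous ratios used as feature values elsewhere in this description would first have to be binned, or a Gaussian variant used instead. The function name, the smoothing constant and the toy samples are assumptions.

```python
from collections import Counter, defaultdict


def train_naive_bayes(samples, alpha=1.0):
    """Estimate priors and per-feature likelihoods from labeled samples.

    samples: list of (features, label) where features is a dict of discrete
             values and label is True (duplicate) or False (not duplicate).
    alpha:   Laplace smoothing constant, so unseen values keep non-zero probability.
    """
    label_counts = Counter(label for _, label in samples)
    value_counts = defaultdict(Counter)   # (label, feature) -> value -> count
    values_seen = defaultdict(set)        # feature -> all values observed
    for features, label in samples:
        for feat, value in features.items():
            value_counts[(label, feat)][value] += 1
            values_seen[feat].add(value)

    priors = {label: n / len(samples) for label, n in label_counts.items()}
    likelihoods = {}
    for label, n in label_counts.items():
        likelihoods[label] = {}
        for feat, values in values_seen.items():
            denom = n + alpha * len(values)
            likelihoods[label][feat] = {
                v: (value_counts[(label, feat)][v] + alpha) / denom for v in values
            }
    return priors, likelihoods


# Toy training set: eight manually labeled pairs with one binary feature.
training_set = (
    [({"same_atomic": True}, True)] * 3 + [({"same_atomic": False}, True)]
    + [({"same_atomic": False}, False)] * 3 + [({"same_atomic": True}, False)]
)
priors, likelihoods = train_naive_bayes(training_set)
```

In this toy set, three of the four duplicated pairs share an atomic index, so the smoothed estimate of P(same_atomic = True | duplicate) is (3 + 1) / (4 + 2) = 2/3.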
Optionally, the step of training a preset bayesian classifier according to the training set comprises:
Step C10, determining feature values corresponding to the first derived index and the second derived index according to a preset rule, wherein the feature values include a name comparison value, an atomic index comparison value, a statistical granularity comparison value, a statistical period comparison value and a business limitation comparison value;
In this embodiment, specifically, the name, atomic index, business limitation, statistical period and statistical granularity of each of the first derived index and the second derived index are acquired, and are preprocessed and evaluated according to a preset rule to determine the name comparison value, atomic index comparison value, statistical granularity comparison value, statistical period comparison value and business limitation comparison value of the two derived indexes. The name comparison value represents how much the names of the first and second derived indexes differ, for example whether the names are identical, the proportion of identical words in the names, or the proportion of words with the same meaning in the names. The atomic index comparison value represents how much the atomic indexes of the two derived indexes differ, for example whether the atomic indexes are the same or whether they have the same meaning. The statistical granularity comparison value represents how much the statistical granularities of the two derived indexes differ, for example whether the granularities are the same or whether one is subordinate to the other. The statistical period comparison value represents how much the statistical periods of the two derived indexes differ, for example whether the periods are the same, the proportion of the periods that covers the same time range, or the proportion of identical components among the components of the periods. For example, when statistical period A is >2021.1.1 and <2021.3.28 and statistical period B is >2021.1.1 and <2021.3.22, and the components of a statistical period are defined as its two time endpoints, the proportion of identical components between periods A and B is 0.5. The business limitation comparison value represents how much the business limitations of the two derived indexes differ, for example whether the business limitations are the same, the proportion of the business scope that is shared, or the proportion of identical components among the components of the business limitations. For example, when business limitation A is Liaoning, Heilongjiang or Jilin and business limitation B is Liaoning, and the components of a business limitation are defined as provinces, the proportion of identical components between business limitations A and B is 1/3.
Optionally, the step of training a preset Bayesian classifier according to the name contrast value, the atomic index contrast value, the statistical granularity contrast value, the statistical period contrast value and the business limitation contrast value in combination with the training set comprises:
Step C11, obtaining the number of identical words in the names of the first derived index and the second derived index, and the total number of words in the names of the first derived index and the second derived index;
Step C12, calculating the ratio of the number of identical words to the total number of words to obtain the name contrast value;
In this embodiment, specifically, the names of the first derived index and the second derived index are obtained, the number of identical words between the name of the first derived index and the name of the second derived index is determined, the total number of words in the two names is calculated, and the ratio of the number of identical words to the total number of words is calculated to obtain the name contrast value. The ratio is calculated by multiplying the number of identical words between the two names by 2 and dividing the result by the total number of words.
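The name contrast calculation described above can be sketched as follows. This is only an illustrative sketch: the function name, the default whitespace tokenizer, and the use of a multiset intersection to count identical words are assumptions, since the description does not specify how names are split into words or how repeated words are counted.

```python
from collections import Counter


def name_contrast_value(name_a: str, name_b: str, tokenize=str.split) -> float:
    """Return 2 * (identical words) / (total words in both names)."""
    words_a = Counter(tokenize(name_a))
    words_b = Counter(tokenize(name_b))
    total = sum(words_a.values()) + sum(words_b.values())
    if total == 0:
        # Assumption: two empty names give no evidence of similarity.
        return 0.0
    # Multiset intersection: a word counts once per occurrence present in both names.
    identical = sum((words_a & words_b).values())
    return 2 * identical / total


# Two of the three words match in each name: 2 * 2 / 6.
example = name_contrast_value("monthly sales amount", "monthly sales total")
```

For names written without spaces, a character-level tokenizer such as `tokenize=list` could be passed instead.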
Step C13, determining an atomic index contrast value according to whether the atomic indexes of the first derived index and the second derived index are the same;
In this embodiment, specifically, it is determined whether the atomic indexes of the first derived index and the second derived index are the same. If they are the same, the atomic index contrast value is determined to be a preset first value; if they are not the same, the atomic index contrast value is determined to be a preset second value. The preset first value and the preset second value may be set according to the actual situation, for example, the preset first value is set to 1 or 0.9, and the preset second value is set to 0 or 0.05.
Step C14, determining a statistical granularity contrast value according to whether the statistical granularities of the first derived index and the second derived index are the same;
In this embodiment, specifically, it is determined whether the statistical granularities of the first derived index and the second derived index are the same. If they are the same, the statistical granularity contrast value is determined to be a preset third value; if they are not the same, the statistical granularity contrast value is determined to be a preset fourth value. The preset third value and the preset fourth value may be set according to the actual situation, for example, the preset third value is set to 1 or 0.95, and the preset fourth value is set to 0 or 0.1.
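Steps C13 and C14 both map an equality test to one of two preset values, which can be sketched with a single helper. The function and parameter names and the sample attribute strings are hypothetical; the preset values are taken from the examples given in the description.

```python
def equality_contrast_value(value_a, value_b, same_value=1.0, different_value=0.0):
    """Return the preset value for 'same' or 'different' attributes."""
    return same_value if value_a == value_b else different_value


# Atomic index contrast value with preset first/second values of 1 and 0.
atomic_contrast = equality_contrast_value("sales_amount", "sales_amount")

# Statistical granularity contrast value with preset third/fourth values of 0.95 and 0.1.
granularity_contrast = equality_contrast_value(
    "city", "province", same_value=0.95, different_value=0.1
)
```

Choosing values slightly inside the interval, such as 0.95 and 0.1, keeps the feature from being strictly binary, which is one possible reason the description offers them as alternatives.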
Step C15, obtaining the identical statistical period components of the first derived index and the second derived index, and all statistical period components of the first derived index and the second derived index;
Step C16, calculating the ratio of the identical statistical period components to all the statistical period components to obtain the statistical period contrast value;
In this embodiment, specifically, the statistical period components of each of the first derived index and the second derived index are obtained; the number of identical components between the statistical period components of the first derived index and those of the second derived index, and the total number of statistical period components of the first derived index and the second derived index, are determined; and the ratio of the number of identical statistical period components to the total number of statistical period components is calculated to obtain the statistical period contrast value. The ratio is calculated by multiplying the number of identical components between the statistical period components of the first derived index and those of the second derived index by 2 and dividing the result by the total number of statistical period components of the first derived index and the second derived index.
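As an illustration of steps C15 and C16, the sketch below treats each statistical period as two components, its start endpoint and its end endpoint, as in the 2021.1.1 to 2021.3.28 example given earlier in the description. The function names and the tagging of endpoints as "start" and "end" are assumptions made for this sketch.

```python
from datetime import date


def period_components(start: date, end: date) -> set:
    """Split a statistical period into its two endpoint components."""
    return {("start", start), ("end", end)}


def period_contrast_value(period_a: tuple, period_b: tuple) -> float:
    """Return 2 * (identical components) / (total components of both periods)."""
    components_a = period_components(*period_a)
    components_b = period_components(*period_b)
    total = len(components_a) + len(components_b)
    identical = len(components_a & components_b)
    return 2 * identical / total


period_a = (date(2021, 1, 1), date(2021, 3, 28))
period_b = (date(2021, 1, 1), date(2021, 3, 22))
# Only the start endpoint matches: 2 * 1 / 4 = 0.5, as in the description's example.
contrast = period_contrast_value(period_a, period_b)
```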
Step C17, obtaining the identical business limitation components of the first derived index and the second derived index, and all business limitation components of the first derived index and the second derived index;
Step C18, calculating the ratio of the identical business limitation components to all the business limitation components to obtain the business limitation contrast value.
In this embodiment, specifically, the business limitation components of each of the first derived index and the second derived index are obtained; the number of identical components between the business limitation components of the first derived index and those of the second derived index, and the total number of business limitation components of the first derived index and the second derived index, are determined; and the ratio of the number of identical business limitation components to the total number of business limitation components is calculated to obtain the business limitation contrast value. The ratio is calculated by multiplying the number of identical components between the business limitation components of the first derived index and those of the second derived index by 2 and dividing the result by the total number of business limitation components of the first derived index and the second derived index.
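The sketch below applies the ratio described in this paragraph (identical components multiplied by 2, divided by the total number of components) to business limitations represented as sets of provinces. For the province example given earlier in the description, this ratio evaluates to 0.5, whereas the 1/3 stated in that example equals the identical components divided by the distinct components of both indexes; the second function shows that alternative reading. Both function names and the set representation are assumptions for illustration only.

```python
def business_limitation_contrast_value(components_a, components_b) -> float:
    """Return 2 * (identical components) / (total components of both indexes)."""
    set_a, set_b = set(components_a), set(components_b)
    total = len(set_a) + len(set_b)
    if total == 0:
        # Assumption: two unrestricted indexes yield no overlap evidence.
        return 0.0
    return 2 * len(set_a & set_b) / total


def identical_over_distinct(components_a, components_b) -> float:
    """Alternative reading: identical components / distinct components overall."""
    set_a, set_b = set(components_a), set(components_b)
    distinct = set_a | set_b
    if not distinct:
        return 0.0
    return len(set_a & set_b) / len(distinct)


limitation_a = ["Liaoning", "Heilongjiang", "Jilin"]
limitation_b = ["Liaoning"]
doubled_ratio = business_limitation_contrast_value(limitation_a, limitation_b)  # 2 * 1 / 4
distinct_ratio = identical_over_distinct(limitation_a, limitation_b)  # 1 / 3
```

Whichever convention is chosen, it should be applied consistently to the training set and to the indexes being compared, since the classifier learns from the resulting values.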
Step C20, training a preset Bayesian classifier according to the name contrast value, the atomic index contrast value, the statistical granularity contrast value, the statistical period contrast value and the business limitation contrast value in combination with the training set.
In this embodiment, specifically, the name contrast value, the atomic index contrast value, the statistical granularity contrast value, the statistical period contrast value, and the business limitation contrast value are input, together with the training set, into the preset Bayesian classifier for training. It should be noted that the process of training the Bayesian classifier in this embodiment is basically the same as in the prior art and differs only in the input data.
Step S23, inputting the first derived index and the second derived index into a trained Bayes classifier, and obtaining the repetition probability of the first derived index and the second derived index.
In this embodiment, specifically, the first derived index and the second derived index are input into a trained bayesian classifier, so as to obtain a repetition probability and a non-repetition probability of the first derived index and the second derived index, and the repetition probability is obtained from an analysis result of the bayesian classifier.
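The training and scoring steps above can be illustrated with a small Gaussian naive Bayes model written in plain Python. The description only specifies "a Bayesian classifier", so the Gaussian naive Bayes form, the variance floor, the function names, and the toy training rows below are all assumptions chosen for illustration, not the method's actual configuration or data.

```python
import math


def train_gaussian_nb(samples, labels, var_floor=1e-3):
    """Fit a per-class prior, feature means and feature variances."""
    model = {}
    for cls in set(labels):
        rows = [x for x, y in zip(samples, labels) if y == cls]
        columns = list(zip(*rows))
        means = [sum(col) / len(rows) for col in columns]
        # Variance floor (assumption) avoids division by zero for constant features.
        variances = [
            max(sum((v - m) ** 2 for v in col) / len(rows), var_floor)
            for col, m in zip(columns, means)
        ]
        model[cls] = (len(rows) / len(samples), means, variances)
    return model


def repetition_probability(model, features):
    """Return P(repeated | features); the non-repetition probability is 1 minus this."""
    log_scores = {}
    for cls, (prior, means, variances) in model.items():
        log_p = math.log(prior)
        for x, m, var in zip(features, means, variances):
            log_p += -0.5 * math.log(2 * math.pi * var) - (x - m) ** 2 / (2 * var)
        log_scores[cls] = log_p
    peak = max(log_scores.values())  # subtract the peak for numerical stability
    weights = {cls: math.exp(s - peak) for cls, s in log_scores.items()}
    return weights.get(1, 0.0) / sum(weights.values())


# Hypothetical manually labelled pairs: [name, atomic index, granularity,
# period, business limitation] contrast values; label 1 = repeated, 0 = not.
train_x = [
    [1.0, 1.0, 1.0, 1.0, 1.0],
    [0.8, 1.0, 1.0, 0.5, 1.0],
    [0.9, 1.0, 1.0, 1.0, 0.5],
    [0.2, 0.0, 0.0, 0.0, 0.0],
    [0.4, 0.0, 1.0, 0.5, 0.0],
    [0.1, 0.0, 0.0, 0.5, 0.5],
]
train_y = [1, 1, 1, 0, 0, 0]

model = train_gaussian_nb(train_x, train_y)
p_repeated = repetition_probability(model, [0.9, 1.0, 1.0, 1.0, 1.0])
p_not_repeated = 1.0 - p_repeated
```

In practice the feature vector for a pair would be built from the five contrast values computed in steps C11 to C18 before being passed to the trained model.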
In this embodiment, the Bayesian classifier is trained by combining five characteristic values closely related to derived indexes with the training set, which can effectively improve the accuracy of the Bayesian classifier in identifying repeated derived indexes and thereby further improve the efficiency and effect of identifying repeated derived indexes in the data warehouse.
Further, an embodiment of the present invention provides an electronic device, where the electronic device includes: at least one processor; and a memory communicatively coupled to the at least one processor; the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor to enable the at least one processor to perform the duplicate derived index identification method in the first embodiment.
Referring now to FIG. 3, shown is a schematic diagram of an electronic device suitable for use in implementing embodiments of the present disclosure. The electronic devices in the embodiments of the present disclosure may include, but are not limited to, mobile terminals such as mobile phones, notebook computers, digital broadcast receivers, PDAs (personal digital assistants), PADs (tablet computers), PMPs (portable multimedia players), in-vehicle terminals (e.g., car navigation terminals), and the like, and fixed terminals such as digital TVs, desktop computers, and the like. The electronic device shown in fig. 3 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present disclosure.
As shown in fig. 3, the electronic device may include a processing apparatus (e.g., a central processing unit, a graphic processor, etc.) that may perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM) or a program loaded from a storage apparatus into a Random Access Memory (RAM). In the RAM, various programs and data necessary for the operation of the electronic apparatus are also stored. The processing device, the ROM, and the RAM are connected to each other by a bus. An input/output (I/O) interface is also connected to the bus.
Generally, the following systems may be connected to the I/O interface: input devices including, for example, touch screens, touch pads, keyboards, mice, image sensors, microphones, accelerometers, gyroscopes, and the like; output devices including, for example, Liquid Crystal Displays (LCDs), speakers, vibrators, and the like; storage devices including, for example, magnetic tape, hard disk, etc.; and a communication device. The communication means may allow the electronic device to communicate wirelessly or by wire with other devices to exchange data. While the figures illustrate an electronic device with various systems, it is to be understood that not all illustrated systems are required to be implemented or provided. More or fewer systems may alternatively be implemented or provided.
In particular, according to an embodiment of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method illustrated in the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network via the communication means, or installed from a storage means, or installed from a ROM. The computer program, when executed by a processing device, performs the above-described functions defined in the methods of the embodiments of the present disclosure.
The electronic equipment provided by the invention adopts the repeated derived index identification method in the embodiment, and solves the technical problem of low efficiency of identifying repeated derived indexes from the data warehouse. Compared with the prior art, the beneficial effects of the electronic device provided by the embodiment of the present invention are the same as the beneficial effects of the repeated derived index identification method provided by the above embodiment, and other technical features of the electronic device are the same as those disclosed in the above embodiment method, which are not described herein again.
It should be understood that portions of the present disclosure may be implemented in hardware, software, firmware, or a combination thereof. In the foregoing description of embodiments, the particular features, structures, materials, or characteristics may be combined in any suitable manner in any one or more embodiments or examples.
The above description is only for the specific embodiments of the present invention, but the scope of the present invention is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present invention, and all the changes or substitutions should be covered within the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the appended claims.
Further, an embodiment of the present invention provides a computer-readable storage medium having stored thereon computer-readable program instructions for performing the method of duplicate derived indicator identification in the above-described embodiment.
The computer readable storage medium provided by the embodiments of the present invention may be, for example, a USB flash drive, and more generally may be, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination thereof. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present embodiment, a computer readable storage medium may be any tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable storage medium may be transmitted using any appropriate medium, including but not limited to: electrical wires, optical cables, RF (radio frequency), etc., or any suitable combination of the foregoing.
The computer-readable storage medium may be embodied in an electronic device; or may be present alone without being incorporated into the electronic device.
The computer readable storage medium carries one or more programs which, when executed by the electronic device, cause the electronic device to: when a request to add a new derived index is detected, receive the currently input first derived index, and acquire a second derived index from a preset data warehouse; input the first derived index and the second derived index into a preset Bayesian classifier to obtain the repetition probability of the first derived index and the second derived index; and if the repetition probability is greater than or equal to a preset repetition probability threshold, determine that the first derived index and the second derived index are repeated.
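The overall flow carried by the program can be sketched as a loop over the data warehouse that stops at the first existing index whose repetition probability reaches the threshold. The function name `find_repeated_index`, the threshold of 0.8, and the placeholder scoring function are hypothetical; in the method described here the score would come from the trained Bayesian classifier.

```python
from typing import Callable, Iterable, Optional, TypeVar

T = TypeVar("T")


def find_repeated_index(
    first_index: T,
    warehouse: Iterable[T],
    repetition_probability: Callable[[T, T], float],
    threshold: float = 0.8,  # hypothetical preset repetition probability threshold
) -> Optional[T]:
    """Return the first warehouse index judged to repeat first_index, else None."""
    for second_index in warehouse:
        if repetition_probability(first_index, second_index) >= threshold:
            return second_index
    # No obtainable second derived index remains: nothing in the warehouse repeats it.
    return None


def placeholder_score(a: str, b: str) -> float:
    """Stand-in for the classifier: 1.0 for identical names, otherwise 0.0."""
    return 1.0 if a == b else 0.0


existing = ["monthly sales amount", "weekly order count"]
match = find_repeated_index("weekly order count", existing, placeholder_score)
no_match = find_repeated_index("daily refund total", existing, placeholder_score)
```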
Computer program code for carrying out operations for aspects of the present disclosure may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, Smalltalk, or C++, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The modules described in the embodiments of the present disclosure may be implemented by software or hardware. In some cases, the name of a module does not constitute a limitation on the module itself.
The computer readable storage medium provided by the invention stores computer readable program instructions for executing the above repeated derived index identification method, and solves the technical problem of low efficiency in identifying repeated derived indexes from a data warehouse. Compared with the prior art, the beneficial effects of the computer-readable storage medium provided by the embodiment of the present invention are the same as the beneficial effects of the repeated derived index identification method provided by the above embodiment, and are not described herein again.
Further, an embodiment of the present invention also provides a computer program product, which includes a computer program, and when the computer program is executed by a processor, the computer program implements the steps of the above repeated derived index identification method.
The computer program product provided by the application solves the technical problem of low efficiency in identifying repeated derived indexes from a data warehouse. Compared with the prior art, the beneficial effects of the computer program product provided by the embodiment of the present invention are the same as the beneficial effects of the repeated derived index identification method provided by the above embodiment, and are not described herein again.
The above description is only a preferred embodiment of the present application, and not intended to limit the scope of the present application, and all modifications of equivalent structures and equivalent processes, which are made by the contents of the specification and the drawings, or which are directly or indirectly applied to other related technical fields, are included in the scope of the present application.

Claims (10)

1. A repeated derivative index identification method is characterized by comprising the following steps:
when a new request of the derived indexes is detected, receiving the input of the current first derived index, and acquiring a second derived index from a preset data warehouse;
inputting the first derived index and the second derived index into a preset Bayes classifier to obtain the repetition probability of the first derived index and the second derived index;
and if the repetition probability is greater than or equal to a preset repetition probability threshold value, judging that the first derived index and the second derived index are repeated.
2. The method according to claim 1, wherein the step of inputting the first derivative index and the second derivative index into a preset bayesian classifier to obtain a repetition probability of the first derivative index and the second derivative index further comprises:
if the repetition probability is smaller than a preset repetition probability threshold, returning to the execution step: acquiring a second derivative index from a preset data warehouse;
and if the second derivative index which can be obtained does not exist in the data warehouse, judging that the derivative index which is repeated with the first derivative index does not exist in the data warehouse.
3. The method according to claim 1, wherein the step of inputting the first derivative index and the second derivative index into a preset bayesian classifier to obtain a repetition probability of the first derivative index and the second derivative index comprises:
acquiring a manually marked derived index sample as a training set;
training a preset Bayes classifier according to the training set;
inputting the first derived index and the second derived index into a trained Bayes classifier to obtain a repetition probability of the first derived index and the second derived index.
4. The repeated derivative index identification method of claim 3, wherein said step of training a preset Bayesian classifier based on said training set comprises:
determining characteristic values corresponding to the first derivative index and the second derivative index according to a preset rule, wherein the characteristic values comprise a name contrast value, an atom index contrast value, a statistic granularity contrast value, a statistic cycle contrast value and a business limitation contrast value;
and training a preset Bayes classifier according to the name contrast value, the atomic index contrast value, the statistic granularity contrast value, the statistic period contrast value and the service limit contrast value in combination with the training set.
5. The repeated derivative index identification method of claim 4, wherein the step of training a preset Bayesian classifier according to the name contrast value, the atomic index contrast value, the statistical granularity contrast value, the statistical period contrast value, and the business limitation contrast value in combination with the training set comprises:
acquiring the number of identical words in the names of the first derived index and the second derived index and the total number of words in the names of the first derived index and the second derived index;
calculating the ratio of the number of identical words to the total number of words to obtain a name contrast value;
determining an atomic index contrast value according to whether the atomic indexes of the first derived index and the second derived index are the same;
determining a statistical granularity contrast value according to whether the statistical granularities of the first derived index and the second derived index are the same;
acquiring the identical statistical period components of the first derived index and the second derived index and all statistical period components of the first derived index and the second derived index;
calculating the ratio of the identical statistical period components to all the statistical period components to obtain a statistical period contrast value;
acquiring the identical business limitation components of the first derived index and the second derived index and all business limitation components of the first derived index and the second derived index;
and calculating the ratio of the identical business limitation components to all the business limitation components to obtain a business limitation contrast value.
6. The repetitive derivative indicator identification method of claim 1, wherein after the step of determining that the first derivative indicator and the second derivative indicator are repetitive, the method further comprises:
outputting prompt information identifying a derived metric that duplicates the first derived metric.
7. The repetitive derivation index identification method according to claim 6, wherein after the step of outputting a prompt message identifying the derivation index that repeats with the first derivation index, the method further comprises:
and deleting or merging the repeated first derivative index or second derivative index based on a user operation received in response to the prompt information.
8. The repetitive derivation index identification method according to claim 1, further comprising:
when a derived index duplicate checking request is detected, comparing the derived indexes in a preset data warehouse through a preset Bayes classifier;
and deleting or merging the target derived indexes when at least one group of repeated target derived indexes exists in the preset data warehouse.
9. An electronic device, characterized in that the electronic device comprises:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the steps of the repeated derived metric identification method of any of claims 1 to 8.
10. A readable storage medium having stored thereon a program for implementing a duplicate derived indicator identification method, the program being executed by a processor to implement the steps of the duplicate derived indicator identification method according to any one of claims 1 to 8.
CN202111527387.2A 2021-12-15 2021-12-15 Repeated derivative index identification method, electronic device and readable storage medium Active CN113920381B (en)

Publications (2)

Publication Number Publication Date
CN113920381A true CN113920381A (en) 2022-01-11
CN113920381B CN113920381B (en) 2022-04-15

Family

ID=79249117

Country Status (1)

Country Link
CN (1) CN113920381B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023206875A1 (en) * 2022-04-29 2023-11-02 上海跬智信息技术有限公司 Indicator distance-based indicator deduplication method and apparatus

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP2701081A1 (en) * 2012-07-16 2014-02-26 Qatar Foundation A method and system for integrating data into a database
CN106168976A (en) * 2016-07-14 2016-11-30 武汉斗鱼网络科技有限公司 A kind of specific user's method for digging based on NB Algorithm and system
CN107291745A (en) * 2016-03-31 2017-10-24 阿里巴巴集团控股有限公司 The management method and device of a kind of data target
CN109191283A (en) * 2018-08-30 2019-01-11 成都数联铭品科技有限公司 Method for prewarning risk and system
CN109947811A (en) * 2017-11-29 2019-06-28 北京京东金融科技控股有限公司 Generic features library generating method and device, storage medium, electronic equipment
CN111488269A (en) * 2019-01-29 2020-08-04 阿里巴巴集团控股有限公司 Index detection method, device and system for data warehouse
US20200358683A1 (en) * 2019-05-10 2020-11-12 Cisco Technology, Inc. Composite key performance indicators for network health monitoring
CN112988698A (en) * 2019-12-02 2021-06-18 阿里巴巴集团控股有限公司 Data processing method and device
US20210256002A1 (en) * 2020-02-18 2021-08-19 Freshworks Inc. Integrated system for entity deduplication

Also Published As

Publication number Publication date
CN113920381B (en) 2022-04-15

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant