CN113626671A - Data classification method, device and equipment based on character matching and storage medium - Google Patents

Info

Publication number
CN113626671A
Authority
CN
China
Prior art keywords
data
classification
service
model
radius
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110924846.4A
Other languages
Chinese (zh)
Inventor
谢峥
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An International Smart City Technology Co Ltd
Original Assignee
Ping An International Smart City Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An International Smart City Technology Co Ltd filed Critical Ping An International Smart City Technology Co Ltd
Priority to CN202110924846.4A priority Critical patent/CN113626671A/en
Publication of CN113626671A publication Critical patent/CN113626671A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/906 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/903 Querying
    • G06F16/90335 Query processing
    • G06F16/90344 Query processing by using string matching techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a data classification method based on character matching, comprising: acquiring business data to be classified; performing character matching between the business data and each piece of data in a preset data model to obtain matching data in the data model that matches the business data; and classifying the business data into the target data classification that corresponds to the matching data in the data model. The business data can thus be classified by simple character matching, which reduces the complex data analysis otherwise performed during data classification and takes classification efficiency into account while ensuring classification accuracy. The invention also relates to the technical field of blockchain.

Description

Data classification method, device and equipment based on character matching and storage medium
Technical Field
The invention relates to the technical field of data modeling, in particular to a data classification method and device based on character matching, computer equipment and a storage medium.
Background
With the arrival of the information society, information technology has gradually penetrated daily life and brought great convenience; communication, artificial intelligence, and Internet of Things technologies, for example, have created better living conditions. The wide use of information technology generates large amounts of data, and big data technology processes this data to provide various data services to users. In big data technology, data modeling of source data is a key link. In practice, source data is often scattered (for example, data types are not uniform and data content lacks a common standard), and providing data services directly from such source data can neither use the data efficiently nor deliver high-quality service. Data modeling of the source data is therefore usually required before big data technology is used to provide services to users.
Data modeling is a relatively complex task, and different industry applications generally have different modeling requirements. In some industry applications, for example, business data must be classified into the data classifications of a data model during data modeling. To classify business data accurately into a suitable data classification, a relatively complex analysis of the business data is often required (for example, semantic understanding of the characters in the business data, or clustering of the business data). Generally, the more complex the analysis, the more accurate the classification result, but a more complex analysis also requires a large amount of computation, which reduces classification efficiency. The prior art therefore lacks a data classification method that balances classification efficiency and classification accuracy well.
Disclosure of Invention
The invention aims to solve the technical problem that the existing data classification method cannot well take account of the efficiency and the accuracy of data classification.
In order to solve the above technical problem, a first aspect of the present invention discloses a data classification method based on character matching, including:
acquiring service data to be classified;
performing character matching on the service data and each data in a preset data model to obtain matched data matched with the service data in the data model, wherein a plurality of data classifications are preset in the data model, and each data in the data model is divided into each data classification in advance;
classifying the business data into the target data classification in the data model according to the corresponding target data classification of the matching data in the data model;
the matching data is identical data or approximate data, the identical data refers to data which is completely consistent with characters of the service data in the data model, and the approximate data refers to data which is not completely consistent with the characters of the service data in the data model and contains all the characters in the service data.
The second aspect of the present invention discloses a data classification device based on character matching, the device comprising:
the acquisition module is used for acquiring the service data to be classified;
the matching module is used for performing character matching on the service data and each data in a preset data model to obtain matched data matched with the service data in the data model, wherein a plurality of data classifications are preset in the data model, and each data in the data model is pre-classified into each data classification;
the classification module is used for classifying the business data into the target data classification in the data model according to the target data classification corresponding to the matching data in the data model;
the matching data is identical data or approximate data, the identical data refers to data which is completely consistent with characters of the service data in the data model, and the approximate data refers to data which is not completely consistent with the characters of the service data in the data model and contains all the characters in the service data.
A third aspect of the present invention discloses a computer apparatus, comprising:
a memory storing executable program code;
a processor coupled to the memory;
the processor calls the executable program code stored in the memory to execute part or all of the steps in the data classification method based on character matching disclosed by the first aspect of the invention.
In a fourth aspect of the present invention, a computer storage medium is disclosed, wherein the computer storage medium stores computer instructions, and when the computer instructions are called, the computer instructions are used to perform some or all of the steps of the data classification method based on character matching disclosed in the first aspect of the present invention.
In the embodiment of the invention, business data to be classified is first acquired; the characters in the business data are then matched against the characters in each piece of data in the existing data model to obtain matching data corresponding to the business data; finally, the business data is classified into a data classification of the data model according to the data classification of the matching data in the data model. Classification of the business data is thus achieved by simple character matching, which reduces the complex data analysis performed during data classification and takes classification efficiency into account while ensuring classification accuracy.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings needed to be used in the description of the embodiments will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
FIG. 1 is a flow chart of a data classification method based on character matching according to an embodiment of the present invention;
FIG. 2 is a schematic structural diagram of a data classification apparatus based on character matching according to an embodiment of the present invention;
FIG. 3 is a schematic structural diagram of a computer device according to an embodiment of the present invention;
FIG. 4 is a schematic structural diagram of a computer storage medium according to an embodiment of the present invention.
Detailed Description
In order to make the technical solutions of the present invention better understood, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The terms "first," "second," and the like in the description and claims of the present invention and in the above-described drawings are used for distinguishing between different objects and not for describing a particular order. Furthermore, the terms "include" and "have," as well as any variations thereof, are intended to cover non-exclusive inclusions. For example, a process, method, apparatus, product, or device that comprises a list of steps or elements is not limited to only those steps or elements listed, but may alternatively include other steps or elements not listed, or inherent to such process, method, apparatus, product, or device.
Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the invention. The appearances of the phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. It is explicitly and implicitly understood by one skilled in the art that the embodiments described herein can be combined with other embodiments.
The invention discloses a data classification method, a device, computer equipment and a storage medium based on character matching. Business data to be classified is first acquired; the characters in the business data are then matched against the characters in each piece of data in an existing data model to obtain matching data corresponding to the business data; finally, the business data is classified into a data classification of the data model according to the data classification of the matching data in the data model. Classification of the business data is thus achieved by simple character matching, which reduces the complex data analysis performed during data classification and takes classification efficiency into account while ensuring classification accuracy. Details are given below.
Example one
Referring to fig. 1, fig. 1 is a schematic flow chart illustrating a data classification method based on character matching according to an embodiment of the present invention. As shown in fig. 1, the data classification method based on character matching may include the following operations:
101. Acquire the business data to be classified.
In step 101, the business data to be classified may be relevant data crawled from the Internet or data obtained from a specified system. For example, in data modeling for some industry applications, case data can be crawled as business data from websites such as China Judgments Online and PKULaw, or obtained from an internal management system of the industry application.
102. Perform character matching between the business data and each piece of data in a preset data model to obtain matching data in the data model that matches the business data, wherein a plurality of data classifications are preset in the data model and each piece of data in the data model is assigned in advance to one of the data classifications; the matching data is identical data or approximate data, where identical data is data in the data model whose characters are completely consistent with those of the business data, and approximate data is data in the data model whose characters are not completely consistent with those of the business data but which contains all the characters of the business data.
In step 102, an existing data model typically includes a plurality of data layers, each for storing a corresponding level of classified data.
Preferably, this embodiment may be applied to data models used in some industries, which may specifically include three data layers: the first data layer is the first-level business classification and stores data of the first-level business classification, the second data layer is the second-level business classification and stores data of the second-level business classification, and the third data layer is the third-level business classification and stores data of the third-level business classification. The first-level business classification may include A, B, C, D, E, F, G, H, I, and so on. The second-level business classification subdivides the first-level business classification; for example, first-level classification A may have 19 second-level classifications such as AA, AB, and AC. The third-level business classification subdivides the second-level business classification; for example, second-level classification AA may have 16 third-level classifications such as AAA, AAB, and AAC. If data a was assigned during data modeling to first-level classification A, second-level classification AA, and third-level classification AAA, then data a is stored in classification A of the first data layer, classification AA of the second data layer, and classification AAA of the third data layer of the data model. Performing the matching operation on the business data against the data model may be a process of searching the data model for existing data that is the same as the business data, and may be implemented by character matching or by numerical matching.
For example, when the matching operation is performed by character matching, the characters in the business data are compared with the characters in the existing data of the data model. If the characters of a piece of existing data are completely consistent with the characters of the business data, the business data is determined to have identical matching data in the data model. If the characters of a piece of existing data include all the characters of the business data and also include extra characters, the business data is determined to have approximate matching data in the data model. Suppose the existing data a in the data model is the field name "propose XX suggestion department type_code". If the business data is also the field name "propose XX suggestion department type_code", data a is identical data matching the business data. If the business data is the field name "propose XX suggestion department type", data a contains the business data and has the extra characters "_code" relative to it, so the business data has matching approximate data in the data model. If the business data is the field name "propose XX suggestion department type_name", data a does not contain all the characters of the business data, so the business data has no matching data in the data model.
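The three outcomes of step 102 can be sketched in Python as follows. This is a minimal illustration rather than the patented implementation: the names `MatchKind`, `match_kind`, and `match_against_model` are hypothetical, and reading "contains all the characters of the business data" as plain substring containment is an assumption drawn from the field-name example above.

```python
from enum import Enum
from typing import Dict, Iterable


class MatchKind(Enum):
    IDENTICAL = "identical"      # characters completely consistent
    APPROXIMATE = "approximate"  # model entry contains the business data plus extra characters
    NONE = "none"                # no match in the data model


def match_kind(business: str, model_entry: str) -> MatchKind:
    """Compare one piece of business data with one data-model entry."""
    if model_entry == business:
        return MatchKind.IDENTICAL
    # Assumption: "contains all the characters" is read as substring containment.
    if business in model_entry:
        return MatchKind.APPROXIMATE
    return MatchKind.NONE


def match_against_model(business: str, entries: Iterable[str]) -> Dict[str, MatchKind]:
    """Return every model entry that matches, together with the kind of match."""
    kinds = {entry: match_kind(business, entry) for entry in entries}
    return {entry: kind for entry, kind in kinds.items() if kind is not MatchKind.NONE}
```

A looser reading, in which the model entry only needs to contain each character of the business data somewhere, would also accept reordered field names, so the substring reading is the more conservative of the two.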
103. Classify the business data into the target data classification in the data model according to the target data classification corresponding to the matching data in the data model.
Optionally, the classifying the service data into the target data classification in the data model according to the target data classification corresponding to the matching data in the data model includes:
when the matched data is the same data, determining the data classification of the same data in the data model as the target data classification, and classifying the service data into the target data classification;
and when the matched data is the approximate data, searching similar data corresponding to the business data in the data model according to a preset searching mode, determining the target data classification based on the data classification of the similar data in the data model, and classifying the business data into the target data classification.
In step 103, taking an XX industry application as an example, assume data a is the field name "propose XX suggestion department type_code". If the business data is also the field name "propose XX suggestion department type_code", the business data matches data a in the data model (that is, the identical data matching the business data is data a), and the business data is placed in the same classifications as data a (first-level classification A, second-level classification AA, and third-level classification AAA), which completes the classification of the business data in the data model. If the business data is the field name "propose XX suggestion department type_name" and no field name "propose XX suggestion department type_name" exists in the data model, the business data can be discarded and not added to the data model. If the business data is the field name "propose XX suggestion department type", the business data is not identical to data a but is partially similar to it, that is, data a is approximate data matching the business data; in this case, similar data corresponding to the business data can be searched for in the data model. If the similar data found includes the field names "propose XX suggestion department type_name" and "propose XX suggestion department type_code", both the business data and the similar data can be pushed to a preset terminal (the similar-processing operation), and an operator classifies the business data manually based on the business data and the similar data, thereby completing the modeling of the business data. If, after receiving and checking the business data and the similar data, the operator confirms that the business data best matches the field name "propose XX suggestion department type_code", the business data can be placed in the same classification as that field name, which completes the classification of the business data.
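The decision flow of step 103 can be sketched as below, assuming the data model is held as a mapping from field name to its three classification levels. The function `classify` and the callables `find_similar` and `ask_operator` are hypothetical stand-ins for the similar-data search and the manual review at the preset terminal; substring containment is again an assumed reading of approximate matching.

```python
from typing import Callable, Dict, List, Optional, Tuple

# (first-level, second-level, third-level) classification of one model entry.
Classification = Tuple[str, str, str]


def classify(
    business: str,
    model: Dict[str, Classification],
    find_similar: Callable[[str], List[str]],
    ask_operator: Callable[[str, List[str]], Optional[str]],
) -> Optional[Classification]:
    """Return the target classification, or None when the data is discarded."""
    # Identical data: reuse its classification directly.
    if business in model:
        return model[business]
    # Approximate data: some entry contains the business data (assumed substring test).
    if any(business in entry for entry in model):
        candidates = find_similar(business)
        chosen = ask_operator(business, candidates)  # manual decision at the terminal
        return model[chosen] if chosen in model else None
    # No matching data: the business data is not added to the model.
    return None
```

Keeping the search and the manual review as injected callables lets the same routing be exercised with any similarity search, including the vector-based screening described in the optional embodiments.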
It can be seen that, by implementing the data classification method based on character matching described in FIG. 1, business data to be classified is first acquired; the characters in the business data are then matched against the characters in each piece of data in an existing data model to obtain matching data corresponding to the business data; finally, the business data is classified into a data classification of the data model according to the data classification of the matching data in the data model. Classification of the business data is thus achieved by simple character matching, which reduces the complex data analysis performed during data classification and takes classification efficiency into account while ensuring classification accuracy.
In an optional embodiment, the searching for similar data corresponding to the service data in the data model according to a preset searching manner includes:
respectively mapping the service data and the data in the data model into data vectors in a data vector space;
and screening out similar data corresponding to the service data from the data of the data model based on the data vectors respectively corresponding to the service data and the data in the data model.
In this optional embodiment, the business data and the data in the data model cannot be compared directly in their original form, so both are mapped to data vectors in a data vector space; the business data can then be compared with the data in the data model, and similar data can be screened out of the data model. For example, the data "propose XX suggestion department" may map to the data vector (3, 2), and the data "propose XX suggestion department type_name", "propose XX suggestion department code", and "propose XX suggestion department type_code" may map to (3, 2.1), (3, 2.2), and (3, 2.22), respectively. The mapping of data to data vectors in the data vector space can be done by word embedding.
Therefore, by implementing the optional embodiment, the service data and the data in the data model are both mapped into the data vector in the data vector space, so that the similar data corresponding to the service data can be screened out based on the data vectors corresponding to the service data and the data in the data model.
In an optional embodiment, the screening out similar data corresponding to the business data from the data of the data model based on data vectors corresponding to the business data and the data in the data model respectively includes:
determining a current radius according to a historical radius, wherein the historical radius is the radius determined in the process of screening similar data last time, and the current radius is larger than the historical radius;
determining a data vector range according to a service data vector and the current radius, wherein the service data vector is a vector corresponding to the service data in the data vector space, and the data vector range is a circular range which is in the data vector space, takes the service data vector as a center and takes the current radius as a radius;
judging whether the number of the data vectors in the data vector range is larger than that of the data vectors in a historical data vector range, wherein the historical data vector range is a circular range which takes the service data vector as a center and the historical radius as a radius in the data vector space;
when the number of the data vectors in the data vector range is not larger than the number of the data vectors in the historical data vector range, determining the data corresponding to the data vectors in the data vector range in the data model as similar data corresponding to the business data.
In this optional embodiment, after the business data and the data in the data model are mapped to data vectors in the data vector space, similar data can be screened out of the data model as follows. An initial radius is preset, and the radius is then increased repeatedly. Each time the radius is increased, it is judged whether the number of data vectors in the data vector range of the current radius is greater than the number of data vectors in the data vector range of the previous radius (the historical data vector range), that is, whether the current range contains any data vectors that the previous range did not. If the number is not greater (no new data vectors appear in the current range), the data corresponding to the data vectors in the current range is determined to be the similar data of the business data and the screening ends. If the number is greater (new data vectors appear in the current range), the radius is increased again and the next round of screening is performed. In this way the similar data corresponding to the business data is screened out of the data of the data model. The distance between data vectors may be calculated as the Euclidean distance.
It can be seen that, by implementing this optional embodiment, the radius of the data vector range is increased repeatedly, and each time the radius is increased it is judged whether the number of data vectors in the range of the current radius is greater than the number in the range of the previous radius; if not, the data corresponding to the data vectors in the range of the current radius is determined to be the similar data of the business data, so that the similar data corresponding to the business data can be screened out of the data of the data model based on the data vectors corresponding to the business data and to the data in the data model.
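The expanding-radius screening can be sketched as follows, using the Euclidean distance mentioned above. The function names, the initial radius of 0.15, and the growth factor of 2 are illustrative assumptions; the vectors in the usage note reuse the example values from the description plus one hypothetical distant entry.

```python
import math
from typing import Dict, List, Sequence

Vector = Sequence[float]


def within_radius(center: Vector, vectors: Dict[str, Vector], radius: float) -> List[str]:
    """Names of all data vectors inside the circular range around `center`."""
    return [name for name, vec in vectors.items() if math.dist(center, vec) <= radius]


def screen_similar(
    center: Vector,
    vectors: Dict[str, Vector],
    initial_radius: float = 0.15,  # assumed starting radius
    base: float = 2.0,             # assumed growth factor, must be > 1
) -> List[str]:
    """Grow the radius until a round adds no new vectors, then return that range."""
    radius = initial_radius
    previous = within_radius(center, vectors, radius)  # historical data vector range
    while True:
        radius *= base  # each round uses a larger radius than the last
        current = within_radius(center, vectors, radius)
        if len(current) <= len(previous):
            return current  # no newly added vectors: these are the similar data
        previous = current
```

With the center (3, 2) and model vectors (3, 2.1), (3, 2.2), (3, 2.22), and a distant (10, 10), the rounds at radius 0.15, 0.3, and 0.6 contain 1, 3, and 3 vectors, so the three nearby entries are returned and the distant one is left out. The result depends on the starting radius: if the first two ranges are both empty, the sketch returns an empty list.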
In an optional embodiment, the determining a current radius according to the historical radius includes:
calculating the current radius from the historical radius in an exponential growth manner by the following formula:
$y = \log_a x$;
$z = a^{y+1}$
wherein y is an index value corresponding to the historical radius, a is a preset base number, x is the historical radius, and z is the current radius.
In this alternative embodiment, the radius of the data vector range may be increased exponentially, so that the filtering of similar data can be done quickly by using the characteristics of exponential increase. For example, the first radius is 2, the second radius is 4, and the third radius is 8. Optionally, the radius value may also be increased gradually by a fixed value, for example, the first radius value is 2, the second radius value is 4, and the third radius value is 6, and each radius is increased by 2.
It can be seen that this alternative embodiment is implemented to compute the current radius based on the historical radius in an exponentially growing manner, thereby enabling similar data screening to be done quickly using exponentially growing characteristics.
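The radius update can be written out directly from the two formulas, as in this short sketch. The function name `next_radius` and the default base of 2 are assumptions chosen to reproduce the 2, 4, 8 example; note that the pair of formulas simplifies algebraically to z = a·x.

```python
import math


def next_radius(historical_radius: float, base: float = 2.0) -> float:
    """Compute the current radius z from the historical radius x with base a."""
    if historical_radius <= 0 or base <= 1:
        raise ValueError("radius must be positive and base must exceed 1")
    y = math.log(historical_radius, base)  # y = log_a(x)
    return base ** (y + 1)                 # z = a^(y + 1), equal to a * x
```

Because the result passes through a floating-point logarithm, comparisons against exact values are better done with `math.isclose` than with `==`.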
In an optional embodiment, after the obtaining of the service data to be classified, the method further includes:
and executing preset data standardization processing on the service data to finish the standardization of the service data.
In this optional embodiment, the business data acquired in practical applications usually takes various forms and lacks a unified standard, which makes it hard to serve different application systems under one standard and tends to reduce the efficiency of data use. Therefore, when data modeling is performed on the business data, data standardization processing (described in detail later) can also be performed on it, so that the business data can serve different application systems according to a unified standard and the efficiency of data use is improved.
Therefore, by implementing the optional embodiment, when data modeling is performed on the business data, data standardization processing is also performed on the business data, so that the business data can serve different application systems according to a unified standard, and the service efficiency of the business data is improved.
In an optional embodiment, the performing a preset data normalization process on the service data includes:
judging whether the business data has a table structure Chinese field name or not;
when the business data is judged to have the Chinese field name of the table structure, converting the Chinese field name of the table structure in the business data into the English field name of the table structure according to a preset Chinese and English conversion mode;
judging whether the table structure English field names of the business data are in a preset special conversion table or not, wherein a plurality of target table structure English field names and a special conversion mode corresponding to each target table structure English field name are recorded in the special conversion table, and the target table structure English field names refer to table structure English field names needing to execute special conversion;
and when the table structure English field name of the business data is judged to be in the special conversion table, converting the table structure English field name of the business data into a special table structure English field name according to a target special conversion mode, wherein the target special conversion mode refers to a special conversion mode corresponding to the table structure English field name of the business data in the special conversion table.
In this alternative embodiment, the open-source Python package pypinyin may be used to convert the table structure Chinese field names in the service data into table structure English field names. Specifically, the pinyin initial of each Chinese character in a table structure Chinese field name may be used as the English counterpart of that character, and these initials together form the corresponding table structure English field name (i.e., the Chinese-English conversion mode). For example, the table structure Chinese field name "XX proposing recommendation department type_code" is converted into the table structure English field name "TCJCJYBMLX_DM". For certain specific table structure English field names, a further custom special conversion may then be applied. The contents of the special conversion table may be as follows:
[Figures RE-GDA0003254260270000101 and RE-GDA0003254260270000111: images of the special conversion table; the table contents are not reproduced in the text]
It can be seen from the contents of the special conversion table that, after the table structure Chinese field name "abort" is converted into the table structure English field name "ZZ", the table structure English field name "ZZ" also needs to be converted into the special table structure English field name "ZZM".
Therefore, by implementing this optional embodiment, the table structure Chinese field names in the service data are converted into table structure English field names, and the table structure English field names are further subjected to special conversion according to the preset special conversion table. This realizes the data standardization processing of the service data, so that the service data can serve different application systems according to a unified standard and the use efficiency of the service data is improved.
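The two-stage conversion described above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the applicant's implementation: the helper names are hypothetical, the special conversion mode is modeled as a plain replacement string, and only the single table entry quoted in the text ("ZZ" to "ZZM") is known. The call `lazy_pinyin(..., style=Style.FIRST_LETTER)` is part of the third-party pypinyin package, which must be installed separately; it is imported lazily so that a different initial-letter function can be passed in instead.

```python
from typing import Callable, Dict


def is_chinese_char(ch: str) -> bool:
    """True for characters in the basic CJK Unified Ideographs block."""
    return "\u4e00" <= ch <= "\u9fff"


def pypinyin_initial(ch: str) -> str:
    """First pinyin letter of one Chinese character, via the pypinyin package."""
    from pypinyin import Style, lazy_pinyin  # third-party; imported lazily

    return lazy_pinyin(ch, style=Style.FIRST_LETTER)[0]


def convert_field_name(
    chinese_name: str,
    special_table: Dict[str, str],
    initial_of: Callable[[str], str] = pypinyin_initial,
) -> str:
    """Convert a Chinese field name to an English one, then apply special rules."""
    # Stage 1: replace every Chinese character by its upper-case pinyin initial;
    # other characters (underscores, digits, Latin letters) are kept as they are.
    english = "".join(
        initial_of(ch).upper() if is_chinese_char(ch) else ch
        for ch in chinese_name
    )
    # Stage 2: names listed in the special conversion table get a custom name.
    return special_table.get(english, english)


# The only table entry quoted in the text; a real table would hold more rows.
SPECIAL_TABLE = {"ZZ": "ZZM"}
```

Passing `initial_of` as a parameter keeps the sketch testable without pypinyin and leaves room for a hand-maintained mapping where the default pinyin reading of a polyphonic character is not the intended one.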
In an optional embodiment, the performing a preset data normalization process on the service data includes:
judging whether merged data exist in the service data or not;
when the merged data exist in the service data, resetting the data type of the merged data according to the data type of the source data corresponding to the merged data;
the source data refers to original data in an upstream system, and the merged data is data obtained by merging the source data.
In this alternative embodiment, a piece of merged data is typically merged from the original data (i.e., source data) in multiple upstream systems. If the merged data is character data, the data type of the source data with the largest character length among all the source data corresponding to the merged data may be selected as the data type of the merged data. If the merged data is timestamp data, the data type of the source data with the finest time precision (i.e., the smallest time unit) among all the source data corresponding to the merged data may be selected as the data type of the merged data. For example, the source data corresponding to the merged data includes a date data type and a timestamp data type, where the timestamp data type can be accurate to nanoseconds and the date data type can be accurate to seconds, so the timestamp data type is selected as the data type of the merged data. If the merged data is numeric data, the data type of the source data with the minimum data precision among all the source data corresponding to the merged data may be selected as the data type of the merged data.
Therefore, by implementing the optional embodiment, the data type of the merged data is reset according to the data type of the source data corresponding to the merged data in the service data, so that the data standardization processing of the service data is realized, the service data can serve different application systems according to a unified standard, and the service efficiency of the service data is improved.
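The selection rules for character and timestamp data can be sketched as follows. The `SourceColumn` record, its fields and the type names are hypothetical and exist only for this illustration. The sketch reads "minimum time precision" as "smallest time unit", which is the reading consistent with the date/timestamp example in the text. Numeric data is left out because the text does not say how numeric precision is measured.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class SourceColumn:
    """Describes the data type of one source column in an upstream system."""
    type_name: str
    char_length: Optional[int] = None           # set for character data
    time_unit_seconds: Optional[float] = None   # smallest time unit, for time data


def merged_character_type(sources: List[SourceColumn]) -> str:
    """Pick the type of the source column with the largest character length."""
    return max(sources, key=lambda s: s.char_length or 0).type_name


def merged_time_type(sources: List[SourceColumn]) -> str:
    """Pick the type of the source column with the smallest time unit."""
    timed = [s for s in sources if s.time_unit_seconds is not None]
    return min(timed, key=lambda s: s.time_unit_seconds).type_name


time_sources = [
    SourceColumn("DATE", time_unit_seconds=1.0),        # accurate to seconds
    SourceColumn("TIMESTAMP", time_unit_seconds=1e-9),  # accurate to nanoseconds
]
char_sources = [
    SourceColumn("VARCHAR(20)", char_length=20),
    SourceColumn("VARCHAR(50)", char_length=50),
]
print(merged_time_type(time_sources))       # TIMESTAMP
print(merged_character_type(char_sources))  # VARCHAR(50)
```

Choosing the widest or finest source type means no source value is truncated when it is written into the merged column.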
Optionally, the data classification information obtained by the data classification method based on character matching may also be uploaded to a blockchain.
Specifically, the data classification information is obtained by running the data classification method based on character matching and records the data classification conditions, such as the acquisition time of the business data, the data source of the business data, and the data classification in the data model. Uploading the data classification information to the blockchain ensures its security as well as fairness and transparency to users. A user may download the data classification information from the blockchain to verify whether it has been tampered with. The blockchain referred to in this example is a novel application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanisms, and encryption algorithms. A blockchain is essentially a decentralized database: a series of data blocks linked to one another by cryptographic methods, where each data block contains the information of a batch of network transactions, which is used to verify the validity of the information (anti-counterfeiting) and to generate the next block. The blockchain may include a blockchain underlying platform, a platform product service layer, an application service layer, and the like.
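The tamper check mentioned above can be illustrated with a toy hash chain. This is only a local sketch of the "data blocks linked by cryptographic methods" idea; it has no consensus mechanism, no distribution and no real blockchain platform behind it, and the record fields and function names are assumptions made for the example.

```python
import hashlib
import json
from typing import Dict, List

GENESIS_HASH = "0" * 64  # placeholder predecessor for the first block


def block_hash(record: Dict[str, str], previous_hash: str) -> str:
    """SHA-256 over the record contents and the hash of the preceding block."""
    payload = json.dumps(record, sort_keys=True, ensure_ascii=False) + previous_hash
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def build_chain(records: List[Dict[str, str]]) -> List[dict]:
    """Link classification records so that each block commits to its predecessor."""
    chain = []
    previous = GENESIS_HASH
    for record in records:
        digest = block_hash(record, previous)
        chain.append({"record": record, "previous_hash": previous, "hash": digest})
        previous = digest
    return chain


def is_untampered(chain: List[dict]) -> bool:
    """Recompute every hash; any edited record or broken link makes this False."""
    previous = GENESIS_HASH
    for block in chain:
        if block["previous_hash"] != previous:
            return False
        if block_hash(block["record"], previous) != block["hash"]:
            return False
        previous = block["hash"]
    return True
```

Because each hash covers the previous block's hash, changing one stored classification record invalidates that block and every block after it.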
Example two
Referring to fig. 2, fig. 2 is a schematic structural diagram of a data classification device based on character matching according to an embodiment of the present invention. As shown in fig. 2, the data classification apparatus based on character matching may include:
an obtaining module 201, configured to obtain service data to be classified;
a matching module 202, configured to perform character matching on the service data and each data in a preset data model to obtain matching data in the data model, where the data model is preset with a plurality of data classifications, and each data in the data model is pre-classified into each data classification;
the classification module 203 is configured to classify the service data into a target data classification in the data model according to a target data classification corresponding to the matching data in the data model;
the matching data is identical data or approximate data, the identical data refers to data which is completely consistent with characters of the service data in the data model, and the approximate data refers to data which is not completely consistent with the characters of the service data in the data model and contains all the characters in the service data.
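The distinction between identical data and approximate data can be sketched as follows. The function names and the dictionary-based data model are assumptions. "Contains all the characters in the service data" is read here as set containment of characters, which is one possible reading; a substring test would be a stricter alternative.

```python
from typing import Dict, Optional, Tuple


def match_kind(service_data: str, model_data: str) -> Optional[str]:
    """Classify how one model entry matches the service data, if at all."""
    if model_data == service_data:
        return "identical"
    # Not identical, but every character of the service data occurs in the entry.
    if set(service_data) <= set(model_data):
        return "approximate"
    return None


def find_match(service_data: str, model: Dict[str, str]) -> Optional[Tuple[str, str]]:
    """Return (model entry, kind), preferring an identical match over an approximate one.

    `model` maps each data entry to its preset data classification.
    """
    approximate = None
    for entry in model:
        kind = match_kind(service_data, entry)
        if kind == "identical":
            return entry, kind
        if kind == "approximate" and approximate is None:
            approximate = (entry, kind)
    return approximate


model = {"customer name": "customer", "order id": "order"}
print(find_match("order id", model))  # ('order id', 'identical')
print(find_match("name", model))      # ('customer name', 'approximate')
```

For an identical match the classification can be read directly from `model`; for an approximate match the embodiments go on to search for similar data before deciding the target classification.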
In an optional embodiment, the classification module 203 classifies the service data into the target data classification in the data model according to the target data classification corresponding to the matching data in the data model in a specific manner:
when the matched data is the same data, determining the data classification of the same data in the data model as the target data classification, and classifying the service data into the target data classification;
and when the matched data is the approximate data, searching similar data corresponding to the business data in the data model according to a preset searching mode, determining the target data classification based on the data classification of the similar data in the data model, and classifying the business data into the target data classification.
In an optional embodiment, a specific way for the classification module 203 to search for similar data corresponding to the service data in the data model according to a preset search way is as follows:
respectively mapping the service data and the data in the data model into data vectors in a data vector space;
and screening out similar data corresponding to the service data from the data of the data model based on the data vectors respectively corresponding to the service data and the data in the data model.
In an optional embodiment, the classifying module 203 screens out similar data corresponding to the service data from the data of the data model based on data vectors corresponding to the service data and the data in the data model in a specific manner that:
determining a current radius according to a historical radius, wherein the historical radius is the radius determined in the process of screening similar data last time, and the current radius is larger than the historical radius;
determining a data vector range according to a service data vector and the current radius, wherein the service data vector is a vector corresponding to the service data in the data vector space, and the data vector range is a circular range which is in the data vector space, takes the service data vector as a center and takes the current radius as a radius;
judging whether the number of the data vectors in the data vector range is larger than that of the data vectors in a historical data vector range, wherein the historical data vector range is a circular range which takes the service data vector as a center and the historical radius as a radius in the data vector space;
when the number of the data vectors in the data vector range is not larger than the number of the data vectors in the historical data vector range, determining the data corresponding to the data vectors in the data vector range in the data model as similar data corresponding to the business data.
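The screening steps above can be sketched as a loop that keeps enlarging the radius until the number of vectors in range stops growing. Several points are assumptions made for this sketch: the vectors are taken as already computed (the text does not specify the mapping), Euclidean distance is used, the radius is multiplied by a fixed base each round, the search repeats when the count does grow, and `max_rounds` is a safety limit that does not appear in the text.

```python
import math
from typing import Dict, List, Sequence

Vector = Sequence[float]


def within(center: Vector, vectors: Dict[str, Vector], radius: float) -> List[str]:
    """Keys of all model vectors no farther than `radius` from `center`."""
    return [key for key, vec in vectors.items() if math.dist(center, vec) <= radius]


def find_similar(
    center: Vector,
    vectors: Dict[str, Vector],
    first_radius: float = 2.0,
    base: float = 2.0,
    max_rounds: int = 32,
) -> List[str]:
    """Grow the radius until the number of vectors in range no longer increases."""
    historical = first_radius
    previous = within(center, vectors, historical)
    for _ in range(max_rounds):
        current = historical * base            # current radius > historical radius
        found = within(center, vectors, current)
        if len(found) <= len(previous):        # count did not grow: stop here
            return found
        historical, previous = current, found  # count grew: try a larger radius
    return previous


vectors = {"a": (1.0, 0.0), "b": (0.0, 3.0), "c": (100.0, 0.0)}
print(find_similar((0.0, 0.0), vectors))  # ['a', 'b']
```

In the usage example, radius 2 covers only "a", radius 4 covers "a" and "b", and radius 8 adds nothing, so the search stops before ever reaching the distant vector "c".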
In an optional embodiment, the specific manner for determining the current radius according to the historical radius by the classification module 203 is as follows:
calculating the current radius from the historical radius in an exponential growth manner by the following formula:
$y = \log_a x$;
$z = a^{y+1}$
wherein y is an index value corresponding to the historical radius, a is a preset base number, x is the historical radius, and z is the current radius.
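As a worked example of the two formulas, take the preset base a = 2 and a historical radius x = 2 (values chosen to match the sample sequence 2, 4, 8 given for the method). The last line is a simplification that follows from the formulas for x > 0, a > 0 and a ≠ 1; it is not stated in the text.

```latex
% Worked example with a = 2 and historical radius x = 2
y = \log_a x = \log_2 2 = 1, \qquad z = a^{y+1} = 2^{2} = 4
% Applying the formulas again to x = 4 gives the third radius
y = \log_2 4 = 2, \qquad z = 2^{3} = 8
% In general the two steps collapse to a single multiplication by the base
z = a^{\log_a x + 1} = a \cdot a^{\log_a x} = a\,x
```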
In an optional embodiment, the apparatus further comprises:
and the data standardization processing module is used for executing preset data standardization processing on the service data so as to complete the standardization of the service data.
In an optional embodiment, the specific way for the data normalization processing module to perform the preset data normalization processing on the service data is as follows:
judging whether the business data has a table structure Chinese field name or not;
when the business data is judged to have the Chinese field name of the table structure, converting the Chinese field name of the table structure in the business data into the English field name of the table structure according to a preset Chinese and English conversion mode;
judging whether the table structure English field names of the business data are in a preset special conversion table or not, wherein a plurality of target table structure English field names and a special conversion mode corresponding to each target table structure English field name are recorded in the special conversion table, and the target table structure English field names refer to table structure English field names needing to execute special conversion;
and when the table structure English field name of the business data is judged to be in the special conversion table, converting the table structure English field name of the business data into a special table structure English field name according to a target special conversion mode, wherein the target special conversion mode refers to a special conversion mode corresponding to the table structure English field name of the business data in the special conversion table.
In an optional embodiment, the specific way for the data normalization processing module to perform the preset data normalization processing on the service data is as follows:
judging whether merged data exist in the service data or not;
when the merged data exist in the service data, resetting the data type of the merged data according to the data type of the source data corresponding to the merged data;
the source data refers to original data in an upstream system, and the merged data is data obtained by merging the source data.
For a detailed description of the data classification device based on character matching, reference may be made to the detailed description of the data classification method based on character matching; to avoid repetition, the details are not repeated here.
Example three
Referring to fig. 3, fig. 3 is a schematic structural diagram of a computer device according to an embodiment of the present invention. As shown in fig. 3, the computer apparatus may include:
a memory 301 storing executable program code;
a processor 302 connected to the memory 301;
the processor 302 calls the executable program code stored in the memory 301 to execute the steps of the data classification method based on character matching disclosed in the embodiment of the present invention.
Example four
Referring to fig. 4, an embodiment of the present invention discloses a computer storage medium 401, where the computer storage medium 401 stores computer instructions, and the computer instructions, when called, are used to execute steps in a data classification method based on character matching disclosed in an embodiment of the present invention.
The above-described embodiments of the apparatus are merely illustrative, and the modules described as separate components may or may not be physically separate, and the components shown as modules may or may not be physical modules, may be located in one place, or may be distributed on a plurality of network modules. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
Through the above detailed description of the embodiments, those skilled in the art will clearly understand that the embodiments may be implemented by software plus a necessary general hardware platform, and may also be implemented by hardware. Based on such understanding, the above technical solutions may be embodied in the form of a software product, which may be stored in a computer-readable storage medium, where the storage medium includes a Read-Only Memory (ROM), a Random Access Memory (RAM), a Programmable Read-Only Memory (PROM), an Erasable Programmable Read-Only Memory (EPROM), a One-Time Programmable Read-Only Memory (OTPROM), an Electrically Erasable Programmable Read-Only Memory (EEPROM), a Compact Disc Read-Only Memory (CD-ROM) or other optical disc memory, a magnetic disk memory, a tape memory, or any other computer-readable medium that can be used to carry or store data.
Finally, it should be noted that the data classification method, apparatus, computer device and storage medium based on character matching disclosed in the embodiments of the present invention are only preferred embodiments of the present invention, and are only used to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art will understand that the technical solutions described in the foregoing embodiments may still be modified, or some of their technical features may be equivalently replaced, and that such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims (10)

1. A data classification method based on character matching is characterized by comprising the following steps:
acquiring service data to be classified;
performing character matching on the service data and each data in a preset data model to obtain matched data matched with the service data in the data model, wherein a plurality of data classifications are preset in the data model, and each data in the data model is divided into each data classification in advance;
classifying the business data into the target data classification in the data model according to the corresponding target data classification of the matching data in the data model;
the matching data is identical data or approximate data, the identical data refers to data which is completely consistent with characters of the service data in the data model, and the approximate data refers to data which is not completely consistent with the characters of the service data in the data model and contains all the characters in the service data.
2. The character matching-based data classification method according to claim 1, wherein the classifying the business data into the target data classification in the data model according to the target data classification corresponding to the matching data in the data model comprises:
when the matched data is the same data, determining the data classification of the same data in the data model as the target data classification, and classifying the service data into the target data classification;
and when the matched data is the approximate data, searching similar data corresponding to the business data in the data model according to a preset searching mode, determining the target data classification based on the data classification of the similar data in the data model, and classifying the business data into the target data classification.
3. The data classification method based on character matching according to claim 2, wherein the searching for similar data corresponding to the service data in the data model according to a preset searching manner includes:
respectively mapping the service data and the data in the data model into data vectors in a data vector space;
and screening out similar data corresponding to the service data from the data of the data model based on the data vectors respectively corresponding to the service data and the data in the data model.
4. The method for classifying data based on character matching according to claim 3, wherein the step of screening out similar data corresponding to the business data from the data of the data model based on the data vectors corresponding to the business data and the data in the data model respectively comprises:
determining a current radius according to a historical radius, wherein the historical radius is the radius determined in the process of screening similar data last time, and the current radius is larger than the historical radius;
determining a data vector range according to a service data vector and the current radius, wherein the service data vector is a vector corresponding to the service data in the data vector space, and the data vector range is a circular range which is in the data vector space, takes the service data vector as a center and takes the current radius as a radius;
judging whether the number of the data vectors in the data vector range is larger than that of the data vectors in a historical data vector range, wherein the historical data vector range is a circular range which takes the service data vector as a center and the historical radius as a radius in the data vector space;
when the number of the data vectors in the data vector range is not larger than the number of the data vectors in the historical data vector range, determining the data corresponding to the data vectors in the data vector range in the data model as similar data corresponding to the business data.
5. The method for classifying data based on character matching according to claim 4, wherein the determining a current radius according to a historical radius comprises:
calculating the current radius from the historical radius in an exponential growth manner by the following formula:
$y = \log_a x$;
$z = a^{y+1}$
wherein y is an index value corresponding to the historical radius, a is a preset base number, x is the historical radius, and z is the current radius.
6. The method for data classification based on character matching according to any one of claims 1-5, wherein after the obtaining of the service data to be classified, the method further comprises:
judging whether the business data has a table structure Chinese field name or not;
when the business data is judged to have the Chinese field name of the table structure, converting the Chinese field name of the table structure in the business data into the English field name of the table structure according to a preset Chinese and English conversion mode;
judging whether the table structure English field names of the business data are in a preset special conversion table or not, wherein a plurality of target table structure English field names and a special conversion mode corresponding to each target table structure English field name are recorded in the special conversion table, and the target table structure English field names refer to table structure English field names needing to execute special conversion;
and when the table structure English field name of the business data is judged to be in the special conversion table, converting the table structure English field name of the business data into a special table structure English field name according to a target special conversion mode, wherein the target special conversion mode refers to a special conversion mode corresponding to the table structure English field name of the business data in the special conversion table.
7. The method for data classification based on character matching according to any one of claims 1-5, wherein after the obtaining of the service data to be classified, the method further comprises:
judging whether merged data exist in the service data or not;
when the merged data exist in the service data, resetting the data type of the merged data according to the data type of the source data corresponding to the merged data;
the source data refers to original data in an upstream system, and the merged data is data obtained by merging the source data.
8. An apparatus for classifying data based on character matching, the apparatus comprising:
the acquisition module is used for acquiring the service data to be classified;
the matching module is used for performing character matching on the service data and each data in a preset data model to obtain matched data matched with the service data in the data model, wherein a plurality of data classifications are preset in the data model, and each data in the data model is pre-classified into each data classification;
the classification module is used for classifying the business data into the target data classification in the data model according to the target data classification corresponding to the matching data in the data model;
the matching data is identical data or approximate data, the identical data refers to data which is completely consistent with characters of the service data in the data model, and the approximate data refers to data which is not completely consistent with the characters of the service data in the data model and contains all the characters in the service data.
9. A computer device, characterized in that the computer device comprises:
a memory storing executable program code;
a processor coupled to the memory;
the processor calls the executable program code stored in the memory to execute the character matching-based data classification method according to any one of claims 1 to 7.
10. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out a method for character matching based data classification according to any one of claims 1 to 7.
CN202110924846.4A 2021-08-12 2021-08-12 Data classification method, device and equipment based on character matching and storage medium Pending CN113626671A (en)

Publications (1)

Publication Number Publication Date
CN113626671A true CN113626671A (en) 2021-11-09

Family

ID=78384882

Country Status (1)

Country Link
CN (1) CN113626671A (en)

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150331903A1 (en) * 2014-05-19 2015-11-19 The Travelers Indemnity Company System for classification code selection
CN107783950A (en) * 2017-04-11 2018-03-09 平安医疗健康管理股份有限公司 Package insert processing method and processing device
CN109408561A (en) * 2018-10-17 2019-03-01 杭州骑轻尘信息技术有限公司 Business Name matching process and device
CN110069631A (en) * 2019-04-08 2019-07-30 腾讯科技(深圳)有限公司 A kind of text handling method, device and relevant device
CN111522902A (en) * 2020-03-25 2020-08-11 中国平安人寿保险股份有限公司 Data entry method and device, electronic equipment and computer readable storage medium
CN111708884A (en) * 2020-06-02 2020-09-25 上海硬通网络科技有限公司 Text classification method and device and electronic equipment
CN112307209A (en) * 2020-11-05 2021-02-02 江西高创保安服务技术有限公司 Short text classification method and system based on character vectors
CN112632292A (en) * 2020-12-23 2021-04-09 深圳壹账通智能科技有限公司 Method, device and equipment for extracting service keywords and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination