CN117952092A - Data service type identification method, device, computer equipment and storage medium - Google Patents

Data service type identification method, device, computer equipment and storage medium

Info

Publication number
CN117952092A
Authority
CN
China
Prior art keywords
pinyin
result
data
service type
automaton
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311823350.3A
Other languages
Chinese (zh)
Inventor
柳遵梁
胡定鹏
顾寅红
闻建霞
金蒙奇
朱文宇
李大骞
刘传志
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Meichuang Technology Co ltd
Original Assignee
Hangzhou Meichuang Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Meichuang Technology Co ltd filed Critical Hangzhou Meichuang Technology Co ltd
Priority to CN202311823350.3A priority Critical patent/CN117952092A/en
Publication of CN117952092A publication Critical patent/CN117952092A/en
Pending legal-status Critical Current

Landscapes

  • Document Processing Apparatus (AREA)

Abstract

The embodiment of the invention discloses a data service type identification method, a data service type identification device, computer equipment and a storage medium. The method comprises the following steps: acquiring metadata; constructing a pinyin alphabet, a pinyin full-spelling list and an English AC automaton; acquiring data to be identified; and identifying the service type of the data to be identified by adopting a pinyin alphabet, a pinyin full-spelling list and an English AC automaton so as to obtain an identification result and outputting the identification result. By implementing the method provided by the embodiment of the invention, the identification accuracy and the identification efficiency of the data service type can be improved, and the labor cost is reduced.

Description

Data service type identification method, device, computer equipment and storage medium
Technical Field
The present invention relates to a data processing method, and more particularly, to a data service type identification method, apparatus, computer device, and storage medium.
Background
As enterprises develop, the number of service types keeps growing, and their data becomes highly complex and diverse. Enterprises therefore need to classify and grade their data, so that the data assets they own are comprehensively sorted out and the security of important data is ensured. Data classification and grading is now also explicitly described and indicated, which further reflects its importance and necessity. Data classification and grading comprises two parts: data marking, and classification and grading. Data marking identifies the service type of a field through metadata. The service type of data is a predefined tag that indicates the meaning of a field; for example, if a column is labeled with the service type tag "doctor name", the column has the meaning "doctor name". Classification and grading classifies the fields whose service types have been identified and assigns them sensitivity levels. Whether the correct service type can be identified during data marking is therefore the most important step in classification and grading.
Although a field may carry a field comment, which is a detailed explanation of the field, the comment information is often missing because of the passage of time, personnel turnover, non-standard design and similar problems. In addition, there are many cases in which the service type cannot be identified from the sample data by means of dictionaries or regular expressions.
Therefore, a new method is needed to be designed, so that the identification accuracy and the identification efficiency of the data service type are improved, and the labor cost is reduced.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provide a data service type identification method, a data service type identification device, computer equipment and a storage medium.
In order to achieve the above purpose, the present invention adopts the following technical scheme: the data service type identification method comprises the following steps:
Acquiring metadata;
Constructing a pinyin alphabet, a pinyin full-spelling list and an English AC automaton;
Acquiring data to be identified;
identifying service types of the data to be identified by adopting a pinyin alphabet, a pinyin full-spelling list and an English AC automaton so as to obtain an identification result;
And outputting the identification result.
The further technical scheme is as follows: the construction of the pinyin alphabet, the pinyin full-spelling list and the English AC automaton comprises the following steps:
Loading all service types to obtain a loading result;
Constructing a pinyin alphabet according to the loading result;
constructing a pinyin full-spelling table according to the loading result;
And constructing an English AC automaton according to the English dictionary and the loading result.
The further technical scheme is as follows: the construction of the pinyin alphabet according to the loading result comprises the following steps:
Selecting business terms with the names of Chinese and the number of words of the names not less than four from the loading results to obtain screening results;
splitting the screening result according to each Chinese character to generate corresponding pinyin respectively;
And selecting the initial in the pinyin, and splicing, and reserving the mapping relation between the initial and the service term to obtain a pinyin alphabet.
The further technical scheme is as follows: the English AC automaton comprises a goto table, a fail table and a results table; the goto table is a dictionary tree; the results table determines, for a given state, whether that state corresponds to one or more pattern strings, and therefore whether to output the pattern strings and their corresponding values; the fail table stores a one-to-one relationship between states, namely the state to fall back to after a state transition fails.
The further technical scheme is as follows: the identifying the service type of the data to be identified by adopting a pinyin alphabet, a pinyin full-spelling list and an English AC automaton to obtain an identifying result comprises the following steps:
converting the field names in the data to be identified into lowercase so as to obtain a conversion result;
inquiring the corresponding pinyin initial letters from the pinyin alphabet by taking the conversion result as a search condition so as to obtain a first inquiry result;
Judging whether the first query result is that a corresponding pinyin initial exists;
If the first query result is that the corresponding pinyin initial exists, determining that the value corresponding to the pinyin initial is a service type so as to obtain an identification result.
The further technical scheme is as follows: after judging whether the first query result is that a corresponding pinyin initial exists, the method further comprises the following steps:
If the first query result is that the corresponding pinyin initial is not existed, querying the corresponding pinyin from the pinyin full-spelling table by using the conversion result as a search condition so as to obtain a second query result;
judging whether the second query result is that the corresponding pinyin exists or not;
and if the second query result is that the corresponding pinyin exists, determining that the value corresponding to the pinyin is a service type so as to obtain a recognition result.
The further technical scheme is as follows: after judging whether the second query result is that the corresponding pinyin exists, the method further includes:
The conversion result is segmented through an AC automaton according to a bidirectional longest matching algorithm, so that a segmentation result is obtained;
determining a vector for the Chinese corresponding to the word segmentation result to obtain a field vector;
Calculating the similarity between the field vector and the vector corresponding to the loading result;
and screening the loading results with the similarity meeting the requirements to obtain the identification result.
The invention also provides a data service type identification device, which comprises:
A metadata acquisition unit configured to acquire metadata;
The construction unit is used for constructing a pinyin alphabet, a pinyin full-spelling list and an English AC automaton;
the data acquisition unit to be identified is used for acquiring the data to be identified;
the type identification unit is used for identifying the service type by adopting a pinyin alphabet, a pinyin full-spelling list and an English AC automaton to the data to be identified so as to obtain an identification result;
and the output unit is used for outputting the identification result.
The invention also provides a computer device which comprises a memory and a processor, wherein the memory stores a computer program, and the processor realizes the method when executing the computer program.
The present invention also provides a storage medium storing a computer program which, when executed by a processor, implements the above method.
Compared with the prior art, the invention has the following beneficial effects: the invention uses the pinyin alphabet, the pinyin full-spelling table and the English AC automaton constructed from the metadata to identify and determine the service type of the data to be identified, thereby improving the identification accuracy and the identification efficiency of the data service type and reducing the labor cost.
The invention is further described below with reference to the drawings and specific embodiments.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required for the description of the embodiments will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is a schematic diagram of an application scenario of a data service type identification method according to an embodiment of the present invention;
Fig. 2 is a flow chart of a data service type identification method according to an embodiment of the present invention;
Fig. 3 is a schematic sub-flowchart of a data service type identification method according to an embodiment of the present invention;
Fig. 4 is a schematic sub-flowchart of a data service type identification method according to an embodiment of the present invention;
Fig. 5 is a schematic sub-flowchart of a data service type identification method according to an embodiment of the present invention;
Fig. 6 is a schematic block diagram of a data service type identification device according to an embodiment of the present invention;
Fig. 7 is a schematic block diagram of a construction unit of a data service type identification apparatus according to an embodiment of the present invention;
Fig. 8 is a schematic block diagram of a first construction subunit of the data service type identifying device according to an embodiment of the present invention;
Fig. 9 is a schematic block diagram of a type recognition unit of a data service type recognition device according to an embodiment of the present invention;
Fig. 10 is a schematic block diagram of a computer device according to an embodiment of the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are some, but not all embodiments of the invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
It should be understood that the terms "comprises" and "comprising," when used in this specification and the appended claims, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It is also to be understood that the terminology used in the description of the invention herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used in this specification and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
It should be further understood that the term "and/or" as used in the present specification and the appended claims refers to any and all possible combinations of one or more of the associated listed items, and includes such combinations.
Referring to fig. 1 and fig. 2, fig. 1 is a schematic application scenario diagram of a data service type identification method according to an embodiment of the present invention, and fig. 2 is a schematic flow chart of the method. The data service type identification method is applied to a server, which exchanges data with a terminal. The server loads all metadata, uses the service types known from the metadata to construct a pinyin alphabet, a pinyin full-spelling table and an English AC automaton, and then uses these constructed structures to identify the service type of the data to be identified and output the identification result. This improves the identification accuracy and the identification efficiency of the data service type and reduces the labor cost. For example, a field in a database (the terminal) is named ysxm; the server obtains the field through a database connection (JDBC) and matches it to the "doctor name" service type.
As shown in fig. 2, the method includes the following steps S110 to S150.
S110, acquiring metadata.
In this embodiment, metadata refers to the schemas, tables, and columns of a database. A table carries table comments, and a column carries basic information such as the column name, column comment, and sample data.
S120, constructing a pinyin alphabet, a pinyin full-spelling list and an English AC automaton.
In this embodiment, the pinyin alphabet refers to a table formed by the mapping relationship between pinyin initials and service terms, and the service terms are service terms of service types.
The pinyin full-spelling table refers to a table formed by the mapping relationship between full pinyin spellings and business terms. The English AC automaton comprises a goto table, a fail table and a results table. The goto table is a dictionary tree. The results table determines, for a given state, whether that state corresponds to one or more pattern strings, and therefore whether to output the pattern strings and their corresponding values. The fail table stores a one-to-one relationship between states, namely the state to fall back to after a state transition fails.
In one embodiment, referring to fig. 3, the step S120 may include steps S121 to S124.
S121, loading all service types to obtain a loading result.
In this embodiment, the loading result refers to all service types, specifically, the service types represent tags with predefined field meanings.
S122, constructing a pinyin alphabet according to the loading result.
In one embodiment, referring to fig. 4, the step S122 may include steps S1221 to S1223.
S1221, selecting business terms with names of Chinese and words of the names not less than four in the loading result to obtain a screening result.
In this embodiment, the filtering result refers to a service term in the loading result, where the service term name is chinese and the number of words of the name is not less than four.
S1222, splitting the screening result according to each Chinese character to generate corresponding pinyin respectively;
S1223, selecting initial letters in pinyin, splicing, and reserving mapping relation between the initial letters and business terms to obtain a pinyin alphabet.
Specifically, the service type names of some service data may contain letters, digits or symbols; because the service term is identified here by its pinyin initials, the service term name is required to be entirely Chinese. There are 23 pinyin initials in total, but counting only their first letters leaves 20 distinct letters, so the shorter a combination built from these 20 letters is, the more likely it is to be repeated. The success rate of matching field names against service term names should be as high as possible, while the cases in which the same pinyin initials correspond to several service term names should be reduced. The minimum length of the service term name therefore needs to be limited; limiting the length effectively reduces such collisions but cannot eliminate them completely. Taking both considerations into account, business terms whose names are entirely Chinese and contain four or more words are selected. The logic for generating the pinyin initials of a business term is: the service term name is split into individual Chinese characters, the pinyin of each character is generated, the first letter of each pinyin is taken and the letters are spliced together, and the mapping relation between the pinyin initials and the service terms is retained. This data structure is called the service term pinyin initial table, or the pinyin alphabet for short. The specific data format, represented in JSON, is as follows:
{
"ysxm": ["doctor name", ……],
……
}。
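The construction in steps S1221 to S1223 can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: `CHAR_PINYIN` is a hypothetical hand-written lookup standing in for a real pinyin dictionary or library, the function names are invented for the example, and the Chinese term 医师姓名 is assumed to be the original of "doctor name".

```python
from collections import defaultdict

# Hypothetical per-character pinyin lookup (assumption); a real system would
# use a complete pinyin dictionary or library instead of this small sample.
CHAR_PINYIN = {"医": "yi", "师": "shi", "姓": "xing", "名": "ming",
               "身": "shen", "份": "fen", "证": "zheng"}


def is_all_chinese(name):
    """True if every character lies in the basic CJK unified ideograph block."""
    return bool(name) and all("\u4e00" <= ch <= "\u9fff" for ch in name)


def build_initial_table(terms, min_len=4):
    """Map spliced pinyin first letters to the business terms that produce them."""
    table = defaultdict(list)
    for term in terms:
        # S1221: keep only all-Chinese names with at least min_len characters.
        if not is_all_chinese(term) or len(term) < min_len:
            continue
        # The sample lookup cannot convert every character; skip such terms.
        if any(ch not in CHAR_PINYIN for ch in term):
            continue
        # S1222 + S1223: pinyin per character, first letter of each, spliced.
        key = "".join(CHAR_PINYIN[ch][0] for ch in term)
        table[key].append(term)
    return dict(table)


table = build_initial_table(["医师姓名", "身份证", "ID号码"])
print(table)  # {'ysxm': ['医师姓名']}
```

The three-character term and the term containing Latin letters are filtered out, which mirrors the two restrictions described above. Each key maps to a list because the same initials may still correspond to several terms.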
S123, constructing a pinyin full-spelling table according to the loading result.
In this embodiment, when the pinyin full-spelling table is constructed, the names of the selected service terms must be entirely Chinese. There is no need to limit the length of the business term names, because full pinyin spellings are much less likely to be repeated than pinyin initials. The logic for generating the pinyin full-spelling table of the business terms is as follows: the service term name is split into individual Chinese characters, the pinyin of each character is generated and the results are spliced together, and the mapping relation between the full pinyin spelling and the service terms is retained. This data structure is called the service term pinyin full-spelling table, or the pinyin full-spelling table for short.
The specific data format, represented in JSON, is as follows:
{
"yishixingming": ["doctor name", ……],
……
}。
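A corresponding sketch for the full-spelling table is given below, again under assumptions: `CHAR_PINYIN` is a hypothetical sample lookup, the function name is invented, and 医师姓名 and 身份证 are assumed originals of "doctor name" and "identification card". The only differences from the initial table are that the whole pinyin of each character is spliced and that no minimum length is applied.

```python
from collections import defaultdict

# Hypothetical per-character pinyin lookup (assumption, sample data only).
CHAR_PINYIN = {"医": "yi", "师": "shi", "姓": "xing", "名": "ming",
               "身": "shen", "份": "fen", "证": "zheng"}


def build_full_spelling_table(terms):
    """Map the spliced full pinyin of each term to the terms that produce it."""
    table = defaultdict(list)
    for term in terms:
        # Every character must be convertible; with this sample lookup that
        # also guarantees the name is entirely Chinese.
        if not term or any(ch not in CHAR_PINYIN for ch in term):
            continue
        key = "".join(CHAR_PINYIN[ch] for ch in term)
        table[key].append(term)
    return dict(table)


full_table = build_full_spelling_table(["医师姓名", "身份证", "ID号码"])
print(full_table)  # {'yishixingming': ['医师姓名'], 'shenfenzheng': ['身份证']}
```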
S124, constructing an AC automaton according to the English dictionary in combination with the loading result.
Specifically, an AC automaton is constructed from the English words in a built-in English dictionary.
The built-in English dictionary is a key-value structure. The specific data format, represented in JSON, is as follows:
{
"doctor": "doctor",
"name": "name"
}。
Based on a prefix tree, the AC automaton builds a suffix tree for each node of the prefix tree. For a word such as name, the prefix tree is n-a-m-e, with n as the root node, and the suffix tree is e-m-a-n, with e as the root node. This saves a large number of queries. The AC automaton consists of a goto table, a fail table and a results table. The goto table is a prefix tree, i.e. a dictionary tree. Given a state, the results table tells whether that state corresponds to one or more pattern strings, and thus whether to output the pattern strings and their corresponding values. The fail table stores a one-to-one relationship between states: for each state, the best state to fall back to after a state transition fails. The best state is the state that retains the longest suffix of the string already matched. The specific data format, represented in JSON, is as follows:
{"root":{"children":[{"children":[{"children":[{"children":[{"children":[{"children":[{"children":[],"failure":{"$ref":"$.root"},"name":"r","results":[{"len":6,"v":"doctor"}]}],"failure":{"$ref":"$.root"},"name":"o","results":[]}],"failure":{"$ref":"$.root"},"name":"t","results":[]}],"failure":{"$ref":"$.root"},"name":"c","results":[]}],"failure":{"$ref":"$.root"},"name":"o","results":[]}],"failure":{"$ref":"$.root"},"name":"d","results":[]},{"children":[{"children":[{"children":[{"children":[],"failure":{"$ref":"$.root"},"name":"e","results":[{"len":4,"v":"name"}]}],"failure":{"$ref":"$.root"},"name":"m","results":[]}],"failure":{"$ref":"$.root"},"name":"a","results":[]}],"failure":{"$ref":"$.root"},"name":"n","results":[]}],"name":"","results":[]}}
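To make the roles of the three tables concrete, the following is a minimal sketch of a textbook Aho-Corasick automaton. It is an illustration only: the class and method names are invented, the tables are stored as parallel lists rather than the nested JSON nodes shown above, and the Chinese values (as well as the extra words user and username) are assumptions.

```python
from collections import deque


class ACAutomaton:
    def __init__(self, dictionary):
        self.goto = [{}]      # goto table: per state, char -> next state (a trie)
        self.fail = [0]       # fail table: state to fall back to on a failed transition
        self.results = [[]]   # results table: (pattern, value) pairs recognised at a state
        for word, value in dictionary.items():
            self._insert(word, value)
        self._build_fail()

    def _insert(self, word, value):
        state = 0
        for ch in word:
            if ch not in self.goto[state]:
                self.goto.append({})
                self.fail.append(0)
                self.results.append([])
                self.goto[state][ch] = len(self.goto) - 1
            state = self.goto[state][ch]
        self.results[state].append((word, value))

    def _build_fail(self):
        # Breadth-first: depth-1 states fall back to the root (state 0).
        queue = deque(self.goto[0].values())
        while queue:
            state = queue.popleft()
            for ch, nxt in self.goto[state].items():
                f = self.fail[state]
                while f and ch not in self.goto[f]:
                    f = self.fail[f]
                self.fail[nxt] = self.goto[f].get(ch, 0)
                # Inherit the patterns that end at the fallback state.
                self.results[nxt] = self.results[nxt] + self.results[self.fail[nxt]]
                queue.append(nxt)

    def find_all(self, text):
        """Return (start, end, pattern, value) for every dictionary word in text."""
        state, hits = 0, []
        for i, ch in enumerate(text):
            while state and ch not in self.goto[state]:
                state = self.fail[state]
            state = self.goto[state].get(ch, 0)
            for word, value in self.results[state]:
                hits.append((i - len(word) + 1, i + 1, word, value))
        return hits


# Dictionary values are assumed Chinese translations (illustrative only).
ac = ACAutomaton({"doctor": "医师", "name": "姓名", "user": "用户", "username": "用户名"})
print(ac.find_all("doctorname"))  # [(0, 6, 'doctor', '医师'), (6, 10, 'name', '姓名')]
```

A single pass over the field name reports every dictionary word it contains, which is what makes the automaton usable as the dictionary behind the word segmentation step.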
S130, acquiring data to be identified.
In this embodiment, the data to be identified refers to data that needs to be identified by a service type.
S140, adopting a pinyin alphabet, a pinyin full-spelling list and an English AC automaton to identify the service type for the data to be identified so as to obtain an identification result.
In this embodiment, the identification result refers to the service type of the data to be identified.
In one embodiment, referring to fig. 5, the step S140 may include steps S141 to S1411.
S141, converting the field names in the data to be identified into lowercase, so as to obtain a conversion result.
In this embodiment, the conversion result refers to a result of converting a field name of data to be recognized into a lowercase letter.
S142, inquiring the corresponding pinyin initial letters from the pinyin alphabet by taking the conversion result as a search condition so as to obtain a first inquiry result;
s143, judging whether the first query result is that a corresponding pinyin initial exists;
S144, if the first query result is that the corresponding pinyin initial exists, determining that the value corresponding to the pinyin initial is a service type, and obtaining an identification result.
Specifically, the pinyin alphabet is queried for a key equal to the conversion result, that is, whether the corresponding pinyin initials exist. If the key exists, the first service type in the value corresponding to those pinyin initials is taken, and the service type is successfully identified.
S145, if the first query result is that the corresponding pinyin initial is not existed, querying the corresponding pinyin from the pinyin full-spelling table by using the converted result as a search condition so as to obtain a second query result;
s146, judging whether the second query result is that the corresponding pinyin exists;
and S147, if the second query result is that the corresponding pinyin exists, determining that the value corresponding to the pinyin is a service type so as to obtain an identification result.
Specifically, matching by initials is more likely to produce mismatches, so the pinyin alphabet is restricted to service types whose names are entirely Chinese and at least four words long. The full-spelling table has no such length restriction. For example, "identification card" does not appear in the initial-letter match, but does appear in the full-spelling match.
In this embodiment, the pinyin full-spelling table is queried for a key equal to the conversion result, that is, whether the corresponding full pinyin spelling exists. If it exists, and the service type was not identified through the pinyin alphabet, the first service type in the value corresponding to that pinyin is taken, and the service type is successfully identified.
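The two-stage lookup of steps S141 to S147 can be sketched as below. The function name and the returned stage label are illustrative assumptions, and the key shenfenzheng is an assumed full spelling for "identification card"; the table contents otherwise follow the formats given earlier.

```python
def identify_by_pinyin(field_name, initial_table, full_table):
    """Return (service_type, stage), or (None, None) if neither table matches."""
    key = field_name.lower()                        # S141: lowercase the field name
    if key in initial_table:                        # S142-S144: pinyin initials first
        return initial_table[key][0], "initial"
    if key in full_table:                           # S145-S147: then full spelling
        return full_table[key][0], "full"
    return None, None                               # fall through to word segmentation


initial_table = {"ysxm": ["doctor name"]}
full_table = {"yishixingming": ["doctor name"],
              "shenfenzheng": ["identification card"]}

print(identify_by_pinyin("YSXM", initial_table, full_table))
# ('doctor name', 'initial')
print(identify_by_pinyin("shenfenzheng", initial_table, full_table))
# ('identification card', 'full')
```

Taking the first entry of the value list follows the text above; a field that matches neither table is passed on to the English word segmentation step.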
S148, word segmentation is carried out on the conversion result through an AC automaton according to a bidirectional longest matching algorithm, so that a word segmentation result is obtained.
In this embodiment, the word segmentation result refers to a result formed by performing word segmentation on the conversion result.
The conversion result is segmented through the AC automaton according to the bidirectional longest matching algorithm. If the segmented words can be found in the English dictionary, the word segmentation succeeds; otherwise the process proceeds to the ending step.
Specifically, bidirectional longest matching is a word segmentation algorithm. For example, username can be split into the candidates user, name and username. The AC automaton is essentially a dictionary, so what is actually checked here is whether user, name and username exist in the dictionary. If all three words are found in the dictionary, then under the longest-match principle username is treated as a single word rather than being split into user and name.
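A minimal sketch of bidirectional longest matching follows. Two simplifications are assumptions of this example rather than statements about the patented method: a plain Python set stands in for the AC automaton as the dictionary, and when the forward and backward passes disagree the segmentation with fewer words is preferred. All function names are illustrative.

```python
def forward_max_match(text, words, max_len):
    """Greedy left-to-right segmentation, longest dictionary word first."""
    tokens, i = [], 0
    while i < len(text):
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in words:
                tokens.append(text[i:j])
                i = j
                break
        else:
            return None  # no dictionary word starts at position i
    return tokens


def backward_max_match(text, words, max_len):
    """Greedy right-to-left segmentation, longest dictionary word first."""
    tokens, j = [], len(text)
    while j > 0:
        for i in range(max(0, j - max_len), j):
            if text[i:j] in words:
                tokens.append(text[i:j])
                j = i
                break
        else:
            return None  # no dictionary word ends at position j
    return tokens[::-1]


def bidirectional_segment(text, words):
    """Return the preferred segmentation, or None if segmentation fails."""
    if not words:
        return None
    max_len = max(len(w) for w in words)
    candidates = [t for t in (forward_max_match(text, words, max_len),
                              backward_max_match(text, words, max_len)) if t]
    # Assumed tie-break: the segmentation with fewer words wins.
    return min(candidates, key=len) if candidates else None


words = {"user", "name", "username", "doctor"}
print(bidirectional_segment("username", words))    # ['username']
print(bidirectional_segment("doctorname", words))  # ['doctor', 'name']
```

A return value of None corresponds to the failed-segmentation case, in which the process goes to the ending step.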
S149, determining a vector for the Chinese corresponding to the word segmentation result to obtain a field vector.
In this embodiment, the field vector refers to a result obtained by vector determination of chinese corresponding to the word segmentation result.
The Chinese corresponding to each segmented word is looked up in the built-in vector dictionary to find its vector, and Chinese for which no vector is found is ignored. The built-in vector dictionary has the format:
Name -0.011410 0.118206 0.360038 0.622496 -0.183756 0.249871 ……
Doctor 0.236311 -0.211533 -0.126339 0.352886 0.314047 0.072861 ……
Specifically, built-in means that the dictionary is pre-trained and defined in the system. The format of an entry in the vector dictionary is: Chinese, vector 1, vector 2, vector ……. Each vector value is a feature that describes the Chinese. For example, the number of characters and the total number of strokes of the Chinese could serve as features; this is merely an example and not a limitation.
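Reading the whitespace-separated format above and looking up vectors while ignoring missing entries can be sketched as follows. The function names are illustrative assumptions, and the sample lines reuse only the first three values of the two entries shown above.

```python
def parse_vector_dictionary(lines):
    """Parse lines of the form '<term> <v1> <v2> ...' into {term: [floats]}."""
    vectors = {}
    for line in lines:
        parts = line.split()
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


def lookup_vectors(terms, vectors):
    """Return the vectors of the terms that exist; terms without a vector are ignored."""
    return [vectors[t] for t in terms if t in vectors]


sample = ["Name -0.011410 0.118206 0.360038",
          "Doctor 0.236311 -0.211533 -0.126339"]
vectors = parse_vector_dictionary(sample)
print(lookup_vectors(["Doctor", "Unknown"], vectors))
# [[0.236311, -0.211533, -0.126339]]
```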
S1410, calculating the similarity between the field vector and the vector corresponding to the loading result;
S1411, screening the loading results with the similarity meeting the requirements to obtain the identification result.
In this embodiment, all service types are traversed; each service type name is segmented, the corresponding vectors are looked up, and Chinese for which no vector is found is ignored. A preliminary similarity calculation is then performed between the field vector and the vector of each service type; specifically, the similarity algorithm is cosine similarity. For example, in a two-dimensional coordinate system, the cosine similarity of A(x1, y1) and B(x2, y2) is $\cos(A, B) = \dfrac{x_1 x_2 + y_1 y_2}{\sqrt{x_1^2 + y_1^2}\,\sqrt{x_2^2 + y_2^2}}$. If the similarity reaches the similarity threshold, and the service type was not identified by the pinyin initial or pinyin full-spelling identification, the identification succeeds.
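Steps S1410 and S1411 can be sketched as follows. This is an illustration under assumptions: the threshold value 0.8, the function names and the toy two-dimensional vectors are all hypothetical, and how several word vectors are combined into one field vector is not specified in the text, so the example starts from ready-made vectors.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def best_match(field_vector, type_vectors, threshold=0.8):
    """Return the service type with the highest similarity at or above the threshold."""
    best_name, best_score = None, threshold
    for name, vec in type_vectors.items():
        score = cosine_similarity(field_vector, vec)
        if score >= best_score:
            best_name, best_score = name, score
    return best_name


# Toy vectors (assumption) standing in for service type vectors.
type_vectors = {"doctor name": [0.9, 0.1], "identification card": [0.0, 1.0]}
print(best_match([1.0, 0.0], type_vectors))  # doctor name
```

Returning None when no service type reaches the threshold corresponds to the case in which identification does not succeed.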
S150, outputting the identification result.
In this embodiment, when a field has no field comment and its sample data is difficult to identify by features such as dictionaries and regular expressions, the service type is identified from the field name through pinyin initials, pinyin full spelling, English translation and NLP similarity calculation. In essence, the service type is identified from the naming rules of field names. The method achieves good results, reduces manual intervention, improves the recognition rate and accuracy for fields, and lays a solid foundation for subsequent classification and grading.
The method of this embodiment also uses the field name as a path for identifying the service type, because a field name carries certain semantics and rationality and is related to the meaning of the field.
According to the data service type identification method, the pinyin alphabet, the pinyin full-spelling table and the English AC automaton constructed from the metadata are used to identify and determine the service type of the data to be identified, so that the identification accuracy and the identification efficiency of the data service type are improved, and the labor cost is reduced.
Fig. 6 is a schematic block diagram of a data service type identification apparatus 300 according to an embodiment of the present invention. As shown in fig. 6, the present invention further provides a data service type identification apparatus 300 corresponding to the above data service type identification method. The data service type identification apparatus 300 includes units for performing the above data service type identification method, and may be configured in a server. Specifically, referring to fig. 6, the data service type identification apparatus 300 includes a metadata acquisition unit 301, a construction unit 302, a data to be identified acquisition unit 303, a type identification unit 304, and an output unit 305.
A metadata acquisition unit 301 for acquiring metadata; a construction unit 302, configured to construct a pinyin alphabet, a pinyin full-pinyin table, and an english AC automaton; a data to be identified acquisition unit 303, configured to acquire data to be identified; the type identifying unit 304 is configured to identify a service type by using a pinyin alphabet, a pinyin full-spelling table and an english AC automaton for the data to be identified, so as to obtain an identification result; and an output unit 305 for outputting the identification result.
In one embodiment, as shown in FIG. 7, the build unit 302 includes a type loading subunit 3021, a first build subunit 3022, a second build subunit 3023, and a third build subunit 3024.
A type loading subunit 3021, configured to load all service types to obtain a loading result; a first construction subunit 3022, configured to construct a pinyin alphabet according to the loading result; a second construction subunit 3023, configured to construct a pinyin full-spelling table according to the loading result; and a third construction subunit 3024, configured to construct an english AC automaton according to the english dictionary in combination with the loading result.
In one embodiment, as shown in fig. 8, the first construction subunit 3022 includes a selection module 30221, a splitting module 30222, and a letter splicing module 30223.
A selecting module 30221, configured to select service terms whose service term names are chinese and whose number of words is not less than four in the loading result, so as to obtain a screening result; the splitting module 30222 is configured to split the screening result according to each Chinese character to generate corresponding pinyin respectively; and the letter splicing module 30223 is used for selecting the initial letters in the pinyin and splicing the initial letters, and reserving the mapping relation between the initial letters and the service terms so as to obtain a pinyin alphabet.
In one embodiment, as shown in fig. 9, the type recognition unit 304 includes a conversion subunit 3041, a first query subunit 3042, a first judging subunit 3043, a first determining subunit 3044, a second query subunit 3045, a second judging subunit 3046, a second determining subunit 3047, a word segmentation subunit 3048, a vector determining subunit 3049, a calculating subunit 30410, and a screening subunit 30411.
A conversion subunit 3041, configured to convert a field name in the data to be identified into lowercase, so as to obtain a conversion result; a first query subunit 3042, configured to query the pinyin alphabet for the corresponding pinyin initials using the conversion result as a search condition, so as to obtain a first query result; a first judging subunit 3043, configured to judge whether the first query result is that the corresponding pinyin initials exist; and a first determining subunit 3044, configured to determine, if the first query result is that the corresponding pinyin initials exist, that the value corresponding to the pinyin initials is the service type, so as to obtain an identification result. A second query subunit 3045, configured to query, if the first query result is that the corresponding pinyin initials do not exist, the pinyin full-spelling table for the corresponding pinyin using the conversion result as a search condition, so as to obtain a second query result; a second judging subunit 3046, configured to judge whether the second query result is that the corresponding pinyin exists; and a second determining subunit 3047, configured to determine, if the second query result is that the corresponding pinyin exists, that the value corresponding to the pinyin is the service type, so as to obtain an identification result.
The word segmentation subunit 3048 is configured to segment the conversion result according to a bidirectional longest matching algorithm through an AC automaton, so as to obtain a word segmentation result; a vector determining subunit 3049, configured to determine a vector for the chinese corresponding to the word segmentation result, so as to obtain a field vector; a calculating subunit 30410, configured to calculate a similarity between the field vector and a vector corresponding to the loading result; and a screening subunit 30411, configured to screen the loading result with the similarity meeting the requirement, so as to obtain the identification result.
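The order in which these subunits are consulted can be pictured as a cascade of lookups. The sketch below is only an illustration under stated assumptions: the function `identify_service_type`, the stage labels, and the optional `fallback` callable (standing in for the segmentation and similarity stage) are hypothetical names, and the tables are toy data.

```python
def identify_service_type(field_name, initials_table, full_pinyin_table, fallback=None):
    """Return (service_type, stage) for a field name, or (None, "unmatched")."""
    key = field_name.lower()  # the "conversion result"

    # Stage 1: look the lowercase name up in the pinyin alphabet (initials table).
    if key in initials_table:
        return initials_table[key], "initials"

    # Stage 2: fall back to the pinyin full-spelling table.
    if key in full_pinyin_table:
        return full_pinyin_table[key], "full_pinyin"

    # Stage 3: hypothetical hook for segmentation plus vector similarity.
    if fallback is not None:
        result = fallback(key)
        if result is not None:
            return result, "semantic"

    return None, "unmatched"


initials_table = {"khxm": "客户姓名"}
full_pinyin_table = {"kehuxingming": "客户姓名", "shoujihao": "手机号"}

r1 = identify_service_type("KHXM", initials_table, full_pinyin_table)
r2 = identify_service_type("ShouJiHao", initials_table, full_pinyin_table)
r3 = identify_service_type("user_name", initials_table, full_pinyin_table)
```

Returning the stage alongside the result is a design choice of this sketch; it makes it easy to see which table produced a match when auditing identification results.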
It should be noted that, as will be clearly understood by those skilled in the art, the specific implementation process of the data service type identifying apparatus 300 and each unit may refer to the corresponding description in the foregoing method embodiment, and for convenience and brevity of description, the description is omitted here.
The above-described data service type identification apparatus 300 may be implemented in the form of a computer program that can be run on a computer device as shown in FIG. 10.
Referring to fig. 10, fig. 10 is a schematic block diagram of a computer device according to an embodiment of the present application. The computer device 500 may be a server, where the server may be a stand-alone server or may be a server cluster formed by a plurality of servers.
With reference to FIG. 10, the computer device 500 includes a processor 502, memory, and a network interface 505 connected by a system bus 501, where the memory may include a non-volatile storage medium 503 and an internal memory 504.
The non-volatile storage medium 503 may store an operating system 5031 and a computer program 5032. The computer program 5032 includes program instructions that, when executed, cause the processor 502 to perform a data service type identification method.
The processor 502 is used to provide computing and control capabilities to support the operation of the overall computer device 500.
The internal memory 504 provides an environment for running the computer program 5032 stored in the non-volatile storage medium 503; when executed by the processor 502, the computer program 5032 causes the processor 502 to perform a data service type identification method.
The network interface 505 is used for network communication with other devices. It will be appreciated by those skilled in the art that the structure shown in FIG. 10 is merely a block diagram of some of the structures associated with the present inventive arrangements and does not constitute a limitation of the computer device 500 to which the present inventive arrangements may be applied, and that a particular computer device 500 may include more or fewer components than shown, or may combine certain components, or may have a different arrangement of components.
Wherein the processor 502 is configured to execute a computer program 5032 stored in a memory to implement the steps of:
Acquiring metadata; constructing a pinyin alphabet, a pinyin full-spelling list and an English AC automaton; acquiring data to be identified; identifying service types of the data to be identified by adopting a pinyin alphabet, a pinyin full-spelling list and an English AC automaton so as to obtain an identification result; and outputting the identification result.
In one embodiment, when implementing the step of constructing the pinyin alphabet, the pinyin full-spelling table and the English AC automaton, the processor 502 specifically implements the following steps:
Loading all service types to obtain a loading result; constructing a pinyin alphabet according to the loading result; constructing a pinyin full-spelling table according to the loading result; and constructing an English AC automaton according to the English dictionary and the loading result.
The English AC automaton comprises a goto table, a fail table and a result table. The goto table is a dictionary tree (trie). Given a state, the result table determines whether that state corresponds to one or more pattern strings and, if so, outputs the pattern strings and their corresponding values. The fail table stores a one-to-one mapping between states, recording the state to fall back to when a state transition fails.
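The three tables can be illustrated with a compact Aho-Corasick sketch. It is a generic textbook construction offered under assumptions, not the patented implementation: the function names, the list-of-dicts layout, and the sample English-to-Chinese entries are hypothetical.

```python
from collections import deque


def build_automaton(patterns):
    """Build goto, fail and result tables from a {pattern: value} mapping."""
    goto = [{}]     # goto[state][char] -> next state (a trie)
    result = [[]]   # result[state] -> (pattern, value) pairs ending at state
    for pattern, value in patterns.items():
        state = 0
        for ch in pattern:
            if ch not in goto[state]:
                goto.append({})
                result.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        result[state].append((pattern, value))

    fail = [0] * len(goto)            # fail[state] -> state to fall back to
    queue = deque(goto[0].values())   # depth-1 states fall back to the root
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            # Inherit the patterns that end at the fallback state.
            result[nxt] = result[nxt] + result[fail[nxt]]
    return goto, fail, result


def search(text, goto, fail, result):
    """Return (start_index, pattern, value) for every pattern occurrence."""
    state, hits = 0, []
    for i, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]       # roll back after a failed transition
        state = goto[state].get(ch, 0)
        for pattern, value in result[state]:
            hits.append((i - len(pattern) + 1, pattern, value))
    return hits


patterns = {"user": "用户", "name": "姓名", "id": "标识", "username": "用户名"}
tables = build_automaton(patterns)
hits = search("username_id", *tables)
```

A single pass over the field name reports every dictionary word it contains, including overlapping ones such as "user", "username" and "name", which is what makes the automaton convenient as the lookup structure for a later segmentation step.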
In one embodiment, when implementing the step of constructing the pinyin alphabet according to the loading result, the processor 502 specifically implements the following steps: selecting, from the loading result, the service terms whose names are Chinese and contain no fewer than four characters, so as to obtain a screening result; splitting each term in the screening result into individual Chinese characters and generating the pinyin of each character; and taking the first letter of each pinyin, splicing the first letters together, and retaining the mapping between the spliced initials and the service term, so as to obtain the pinyin alphabet.
In an embodiment, when the processor 502 implements the step of identifying the service type by using the pinyin alphabet, the pinyin full-spelling table and the english AC automaton on the data to be identified to obtain an identification result, the following steps are specifically implemented:
Converting the field names in the data to be identified into lowercase so as to obtain a conversion result; inquiring the corresponding pinyin initial letters from the pinyin alphabet by taking the conversion result as a search condition so as to obtain a first inquiry result; judging whether the first query result is that a corresponding pinyin initial exists; if the first query result is that the corresponding pinyin initial exists, determining that the value corresponding to the pinyin initial is a service type so as to obtain an identification result.
In one embodiment, after implementing the step of judging whether the first query result indicates that the corresponding pinyin initials exist, the processor 502 further implements the following steps:
If the first query result indicates that the corresponding pinyin initials do not exist, querying the pinyin full-spelling table for the corresponding pinyin, using the conversion result as the search condition, so as to obtain a second query result; judging whether the second query result indicates that the corresponding pinyin exists; and if the second query result indicates that the corresponding pinyin exists, determining that the value corresponding to the pinyin is the service type, so as to obtain an identification result.
In one embodiment, after implementing the step of judging whether the second query result indicates that the corresponding pinyin exists, the processor 502 further implements the following steps:
Segmenting the conversion result through the AC automaton according to a bidirectional longest matching algorithm, so as to obtain a word segmentation result; determining a vector for the Chinese corresponding to the word segmentation result, so as to obtain a field vector; calculating the similarity between the field vector and the vector corresponding to the loading result; and screening out the loading results whose similarity meets the requirement, so as to obtain the identification result.
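Bidirectional longest matching can be sketched as running a forward and a backward greedy segmentation and keeping the better of the two. The version below is a simplified illustration under assumptions: it checks dictionary membership with a plain Python set rather than an AC automaton, and the tie-breaking rule (fewer tokens, then fewer single-character tokens, then the backward result) is a common heuristic that the text does not specify.

```python
def forward_max_match(text, vocab, max_len):
    """Greedy left-to-right segmentation, longest dictionary word first."""
    tokens, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # A single character is always accepted so the scan can advance.
            if size == 1 or piece in vocab:
                tokens.append(piece)
                i += size
                break
    return tokens


def backward_max_match(text, vocab, max_len):
    """Greedy right-to-left segmentation, longest dictionary word first."""
    tokens, j = [], len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            piece = text[j - size:j]
            if size == 1 or piece in vocab:
                tokens.append(piece)
                j -= size
                break
    return tokens[::-1]


def bidirectional_max_match(text, vocab):
    """Pick the forward or backward segmentation with the lower cost."""
    max_len = max(map(len, vocab))
    fwd = forward_max_match(text, vocab, max_len)
    bwd = backward_max_match(text, vocab, max_len)

    def cost(tokens):
        # Fewer tokens first, then fewer single-character leftovers.
        return (len(tokens), sum(len(t) == 1 for t in tokens))

    return fwd if cost(fwd) < cost(bwd) else bwd


vocab = {"user", "name", "rname", "car", "card", "date"}
tokens_a = bidirectional_max_match("username", vocab)
tokens_b = bidirectional_max_match("cardate", vocab)
```

The two sample inputs are chosen so that each direction wins once: the forward pass gives the cleaner split for "username", while the backward pass does so for "cardate".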
It should be appreciated that in embodiments of the present application, the processor 502 may be a central processing unit (CPU), or may be another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. The general-purpose processor may be a microprocessor or any conventional processor.
Those skilled in the art will appreciate that all or part of the flow in a method embodying the above described embodiments may be accomplished by computer programs instructing the relevant hardware. The computer program comprises program instructions, and the computer program can be stored in a storage medium, which is a computer readable storage medium. The program instructions are executed by at least one processor in the computer system to implement the flow steps of the embodiments of the method described above.
Accordingly, the present invention also provides a storage medium. The storage medium may be a computer readable storage medium. The storage medium stores a computer program which, when executed by a processor, causes the processor to perform the steps of:
Acquiring metadata; constructing a pinyin alphabet, a pinyin full-spelling list and an English AC automaton; acquiring data to be identified; identifying service types of the data to be identified by adopting a pinyin alphabet, a pinyin full-spelling list and an English AC automaton so as to obtain an identification result; and outputting the identification result.
In one embodiment, when the processor executes the computer program to implement the step of constructing the pinyin alphabet, the pinyin full-spelling table and the English AC automaton, the following steps are specifically implemented:
Loading all service types to obtain a loading result; constructing a pinyin alphabet according to the loading result; constructing a pinyin full-spelling table according to the loading result; and constructing an English AC automaton according to the English dictionary and the loading result.
The English AC automaton comprises a goto table, a fail table and a result table. The goto table is a dictionary tree (trie). Given a state, the result table determines whether that state corresponds to one or more pattern strings and, if so, outputs the pattern strings and their corresponding values. The fail table stores a one-to-one mapping between states, recording the state to fall back to when a state transition fails.
In one embodiment, when the processor executes the computer program to implement the step of building the pinyin alphabet according to the loading result, the processor specifically implements the following steps:
Selecting, from the loading result, the service terms whose names are Chinese and contain no fewer than four characters, so as to obtain a screening result; splitting each term in the screening result into individual Chinese characters and generating the pinyin of each character; and taking the first letter of each pinyin, splicing the first letters together, and retaining the mapping between the spliced initials and the service term, so as to obtain the pinyin alphabet.
In an embodiment, when the processor executes the computer program to implement the step of identifying the service type by using a pinyin alphabet, a pinyin full-spelling table and an english AC automaton on the data to be identified, the method specifically includes the following steps:
Converting the field names in the data to be identified into lowercase so as to obtain a conversion result; inquiring the corresponding pinyin initial letters from the pinyin alphabet by taking the conversion result as a search condition so as to obtain a first inquiry result; judging whether the first query result is that a corresponding pinyin initial exists; if the first query result is that the corresponding pinyin initial exists, determining that the value corresponding to the pinyin initial is a service type so as to obtain an identification result.
In one embodiment, after executing the computer program to implement the step of judging whether the first query result indicates that the corresponding pinyin initials exist, the processor further implements the following steps:
If the first query result indicates that the corresponding pinyin initials do not exist, querying the pinyin full-spelling table for the corresponding pinyin, using the conversion result as the search condition, so as to obtain a second query result; judging whether the second query result indicates that the corresponding pinyin exists; and if the second query result indicates that the corresponding pinyin exists, determining that the value corresponding to the pinyin is the service type, so as to obtain an identification result.
In one embodiment, after executing the computer program to implement the step of judging whether the second query result indicates that the corresponding pinyin exists, the processor further implements the following steps:
The conversion result is segmented through an AC automaton according to a bidirectional longest matching algorithm, so that a segmentation result is obtained; determining a vector for the Chinese corresponding to the word segmentation result to obtain a field vector; calculating the similarity between the field vector and the vector corresponding to the loading result; and screening the loading results with the similarity meeting the requirements to obtain the identification result.
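The similarity calculation and screening step can be sketched as below. The text does not name the similarity measure or the threshold, so cosine similarity and a cutoff of 0.8 are assumptions made purely for illustration, as are the three-dimensional toy vectors and the function names.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def screen_service_types(field_vector, type_vectors, threshold=0.8):
    """Keep service types whose similarity reaches the threshold, best first."""
    scored = [
        (name, cosine_similarity(field_vector, vec))
        for name, vec in type_vectors.items()
    ]
    kept = [(name, score) for name, score in scored if score >= threshold]
    return sorted(kept, key=lambda item: item[1], reverse=True)


# Hypothetical toy vectors; a real system would obtain them from an embedding model.
type_vectors = {
    "客户姓名": [1.0, 0.0, 0.0],
    "手机号": [0.0, 1.0, 0.0],
    "用户名": [0.9, 0.1, 0.0],
}
field_vector = [1.0, 0.0, 0.0]
matches = screen_service_types(field_vector, type_vectors)
```

Returning a ranked list rather than a single best match is a choice of this sketch; it leaves room for a reviewer or a downstream rule to pick among close candidates.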
The storage medium may be a U-disk, a removable hard disk, a Read-Only Memory (ROM), a magnetic disk, or an optical disk, or other various computer-readable storage media that can store program codes.
Those of ordinary skill in the art will appreciate that the elements and algorithm steps described in connection with the embodiments disclosed herein may be embodied in electronic hardware, in computer software, or in a combination of the two, and that the elements and steps of the examples have been generally described in terms of function in the foregoing description to clearly illustrate the interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
In the several embodiments provided by the present invention, it should be understood that the disclosed apparatus and method may be implemented in other manners. For example, the device embodiments described above are merely illustrative. For example, the division of each unit is only one logic function division, and there may be another division manner in actual implementation. For example, multiple units or components may be combined or may be integrated into another system, or some features may be omitted, or not performed.
The steps in the method of the embodiment of the invention can be sequentially adjusted, combined and deleted according to actual needs. The units in the device of the embodiment of the invention can be combined, divided and deleted according to actual needs. In addition, each functional unit in the embodiments of the present invention may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit.
If the integrated unit is implemented in the form of a software functional unit and sold or used as a stand-alone product, it may be stored in a storage medium. Based on such understanding, the technical solution of the present invention in essence, or the part of it that contributes over the prior art, or all or part of the technical solution, may be embodied in the form of a software product stored in a storage medium and comprising several instructions for causing a computer device (which may be a personal computer, a terminal, a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present invention.
While the invention has been described with reference to certain preferred embodiments, it will be understood by those skilled in the art that various changes and equivalent substitutions may be made without departing from the scope of the invention. Therefore, the protection scope of the invention is subject to the protection scope of the claims.

Claims (10)

1. The data service type identification method is characterized by comprising the following steps:
Acquiring metadata;
Constructing a pinyin alphabet, a pinyin full-spelling list and an English AC automaton;
Acquiring data to be identified;
identifying service types of the data to be identified by adopting a pinyin alphabet, a pinyin full-spelling list and an English AC automaton so as to obtain an identification result;
And outputting the identification result.
2. The data service type identification method according to claim 1, wherein said constructing a pinyin alphabet, a pinyin full-spelling table and an English AC automaton comprises:
Loading all service types to obtain a loading result;
Constructing a pinyin alphabet according to the loading result;
constructing a pinyin full-spelling table according to the loading result;
And constructing an English AC automaton according to the English dictionary and the loading result.
3. The data service type identification method according to claim 2, wherein said constructing a pinyin alphabet according to said loading result comprises:
Selecting, from the loading result, the service terms whose names are Chinese and contain no fewer than four characters, so as to obtain a screening result;
splitting each term in the screening result into individual Chinese characters and generating the pinyin of each character;
And taking the first letter of each pinyin, splicing the first letters together, and retaining the mapping between the spliced initials and the service term, so as to obtain the pinyin alphabet.
4. The data service type identification method according to claim 2, wherein the English AC automaton includes a goto table, a fail table and a result table; the goto table is a dictionary tree; given a state, the result table determines whether that state corresponds to one or more pattern strings and, if so, outputs the pattern strings and their corresponding values; the fail table stores a one-to-one mapping between states, recording the state to fall back to when a state transition fails.
5. The data service type identification method according to claim 2, wherein said identifying the service type of the data to be identified by using the pinyin alphabet, the pinyin full-spelling table and the English AC automaton to obtain an identification result comprises:
converting the field names in the data to be identified into lowercase so as to obtain a conversion result;
inquiring the corresponding pinyin initial letters from the pinyin alphabet by taking the conversion result as a search condition so as to obtain a first inquiry result;
Judging whether the first query result is that a corresponding pinyin initial exists;
If the first query result is that the corresponding pinyin initial exists, determining that the value corresponding to the pinyin initial is a service type so as to obtain an identification result.
6. The data service type identification method according to claim 5, wherein after said judging whether the first query result is that a corresponding pinyin initial exists, the method further comprises:
If the first query result is that the corresponding pinyin initial does not exist, querying the corresponding pinyin from the pinyin full-spelling table by using the conversion result as a search condition so as to obtain a second query result;
judging whether the second query result is that the corresponding pinyin exists or not;
and if the second query result is that the corresponding pinyin exists, determining that the value corresponding to the pinyin is a service type so as to obtain a recognition result.
7. The data service type identification method according to claim 6, wherein after said judging whether the second query result is that the corresponding pinyin exists, the method further comprises:
The conversion result is segmented through an AC automaton according to a bidirectional longest matching algorithm, so that a segmentation result is obtained;
determining a vector for the Chinese corresponding to the word segmentation result to obtain a field vector;
Calculating the similarity between the field vector and the vector corresponding to the loading result;
and screening the loading results with the similarity meeting the requirements to obtain the identification result.
8. A data service type identification apparatus, comprising:
A metadata acquisition unit configured to acquire metadata;
The construction unit is used for constructing a pinyin alphabet, a pinyin full-spelling list and an English AC automaton;
the data acquisition unit to be identified is used for acquiring the data to be identified;
the type identification unit is used for identifying the service type by adopting a pinyin alphabet, a pinyin full-spelling list and an English AC automaton to the data to be identified so as to obtain an identification result;
and the output unit is used for outputting the identification result.
9. A computer device, characterized in that it comprises a memory on which a computer program is stored and a processor which, when executing the computer program, implements the method according to any of claims 1-7.
10. A storage medium storing a computer program which, when executed by a processor, implements the method of any one of claims 1 to 7.
CN202311823350.3A 2023-12-27 2023-12-27 Data service type identification method, device, computer equipment and storage medium Pending CN117952092A (en)

Publications (1)

Publication Number Publication Date
CN117952092A true CN117952092A (en) 2024-04-30

Family

ID=90797081

Country Status (1)

Country Link
CN (1) CN117952092A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination