CN113177407A - Data dictionary construction method and device, computer equipment and storage medium - Google Patents


Info

Publication number
CN113177407A
Authority
CN
China
Prior art keywords
basic
data dictionary
participles
word segmentation
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110737090.2A
Other languages
Chinese (zh)
Inventor
彭康康
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Puhui Enterprise Management Co Ltd
Original Assignee
Ping An Puhui Enterprise Management Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Puhui Enterprise Management Co Ltd
Priority to CN202110737090.2A
Publication of CN113177407A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/237 Lexical tools
    • G06F40/242 Dictionaries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking

Abstract

The application relates to the technical field of data processing, and discloses a method, a device, computer equipment and a storage medium for constructing a data dictionary. The method comprises: obtaining a data dictionary generation request from a user side; performing word segmentation processing on sample data to obtain initial word segmentation, and extracting the initial word segmentation whose part of speech is a noun as basic participles; performing synonym combination on the basic participles based on the similarity values of any two basic participles, and screening out the basic participles having an association relationship as associated participles; determining the dependency relationship of the associated participles, and filling the associated participles into a preset data dictionary according to the dependency relationship to obtain a basic data dictionary; returning the basic data dictionary to the user side to obtain a modified basic data dictionary; and determining a target data dictionary by comparing matching results. The present application also relates to blockchain techniques, with the sample data stored in a blockchain. According to the method and the device, accurate extraction of the data is achieved, and the data accuracy of the data dictionary is improved.

Description

Data dictionary construction method and device, computer equipment and storage medium
Technical Field
The present application relates to the field of data processing technologies, and in particular, to a method and an apparatus for constructing a data dictionary, a computer device, and a storage medium.
Background
A data dictionary generally refers to a collection of definitions and descriptions of the data items, data structures, data streams, data stores, etc. of data; that is, a collection of descriptions of the data objects or items in a data model. A data dictionary is usually maintained for one type of data, and its data storage structure is set according to the type, size, and the like of that data. For example, a data dictionary for storing telephone numbers and a data dictionary for storing entity information may each adopt a different, correspondingly configured data storage structure.
The table names and table fields of existing data dictionaries are often defined by developers according to business scenarios, personal experience, personal habits and English proficiency (table names are generally in English). As a result, table field names are neither standardized nor uniform, fields with the same meaning may be given several different names, and the data dictionary is easily modified at will. A data dictionary for the same data sample therefore often exists in multiple versions, which easily causes data errors and makes the generated data dictionary inaccurate. There is a need for a method that can improve the data accuracy of a data dictionary.
Disclosure of Invention
The embodiment of the application aims to provide a method and a device for constructing a data dictionary, computer equipment and a storage medium, so as to improve the data accuracy of the data dictionary.
In order to solve the above technical problem, an embodiment of the present application provides a method for constructing a data dictionary, including:
acquiring a data dictionary generation request from a user side, wherein the data dictionary generation request comprises sample data selected by the user side;
performing word segmentation processing on the sample data to obtain initial word segmentation, and extracting initial word segmentation with part of speech as nouns from the initial word segmentation to be used as basic word segmentation;
based on the similarity values of any two basic participles, carrying out synonym combination on the basic participles, and screening out the basic participles with the association relationship to serve as association participles;
determining the dependency relationship of the associated participles, and filling the associated participles into a preset data dictionary according to the dependency relationship to obtain a basic data dictionary;
returning the basic data dictionary to the user side to obtain a modified basic data dictionary returned by the user side;
matching the basic data dictionary with the modified basic data dictionary to obtain a matching result;
and if the matching result is that the basic data dictionary and the modified basic data dictionary are successfully matched, taking the modified basic data dictionary as a target data dictionary.
In order to solve the foregoing technical problem, an embodiment of the present application provides a data dictionary constructing apparatus, including:
the data dictionary generation request module is used for acquiring a data dictionary generation request from a user side, wherein the data dictionary generation request comprises sample data selected by the user side;
the basic word segmentation extraction module is used for carrying out word segmentation processing on the sample data to obtain initial word segmentation, and extracting initial word segmentation with part of speech as nouns from the initial word segmentation to serve as basic word segmentation;
the relation participle screening module is used for carrying out synonym combination on the basic participles based on the similarity values of any two basic participles, screening out the basic participles with the association relation and taking the basic participles as association participles;
a basic data dictionary obtaining module, configured to determine a dependency relationship of the associated participles, and fill the associated participles into a preset data dictionary according to the dependency relationship, so as to obtain a basic data dictionary;
the basic data dictionary returning module is used for returning the basic data dictionary to the user side so as to obtain the modified basic data dictionary returned by the user side;
the matching result generation module is used for matching the basic data dictionary with the modified basic data dictionary to obtain a matching result;
and the target data dictionary determining module is used for taking the modified basic data dictionary as the target data dictionary if the matching result is that the basic data dictionary and the modified basic data dictionary are successfully matched.
In order to solve the above technical problem, the invention adopts the following technical scheme: a computer device is provided that includes one or more processors, and a memory for storing one or more programs which, when executed, cause the one or more processors to implement the method of constructing a data dictionary as described in any one of the above.
In order to solve the above technical problem, the invention adopts the following technical scheme: a computer-readable storage medium is provided, having stored thereon a computer program which, when executed by a processor, implements the method of constructing a data dictionary as described in any one of the above.
The embodiment of the invention provides a method and a device for constructing a data dictionary, computer equipment and a storage medium. According to the embodiment of the invention, a data dictionary generation request of the user side is acquired, word segmentation processing is performed on the sample data to obtain initial word segmentation, and the initial word segmentation whose part of speech is a noun is extracted as basic participles, so that the nouns in the sample are accurately extracted and the subsequent analysis of associated participles is facilitated. Based on the similarity values of any two basic participles, synonym combination is performed on the basic participles and the basic participles having an association relationship are screened out as associated participles, so that the relationships among the basic participles are obtained and data accuracy is improved. The dependency relationship of the associated participles is then determined, and the associated participles are filled into the preset data dictionary according to the dependency relationship to obtain a basic data dictionary. The basic data dictionary is returned to the user side to obtain the modified basic data dictionary, whether the source file of the basic data dictionary has been modified is judged by comparing matching results, and the target data dictionary is determined.
Drawings
In order to more clearly illustrate the solution of the present application, the drawings needed for describing the embodiments of the present application will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present application, and that other drawings can be obtained by those skilled in the art without inventive effort.
FIG. 1 is a schematic application environment diagram of a method for constructing a data dictionary according to an embodiment of the present application;
FIG. 2 is a flow chart of an implementation of a method for constructing a data dictionary according to an embodiment of the present application;
FIG. 3 is a flowchart of an implementation of a sub-process in a method for constructing a data dictionary according to an embodiment of the present application;
FIG. 4 is a flowchart of another implementation of a sub-process in a method for constructing a data dictionary according to an embodiment of the present application;
FIG. 5 is a flowchart of another implementation of a sub-process in a method for constructing a data dictionary according to an embodiment of the present application;
FIG. 6 is a flowchart of another implementation of a sub-process in a method for constructing a data dictionary according to an embodiment of the present application;
FIG. 7 is a flowchart of another implementation of a sub-process in a method for constructing a data dictionary according to an embodiment of the present application;
FIG. 8 is a schematic diagram of a data dictionary construction device provided in an embodiment of the present application;
FIG. 9 is a schematic diagram of a computer device provided in an embodiment of the present application.
Detailed Description
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs; the terminology used in the description of the application herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application; the terms "including" and "having," and any variations thereof, in the description and claims of this application and the description of the above figures are intended to cover non-exclusive inclusions. The terms "first," "second," and the like in the description and claims of this application or in the above-described drawings are used for distinguishing between different objects and not for describing a particular order.
Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the application. The appearances of the phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. It is explicitly and implicitly understood by one skilled in the art that the embodiments described herein can be combined with other embodiments.
In order to make the technical solutions better understood by those skilled in the art, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the accompanying drawings.
The present invention will be described in detail below with reference to the accompanying drawings and embodiments.
Referring to fig. 1, the system architecture 100 may include terminal devices 101, 102, 103, a network 104, and a server 105. The network 104 serves as a medium for providing communication links between the terminal devices 101, 102, 103 and the server 105. Network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, to name a few.
The user may use the terminal devices 101, 102, 103 to interact with the server 105 via the network 104 to receive or send messages or the like. The terminal devices 101, 102, 103 may have installed thereon various communication client applications, such as a web browser application, a search-type application, an instant messaging tool, and the like.
The terminal devices 101, 102, 103 may be various electronic devices having a display screen and supporting web browsing, including but not limited to smart phones, tablet computers, laptop portable computers, desktop computers, and the like.
The server 105 may be a server providing various services, such as a background server providing support for pages displayed on the terminal devices 101, 102, 103.
It should be noted that the method for constructing the data dictionary provided in the embodiment of the present application is generally executed by a server, and accordingly, the apparatus for constructing the data dictionary is generally configured in the server.
It should be understood that the number of terminal devices, networks, and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
Referring to fig. 2, fig. 2 shows a specific embodiment of a method for constructing a data dictionary.
It should be noted that, provided substantially the same result is obtained, the method of the present invention is not limited to the flow sequence shown in fig. 2. The method includes the following steps:
S1: Acquiring a data dictionary generation request from the user side, wherein the data dictionary generation request comprises sample data selected by the user side.
In the embodiments of the present application, to make the technical solution clearer, the terminals involved in the present application are described in detail below.
First, the server can receive a data dictionary generation request from the user side, the request containing the sample data selected by the user side. The server extracts the corresponding sample data from a database, analyzes it, generates a basic data dictionary and returns the basic data dictionary to the user side. The server then receives the modified basic data dictionary returned by the user side and judges whether the source file information of the basic data dictionary has been modified, thereby confirming the final target data dictionary. The server may also return the confirmed target data dictionary to the user side.
Second, the user side can select certain sample data, generate a data dictionary generation request and send it to the server; receive the basic data dictionary returned by the server; modify the basic data dictionary; and return the modified basic data dictionary to the server. The user side can also receive the target data dictionary confirmed by the server.
Specifically, after receiving a data dictionary generation request from the user side, the server extracts the corresponding sample data from the corresponding database according to the sample data selected by the user side in the data dictionary generation request, and then performs data dictionary generation analysis. The sample data is service data in a certain field selected by the user side, for example, data corresponding to merchants.
S2: Performing word segmentation processing on the sample data to obtain initial word segmentation, and extracting the initial word segmentation whose part of speech is a noun as basic participles.
Specifically, this embodiment aims to extract every noun in the sample data. Word segmentation processing is therefore performed on the sample data to obtain the initial word segmentation, and part-of-speech tagging is performed on the initial word segmentation so that the participles whose part of speech is a noun can be screened out. Word segmentation approaches include, but are not limited to, Jieba word segmentation and Viterbi-algorithm word segmentation. Preferably, Jieba word segmentation is used, because it facilitates the subsequent part-of-speech tagging and thus the extraction of the basic participles.
Referring to fig. 3, fig. 3 shows an embodiment of step S2, which is described in detail as follows:
S21: Performing data cleaning on the sample data to obtain basic data.
Specifically, the sample data in the database is a collection of data oriented to a certain subject; it is extracted from a plurality of business systems and contains historical data. It is therefore unavoidable that some sample data is erroneous and some data conflicts with other data. Such erroneous or conflicting data is clearly unwanted, and the task of data cleaning is to filter out the data that does not meet requirements. Data cleaning (data cleansing) is a process of re-examining and verifying data, aiming at deleting duplicate information, correcting existing errors, and ensuring data consistency.
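As a non-limiting illustration, the cleaning of step S21 might be sketched as follows. The record layout (key/value pairs), the function name `clean_records`, and the policy of dropping conflicting keys rather than correcting them are all assumptions made for the example; the application does not specify them.

```python
def clean_records(records):
    """Drop empty and duplicate records, and records whose values conflict.

    `records` is a list of (key, value) string pairs (a hypothetical layout).
    A key that appears with two different values is treated as conflicting
    and is removed entirely.
    """
    values_by_key = {}
    for key, value in records:
        key, value = key.strip(), value.strip()
        if not key or not value:  # incomplete record: treated as erroneous
            continue
        values_by_key.setdefault(key, set()).add(value)

    # Keep only keys with exactly one distinct value (no conflict).
    # Duplicates collapse automatically because values are held in a set.
    return {k: next(iter(v)) for k, v in values_by_key.items() if len(v) == 1}


sample = [
    ("merchant_name", "Acme Diner"),
    ("merchant_name", "Acme Diner"),  # duplicate
    ("phone", "555-0100"),
    ("phone", "555-0199"),            # conflicts with the previous phone
    ("address", "  "),                # empty value
]
cleaned = clean_records(sample)
```

In a real deployment, conflicting values would more likely be resolved or corrected than simply discarded; discarding keeps the sketch short.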
S22: Performing word segmentation processing on the basic data by Jieba word segmentation to obtain the initial word segmentation.
Specifically, Jieba is an open-source Chinese word segmentation package characterized by high performance, accuracy and extensibility. It mainly supports Python at present, with ports available in other languages. It can segment text quickly and facilitates the subsequent part-of-speech tagging and extraction of basic participles, so Jieba word segmentation is used to segment the basic data.
S23: Performing part-of-speech tagging on the initial word segmentation to obtain tagged participles, and screening out the tagged participles whose part of speech is a noun as basic participles.
Specifically, tagged participles are obtained by performing part-of-speech tagging on each initial participle, so that each tagged participle carries its part of speech. Parts of speech are divided into content words and function words. Content words are words that carry actual meaning, and include nouns, verbs, adjectives, numerals, quantifiers, pronouns, status words and distinguishing words. Function words carry no actual meaning but express grammatical meaning, and include adverbs, prepositions, conjunctions, auxiliary words, interjections and onomatopoeia. Because function words carry no actual meaning, they do not need to be selected, and the data dictionary only needs the nouns among the content words. After the function words are removed, the tagged participles whose part of speech is a noun are selected as the basic participles.
In this embodiment, basic data is obtained by performing data cleaning on the sample data, the basic data is segmented by Jieba word segmentation to obtain the initial word segmentation, part-of-speech tagging is performed on the initial word segmentation to obtain tagged participles, and the tagged participles whose part of speech is a noun are screened out as the basic participles. The basic participles are thus extracted from the sample data, which facilitates the subsequent confirmation of associated participles and improves the data accuracy of the data dictionary.
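As a non-limiting illustration, the noun screening of step S23 might be sketched as follows. The sketch works on already-tagged (word, flag) pairs so that it has no third-party dependency; in practice such pairs could come from Jieba's part-of-speech module (`jieba.posseg.cut`, which yields items carrying a word and a flag). Treating every flag that begins with "n" as a noun is an assumption about the tag set, and the function name `extract_nouns` is hypothetical.

```python
def extract_nouns(tagged_words):
    """Return the distinct words whose part-of-speech flag marks a noun.

    `tagged_words` is an iterable of (word, flag) pairs. Flags beginning
    with "n" are treated as nouns (an assumption about the tag set).
    First-seen order is preserved.
    """
    nouns = []
    seen = set()
    for word, flag in tagged_words:
        if flag.startswith("n") and word not in seen:
            seen.add(word)
            nouns.append(word)
    return nouns


tagged = [
    ("merchant", "n"),
    ("provides", "v"),
    ("fresh", "a"),
    ("restaurant", "n"),
    ("and", "c"),
    ("merchant", "n"),  # repeated noun, kept only once
]
basic_participles = extract_nouns(tagged)
```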
S3: Performing synonym combination on the basic participles based on the similarity values of any two basic participles, and screening out the basic participles having an association relationship as associated participles.
Specifically, different basic participles may have the same meaning while using different characters, so that the same concept is represented by several different basic participles. Synonym combination therefore needs to be performed on the basic participles to avoid duplicate data and improve the data accuracy of the data dictionary. To this end, the similarity value of any two basic participles is calculated in a preset similarity calculation mode, and whether the two basic participles are synonyms is judged by checking whether the similarity value falls within a preset interval. Likewise, whether any two basic participles have an association relationship is judged from their similarity value.
The similarity value may be calculated using, but not limited to: Minkowski distance, Manhattan distance, Euclidean distance, cosine similarity, Hamming distance, and the like.
It should be noted that similarity values calculated from different distances are not on a uniform scale, so the specific similarity result needs to be mapped to the interval [0,1], such that the larger the similarity value is, the more similar the two basic participles are and the more likely they are to have the same meaning.
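As a non-limiting illustration, one common way of mapping a non-negative distance onto the interval [0,1] is shown below. The mapping 1 / (1 + d) is an assumption chosen for the example; the application does not state which mapping is used.

```python
def distance_to_similarity(distance):
    """Map a non-negative distance to a similarity value in (0, 1].

    A distance of 0 gives similarity 1, and the similarity decreases
    towards 0 as the distance grows. The formula 1 / (1 + d) is an
    illustrative choice, not one taken from the application.
    """
    if distance < 0:
        raise ValueError("distance must be non-negative")
    return 1.0 / (1.0 + distance)
```

With this mapping, identical participles (distance 0) receive the maximum similarity of 1, and any larger distance yields a strictly smaller value.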
Referring to fig. 4, fig. 4 shows an embodiment of step S3, which is described in detail as follows:
S31: Combining the basic participles in pairs to obtain basic participle combinations.
Specifically, the basic participles are combined pairwise so that every basic participle forms a basic participle combination with every other basic participle, which facilitates the subsequent judgment of whether any two basic participles are synonyms or associated participles.
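As a non-limiting illustration, the pairwise combination of step S31 can be sketched with the Python standard library. The function name `pair_participles` is hypothetical.

```python
from itertools import combinations


def pair_participles(participles):
    """Return every unordered pair of distinct positions in `participles`."""
    return list(combinations(participles, 2))


pairs = pair_participles(["meal", "restaurant", "accommodation"])
```

For n basic participles this produces n(n-1)/2 combinations, so the number of similarity calculations grows quadratically with the number of nouns extracted.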
S32: Calculating the similarity value of the two basic participles in each basic participle combination in a preset similarity calculation mode to obtain a target similarity value.
Specifically, the similarity value may be calculated using, but not limited to: Minkowski distance, Manhattan distance, Euclidean distance, cosine similarity, Hamming distance, and the like. In the embodiment of the present application, the Euclidean distance is preferred as the calculation method of the similarity value because it is simple and fast to compute. Euclidean distance, also called the Euclidean metric, is a commonly used distance definition: the true distance between two points in an m-dimensional space.
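As a non-limiting illustration, the Euclidean distance between two points in an m-dimensional space can be computed as below. The sketch assumes each basic participle has already been represented as a numeric vector; how that representation is obtained is not specified in the application, and the function name is hypothetical.

```python
import math


def euclidean_distance(u, v):
    """Return the Euclidean distance between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimension")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))
```

For example, `euclidean_distance([0, 0], [3, 4])` evaluates to 5.0, the length of the hypotenuse of a 3-4-5 right triangle.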
S33: If the target similarity value is within a first preset interval, judging that the two basic participles corresponding to the target similarity value are synonyms, and merging the two basic participles that are synonyms.
S34: If the target similarity value is within a second preset interval, judging that the two basic participles corresponding to the target similarity value have an association relationship, and taking the two basic participles having the association relationship as associated participles.
Specifically, when the target similarity value is high, that is, within the first preset interval, the two basic participles are highly similar; they are therefore judged to be synonyms and merged, that is, only one of them is retained. When the target similarity value is within the second preset interval, the two basic participles are similar and an association relationship exists between them, so they are taken as associated participles. If the target similarity value is in neither the first preset interval nor the second preset interval, the two basic participles are neither synonyms nor associated, and the basic participle combination is discarded.
For example, the two basic participles in a basic participle combination are "restaurant" and "restaurant" (two distinct terms in the source language). Their calculated similarity value is 0.98, which falls within the first preset interval [0.9,1], so they are judged to be synonyms and only one basic participle of the combination is retained. As another example, the two basic participles in a basic participle combination are "restaurant" and "restaurant" (again two distinct terms in the source language), and their calculated similarity value is 0.66, which falls within the second preset interval [0.5,0.9); the two basic participles are therefore judged to have an association relationship and are taken as associated participles.
It should be noted that the first preset interval and the second preset interval are set according to actual conditions, and the first preset interval lies above the second preset interval. In one embodiment, the first preset interval is [0.9,1] and the second preset interval is [0.5,0.9).
In this embodiment, the basic participles are combined pairwise to obtain basic participle combinations, the similarity value of the basic participles in each combination is calculated, and the relationship between the basic participles is determined according to which preset interval the similarity value falls in, so that the accuracy of data processing is improved.
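As a non-limiting illustration, the interval test of steps S33 and S34 might be sketched as follows, using the interval bounds 0.9 and 0.5 from the embodiment above. Treating the second interval as half-open, so that a value of exactly 0.9 counts as a synonym, is an assumption; the function name and the returned labels are hypothetical.

```python
SYNONYM_LOW, SYNONYM_HIGH = 0.9, 1.0          # first preset interval [0.9, 1]
ASSOCIATION_LOW, ASSOCIATION_HIGH = 0.5, 0.9  # second preset interval [0.5, 0.9)


def classify_pair(similarity):
    """Classify a basic participle combination by its similarity value."""
    if SYNONYM_LOW <= similarity <= SYNONYM_HIGH:
        return "synonym"     # S33: merge the pair, keeping one participle
    if ASSOCIATION_LOW <= similarity < ASSOCIATION_HIGH:
        return "associated"  # S34: keep both as associated participles
    return "discard"         # neither synonyms nor associated
```

Keeping the bounds as named constants reflects the statement that the intervals are set according to actual conditions.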
S4: Determining the dependency relationship of the associated participles, and filling the associated participles into a preset data dictionary according to the dependency relationship to obtain a basic data dictionary.
Specifically, among the basic participles having an association relationship, if the same basic participle has an association relationship with a plurality of other basic participles, it is determined that a subordinate (dependency) relationship exists between that basic participle and the other basic participles. For example, the basic participles include meal, restaurant, supermarket, accommodation and fresh food. Meal has an association relationship with each of restaurant, supermarket, accommodation and fresh food, so restaurant, supermarket, accommodation and fresh food are each in a subordinate relationship with meal. Once the dependency relationship of the associated participles is determined, the corresponding data items of the preset data dictionary are identified and the associated participles are filled into the preset data dictionary to obtain the basic data dictionary.
Referring to fig. 5, fig. 5 shows an embodiment of step S4, which is described in detail as follows:
S41: Among the associated participles, if the same basic participle has an association relationship with a plurality of other basic participles, determining that a dependency relationship exists between that basic participle and the plurality of other basic participles.
Specifically, whether a basic participle has a dependency relationship with a plurality of other basic participles is determined by judging whether that basic participle has an association relationship with each of them. Determining the dependency relationship of the basic participles makes it convenient to fill them into the data dictionary according to that relationship, which improves the data accuracy of the data dictionary.
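As a non-limiting illustration, the rule of step S41 might be sketched as follows: a participle that is associated with more than one other participle is treated as the parent of those participles. The function name and the returned mapping are hypothetical, and reading "a plurality" as "more than one" is an assumption.

```python
from collections import defaultdict


def find_dependencies(associated_pairs):
    """Map each parent participle to the sorted list of its subordinates.

    `associated_pairs` is an iterable of (a, b) pairs that were judged to
    have an association relationship.
    """
    neighbours = defaultdict(set)
    for a, b in associated_pairs:
        neighbours[a].add(b)
        neighbours[b].add(a)
    # Only participles associated with several others become parents.
    return {w: sorted(ns) for w, ns in neighbours.items() if len(ns) > 1}


associated = [
    ("meal", "restaurant"),
    ("meal", "accommodation"),
    ("meal", "fresh food"),
]
dependencies = find_dependencies(associated)
```

Here "meal" is associated with three other participles and therefore becomes their parent, while each of the others has only one association and is not a parent.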
S42: Identifying the data items of the preset data dictionary.
S43: Filling the associated participles into the corresponding data items of the preset data dictionary according to the dependency relationship to obtain the basic data dictionary.
Specifically, the preset data dictionary is created in advance, before step S4, and the associated participles having a dependency relationship are filled into the corresponding data items according to the rules of the preset data dictionary. The corresponding data items in the preset data dictionary are therefore identified first, and the associated participles are then filled into those data items according to the dependency relationship to obtain the basic data dictionary.
In this embodiment, among the associated participles, if the same basic participle has an association relationship with a plurality of other basic participles, the dependency relationship between that basic participle and the other basic participles is determined; the data items of the preset data dictionary are identified; and the associated participles are filled into the corresponding data items of the preset data dictionary according to the dependency relationship to obtain the basic data dictionary.
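As a non-limiting illustration, filling a preset data dictionary according to the dependency relationship might be sketched as follows. The layout of the preset data dictionary (a mapping with an "entries" list whose items have "item" and "sub_items" keys) is entirely hypothetical; the application does not describe the actual structure of its data items.

```python
import copy

# Hypothetical empty template standing in for the preset data dictionary.
PRESET_DICTIONARY = {"entries": []}


def fill_dictionary(preset, dependencies):
    """Return a copy of `preset` with one entry per parent participle.

    `dependencies` maps a parent participle to the list of participles
    subordinate to it. The template itself is left unmodified.
    """
    filled = copy.deepcopy(preset)
    for parent, children in dependencies.items():
        filled["entries"].append({"item": parent, "sub_items": list(children)})
    return filled


basic_dictionary = fill_dictionary(
    PRESET_DICTIONARY, {"meal": ["restaurant", "fresh food"]}
)
```

Copying the template before filling it keeps the preset data dictionary reusable for later requests.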
Further, in a specific implementation manner before step S4, this embodiment includes:
and creating a preset data dictionary according to a unified preset rule.
Specifically, the data dictionary is established in advance before the basic data dictionary is created. The fields are named by adopting a unified preset rule in the preset data dictionary, wherein the unified preset rule comprises a naming rule, a format, field lengths and the like, and the preset data dictionary has the functions of inquiry, addition, modification, deletion and the like. Because all database table fields are uniformly maintained in the data dictionary, the uniform standard of field naming is ensured, the condition that a plurality of field names in the same field or different naming styles (such as hump type, underline division and middle-line division) or different field lengths exist can not occur, so that the condition that all system database naming adopts uniform source data can be ensured, the condition that the name field is named by the A system and the name field is named by the B system as the Username can not occur.
In this embodiment, the preset data dictionary is created according to the unified preset rule, so that the data dictionary can be maintained according to the unified rule, data is prevented from being modified randomly, and the data accuracy of the data dictionary is improved.
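A preset data dictionary of this kind might be sketched as below. The concrete rule shown (lowercase underscore-separated names, a maximum length of 255) and the names `Field`, `PresetDataDictionary`, `NAME_RULE`, and `MAX_FIELD_LENGTH` are assumptions made for illustration; the application only states that a unified rule covers naming, format, and field length.

```python
import re
from dataclasses import dataclass, replace

# Hypothetical unified preset rule: snake_case names and a bounded field length.
NAME_RULE = re.compile(r"^[a-z][a-z0-9]*(_[a-z0-9]+)*$")
MAX_FIELD_LENGTH = 255


@dataclass(frozen=True)
class Field:
    name: str
    data_type: str
    length: int


class PresetDataDictionary:
    """Keeps every field definition in one place and validates it on entry."""

    def __init__(self):
        self._fields = {}

    @staticmethod
    def _validate(field):
        if not NAME_RULE.match(field.name):
            raise ValueError(f"field name {field.name!r} violates the naming rule")
        if not 0 < field.length <= MAX_FIELD_LENGTH:
            raise ValueError(f"field length {field.length} is out of range")

    def add(self, field):
        self._validate(field)
        if field.name in self._fields:
            raise KeyError(f"field {field.name!r} already exists")
        self._fields[field.name] = field

    def query(self, name):
        return self._fields.get(name)

    def modify(self, name, **changes):
        # Build the changed definition first so a rejected change leaves
        # the stored field untouched.
        updated = replace(self._fields[name], **changes)
        self._validate(updated)
        del self._fields[name]
        self._fields[updated.name] = updated

    def delete(self, name):
        del self._fields[name]


preset = PresetDataDictionary()
preset.add(Field("user_name", "varchar", 64))
```

With this rule, an attempt to add a field named `UserName` or `user-name` is rejected, which is the kind of style drift the unified rule is meant to rule out.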
S5: and returning the basic data dictionary to the user side so as to obtain the modified basic data dictionary returned by the user side.
Specifically, the server returns the created basic data dictionary to the user side, the user side fills and modifies the content of the basic data dictionary according to the actual situation after receiving the corresponding basic data dictionary, and after the processing is completed, the modified basic data dictionary is returned to the server.
S6: and matching the basic data dictionary with the modified basic data dictionary to obtain a matching result.
Specifically, the user side may fail to fill in data according to the preset format of the basic data dictionary and may modify data arbitrarily, which damages the source file of the basic data dictionary; arbitrarily filled data would also reduce the data accuracy of the generated target data dictionary. It is therefore necessary to determine whether the source file of the modified basic data dictionary returned by the user side has been damaged. The basic data dictionary is matched against the modified basic data dictionary to make this determination.
Further, the MD5 values of the basic data dictionary and the modified basic data dictionary are matched to obtain a matching result.
MD5 refers to the MD5 Message-Digest Algorithm, a widely used cryptographic hash function that generates a 128-bit (16-byte) hash value and is used here to check the integrity of transmitted messages.
Referring to fig. 6, fig. 6 shows an embodiment of step S6, which is described in detail as follows:
S61: The MD5 value of the basic data dictionary is calculated as the first matching value.
S62: the MD5 value of the modified base data dictionary is calculated as the second matching value.
Whether the source file information of the basic data dictionary has been modified is judged by calculating the first matching value and the second matching value. The MD5 value can be calculated by means including, but not limited to, preset calculation tools (e.g., MiniMD5_v1.1.exe and qlqmd5.exe) and the Windows command line (e.g., the certutil -hashfile command).
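Besides the tools listed above, the two matching values could also be computed programmatically. The following is a minimal sketch using Python's standard `hashlib` module; the helper names `md5_of_bytes` and `md5_of_file` and the sample content are illustrative and do not come from the application.

```python
import hashlib


def md5_of_bytes(data: bytes) -> str:
    """Return the MD5 digest of in-memory data as a 32-character hex string."""
    return hashlib.md5(data).hexdigest()


def md5_of_file(path, chunk_size=65536):
    """Return the MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Hypothetical dictionary contents before and after the user side edits them.
first_matching_value = md5_of_bytes(b"field_name,type,length\n")
second_matching_value = md5_of_bytes(b"field_name,type,length\n")
```

The hex strings produced here play the role of the first and second matching values compared in step S63.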
S63: and performing character string matching on the first matching value and the second matching value in a character string matching mode to obtain a matching result.
Specifically, the characters of the first matching value and the second matching value are compared one by one to detect whether the two values are completely consistent, so as to judge whether the source file of the basic data dictionary has been modified.
The string matching algorithm includes but is not limited to: the brute-force algorithm, the Rabin-Karp hash search algorithm, the Knuth-Morris-Pratt algorithm (KMP algorithm for short), the Boyer-Moore algorithm (BM algorithm for short), the Sunday algorithm, etc.
Preferably, the embodiment of the application adopts a Boyer-Moore algorithm to match the character strings, and the BM algorithm can skip more than one character at a time when the character strings do not match. I.e. it does not need to compare characters in the searched string one by one, but rather some parts of it are skipped. Generally, the longer the search key, the faster the algorithm. Its efficiency comes from the fact that: for each failed match attempt, the algorithm can use this information to exclude as many unmatched locations as possible. Namely, the method fully utilizes some characteristics of the character string to be searched, and accelerates the matching step.
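The skipping behaviour described above can be illustrated with a simplified Boyer-Moore search. Note that this sketch applies only the bad-character rule and omits the good-suffix rule of the full algorithm; the function name `boyer_moore_search` is chosen for illustration.

```python
def boyer_moore_search(text, pattern):
    """Return the index of the first occurrence of pattern in text, or -1.

    Simplified Boyer-Moore: bad-character rule only, no good-suffix rule.
    """
    m, n = len(pattern), len(text)
    if m == 0:
        return 0
    # Rightmost position of each character within the pattern.
    last = {ch: i for i, ch in enumerate(pattern)}
    shift = 0
    while shift <= n - m:
        j = m - 1
        # Compare from the right end of the pattern towards the left.
        while j >= 0 and pattern[j] == text[shift + j]:
            j -= 1
        if j < 0:
            return shift
        # Align the mismatched text character with its rightmost occurrence
        # in the pattern, or jump past it when it does not occur at all.
        shift += max(1, j - last.get(text[shift + j], -1))
    return -1


position = boyer_moore_search("here is a simple example", "example")
```

When the mismatched character does not occur in the pattern at all, the pattern jumps past it in one step, which is why longer search keys tend to be matched faster.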
In the embodiment, whether the source file of the basic data dictionary is modified or not is judged by calculating the first matching value and the second matching value and combining a character string matching mode, so that data in the data dictionary is prevented from being modified and damaged randomly, and the data accuracy of the data dictionary is improved.
Referring to fig. 7, fig. 7 shows an embodiment of step S63, which is described in detail as follows:
S631: One end of the character string of the first matching value is aligned with one end of the character string of the second matching value.
Specifically, one end of the character string of the first matching value is aligned with one end of the character string of the second matching value, so that the characters of the first matching value and the second matching value can be matched one by one conveniently.
S632: and matching the first matching value with the character string of the second matching value one by one from one end.
S633: and if the first character is successfully matched, continuing to match the subsequent characters until all characters are matched.
Specifically, the characters of the first matching value are matched against the characters of the second matching value one by one, starting from the aligned end. If any character fails to match, the matching process can be interrupted immediately and a matching failure determined; alternatively, matching can continue until all characters have been compared. If the first character is successfully matched, the subsequent characters are matched in turn until all characters have been matched.
S634: if at least one character cannot be matched, the matching result is that the MD5 values of the basic data dictionary and the modified basic data dictionary fail to be matched.
S635: If all the characters are successfully matched, the matching result is that the MD5 values of the basic data dictionary and the modified basic data dictionary are successfully matched.
Specifically, if at least one character cannot be matched, the first matching value and the second matching value contain different characters, that is, they are not the same character string, and it is determined that the MD5 values of the basic data dictionary and the modified basic data dictionary have failed to match. Only if all characters match successfully are the MD5 values of the basic data dictionary and the modified basic data dictionary determined to match successfully.
In this implementation, one end of the character string of the first matching value is aligned with one end of the character string of the second matching value, and the two strings are matched character by character from that end to obtain the matching result. The matching result indicates whether the MD5 values of the basic data dictionary and the modified basic data dictionary match, and hence whether the source file of the basic data dictionary has been modified. This prevents data in the data dictionary from being arbitrarily modified and damaged, and improves the data accuracy of the data dictionary.
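Steps S631 to S635 amount to a left-aligned, character-by-character comparison. A minimal sketch follows; the function name `match_md5_values` is hypothetical, and the early return corresponds to the option of interrupting the matching at the first failed character.

```python
def match_md5_values(first_value, second_value):
    """Compare two MD5 hex strings character by character from the left end.

    Returns True only if every character matches (S635); returns False as
    soon as one character differs (S634).
    """
    # Strings of different length cannot match in every position.
    if len(first_value) != len(second_value):
        return False
    for a, b in zip(first_value, second_value):
        if a != b:
            return False
    return True


unchanged = match_md5_values("900150983cd24fb0d6963f7d28e17f72",
                             "900150983cd24fb0d6963f7d28e17f72")
changed = match_md5_values("900150983cd24fb0d6963f7d28e17f72",
                           "d41d8cd98f00b204e9800998ecf8427e")
```

For two Python strings, the expression `first_value == second_value` yields the same result; the explicit loop is shown only to mirror the steps in the text.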
S7: and if the matching result is that the basic data dictionary and the modified basic data dictionary are successfully matched, taking the modified basic data dictionary as the target data dictionary.
Specifically, if the matching result is that the MD5 values of the basic data dictionary and the modified basic data dictionary are successfully matched, it indicates that the user side filled in the basic data dictionary strictly according to the preset rule, so the modified basic data dictionary is used as the target data dictionary.
In this embodiment, a data dictionary generation request is obtained from the user side, word segmentation processing is performed on the sample data to obtain initial participles, and the initial participles whose part of speech is a noun are extracted as basic participles, so that the nouns in the sample are accurately extracted and the subsequent analysis of associated participles is facilitated. Based on the similarity values of any two basic participles, synonyms among the basic participles are merged and the basic participles having an association relationship are screened out as associated participles, which yields the relationships among the basic participles and improves data accuracy. The dependency relationship of the associated participles is then determined, and the associated participles are filled into the preset data dictionary according to the dependency relationship to obtain the basic data dictionary. The basic data dictionary is returned to the user side, the modified basic data dictionary is received back, whether the source file of the basic data dictionary has been modified is judged by comparing MD5 values, and the target data dictionary is determined.
It is emphasized that, in order to further ensure the privacy and security of the sample data, the sample data may also be stored in a node of a block chain.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program, which can be stored in a computer-readable storage medium and which, when executed, can include the processes of the embodiments of the methods described above. The storage medium may be a non-volatile storage medium such as a magnetic disk, an optical disk, or a Read-Only Memory (ROM), or may be a Random Access Memory (RAM).
Referring to fig. 8, as an implementation of the method shown in fig. 2, the present application provides an embodiment of a device for constructing a data dictionary, where the embodiment of the device corresponds to the embodiment of the method shown in fig. 2, and the device may be applied to various electronic devices.
As shown in fig. 8, the data dictionary constructing apparatus of the present embodiment includes: a data dictionary generation request module 81, a basic participle extraction module 82, an associated participle screening module 83, a basic data dictionary acquisition module 84, a basic data dictionary returning module 85, a matching result generation module 86, and a target data dictionary determination module 87, wherein:
the data dictionary generation request module 81 is configured to obtain a data dictionary generation request from a user side, where the data dictionary generation request includes sample data selected by the user side;
the basic word segmentation extracting module 82 is configured to perform word segmentation processing on the sample data to obtain initial word segmentation, and extract an initial word segmentation with a part of speech being a noun from the initial word segmentation as a basic word segmentation;
the related participle screening module 83 is configured to perform synonym combination on the basic participles based on similarity values of any two basic participles, and screen out basic participles having a related relationship as related participles;
a basic data dictionary obtaining module 84, configured to determine a dependency relationship of the associated participles, and fill the associated participles into a preset data dictionary according to the dependency relationship, so as to obtain a basic data dictionary;
the basic data dictionary returning module 85 is used for returning the basic data dictionary to the user side so as to obtain the modified basic data dictionary returned by the user side;
a matching result generation module 86, configured to obtain a matching result by matching the basic data dictionary with the modified basic data dictionary;
and the target data dictionary determining module 87 is configured to, if the matching result is that the basic data dictionary and the modified basic data dictionary are successfully matched, use the modified basic data dictionary as the target data dictionary.
Further, the basic participle extracting module 82 includes:
the basic data acquisition unit is used for carrying out data cleaning on the sample data to obtain basic data;
the initial word segmentation acquisition unit is used for performing word segmentation processing on the basic data in a Jieba word segmentation mode to obtain initial words;
and the basic participle obtaining unit is used for performing part-of-speech tagging on the initial participles to obtain tagged participles, and screening the tagged participles whose part of speech is a noun from the tagged participles to serve as basic participles.
Further, the related word segmentation screening module 83 includes:
the basic word segmentation combination unit is used for combining every two basic word segmentations to obtain a basic word segmentation combination;
the target similarity value calculation unit is used for calculating the similarity value of two basic participles in each basic participle combination in a preset similarity calculation mode to obtain a target similarity value;
the basic participle merging unit is used for judging that two basic participles corresponding to the target similarity value are synonyms and merging the two basic participles which are the synonyms if the target similarity value is within a first preset interval;
and the associated participle acquiring unit is used for judging that two basic participles corresponding to the target similarity value have an association relationship if the target similarity value is within a second preset interval, and taking the two basic participles with the association relationship as associated participles.
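The division of labour among these units can be sketched as a single function. This is an illustration only: the interval bounds, the function name `classify_pairs`, and the toy similarity table are assumptions, since the text specifies neither the preset similarity calculation nor the two preset intervals.

```python
from itertools import combinations


def classify_pairs(words, similarity,
                   synonym_interval=(0.9, 1.0),
                   association_interval=(0.5, 0.9)):
    """Split pairwise combinations into synonym pairs and associated pairs.

    similarity: callable returning a score for two words (assumed in [0, 1]).
    The interval bounds are hypothetical defaults, not values from the text.
    """
    synonyms, associated = [], []
    # Combine the basic participles two by two.
    for a, b in combinations(words, 2):
        score = similarity(a, b)
        if synonym_interval[0] <= score <= synonym_interval[1]:
            synonyms.append((a, b))      # first preset interval
        elif association_interval[0] <= score < association_interval[1]:
            associated.append((a, b))    # second preset interval
    return synonyms, associated


# Toy similarity table standing in for the preset similarity calculation.
scores = {
    frozenset(("phone", "telephone")): 0.95,
    frozenset(("customer", "phone")): 0.6,
    frozenset(("customer", "telephone")): 0.6,
}


def toy_similarity(a, b):
    return scores.get(frozenset((a, b)), 0.0)


synonyms, associated = classify_pairs(["customer", "phone", "telephone"],
                                      toy_similarity)
```

In this toy run, `phone` and `telephone` fall in the first interval and would be merged, while both are associated with `customer`.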
Further, the basic data dictionary obtaining module 84 includes:
the subordinate relation confirming unit is used for confirming the subordinate relation between the same basic participle and other multiple basic participles if the same basic participle has the related relation with other multiple basic participles in the related participles;
a data item identification unit for identifying a data item of a preset data dictionary;
and the associated participle filling unit is used for filling the associated participles into the data items corresponding to the preset data dictionary according to the dependency relationship to obtain the basic data dictionary.
Further, before the basic data dictionary obtaining module 84, the following are also included:
and the preset data dictionary creating module is used for creating the preset data dictionary according to a unified preset rule.
Further, the matching result generating module 86 includes:
a first matching value calculating unit for calculating an MD5 value of the basic data dictionary as a first matching value;
a second matching value calculation unit for calculating the MD5 value of the modified basic data dictionary as a second matching value;
and the character string matching unit is used for performing character string matching on the first matching value and the second matching value in a character string matching mode to obtain a matching result.
Further, the character string matching unit includes:
a character string alignment subunit, configured to align one end of a character string of the first matching value and the second matching value;
the character matching subunit is used for matching the first matching value with the character string of the second matching value one by one from one end;
a character matching completion subunit, configured to, if the first character is successfully matched, continue to match subsequent characters until all characters are matched;
a first matching result subunit, configured to, if there is at least one character that cannot be matched, determine that the matching result is that the MD5 values of the basic data dictionary and the modified basic data dictionary fail to be matched;
and a second matching result subunit, configured to, if all the characters are successfully matched, determine that the matching result is that the MD5 values of the basic data dictionary and the modified basic data dictionary are successfully matched.
It is emphasized that, in order to further ensure the privacy and security of the sample data, the sample data may also be stored in a node of a block chain.
In order to solve the technical problem, an embodiment of the present application further provides a computer device. Referring to fig. 9, fig. 9 is a block diagram of a basic structure of a computer device according to the present embodiment.
The computer device 9 includes a memory 91, a processor 92, and a network interface 93 communicatively connected to each other via a system bus. It is noted that only the computer device 9 having the three components memory 91, processor 92, and network interface 93 is shown, but it is understood that not all of the shown components are required to be implemented, and more or fewer components may be implemented instead. As will be understood by those skilled in the art, the computer device is a device capable of automatically performing numerical calculation and/or information processing according to preset or stored instructions, and its hardware includes, but is not limited to, a microprocessor, an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA), a Digital Signal Processor (DSP), an embedded device, and the like.
The computer device may be a desktop computer, a notebook, a palm computer, a cloud server, or other computing devices. The computer equipment can carry out man-machine interaction with a user through a keyboard, a mouse, a remote controller, a touch panel or voice control equipment and the like.
The memory 91 includes at least one type of readable storage medium including a flash memory, a hard disk, a multimedia card, a card type memory (e.g., SD or DX memory, etc.), a Random Access Memory (RAM), a Static Random Access Memory (SRAM), a Read Only Memory (ROM), an Electrically Erasable Programmable Read Only Memory (EEPROM), a Programmable Read Only Memory (PROM), a magnetic memory, a magnetic disk, an optical disk, etc. In some embodiments, the memory 91 may be an internal storage unit of the computer device 9, such as a hard disk or a memory of the computer device 9. In other embodiments, the memory 91 may also be an external storage device of the computer device 9, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), and the like provided on the computer device 9. Of course, the memory 91 may also comprise both an internal storage unit of the computer device 9 and an external storage device thereof. In this embodiment, the memory 91 is generally used for storing an operating system installed in the computer device 9 and various types of application software, such as the program code of the construction method of a data dictionary. Further, the memory 91 can also be used to temporarily store various types of data that have been output or are to be output.
Processor 92 may be a Central Processing Unit (CPU), controller, microcontroller, microprocessor, or other data Processing chip in some embodiments. The processor 92 is typically used to control the overall operation of the computer device 9. In this embodiment, the processor 92 is configured to execute the program code stored in the memory 91 or process data, for example, the program code of the above-mentioned construction method of the data dictionary, so as to implement various embodiments of the construction method of the data dictionary.
The network interface 93 may include a wireless network interface or a wired network interface, and the network interface 93 is generally used to establish a communication connection between the computer device 9 and other electronic devices.
The present application further provides another embodiment, that is, a computer-readable storage medium is provided, and the computer-readable storage medium stores a computer program, which can be executed by at least one processor, so as to make the at least one processor execute the steps of the method for constructing a data dictionary.
Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner. Based on such understanding, the technical solutions of the present application may be embodied in the form of a software product, which is stored in a storage medium (such as ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a terminal device (such as a mobile phone, a computer, a server, an air conditioner, or a network device) to execute the method of the embodiments of the present application.
The block chain is a novel application mode of computer technologies such as distributed data storage, point-to-point transmission, a consensus mechanism, an encryption algorithm and the like. A block chain (Blockchain), which is essentially a decentralized database, is a series of data blocks associated by using a cryptographic method, and each data block contains information of a batch of network transactions, so as to verify the validity (anti-counterfeiting) of the information and generate a next block. The blockchain may include a blockchain underlying platform, a platform product service layer, an application service layer, and the like.
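The idea of data blocks associated by a cryptographic method can be illustrated with a minimal hash-linked chain. This sketch shows only the linking of blocks by hashes; it leaves out the consensus mechanism, the peer-to-peer network, and the platform layers mentioned above. The block layout, the choice of SHA-256, and the names `make_block` and `verify_chain` are assumptions made for illustration.

```python
import hashlib
import json


def make_block(data, previous_hash):
    """Create a block whose hash covers its data and the previous block's hash."""
    payload = json.dumps({"data": data, "previous_hash": previous_hash},
                         sort_keys=True)
    block_hash = hashlib.sha256(payload.encode("utf-8")).hexdigest()
    return {"data": data, "previous_hash": previous_hash, "hash": block_hash}


def verify_chain(chain):
    """Check every block's own hash and its link to the preceding block."""
    for index, block in enumerate(chain):
        expected = make_block(block["data"], block["previous_hash"])["hash"]
        if block["hash"] != expected:
            return False
        if index > 0 and block["previous_hash"] != chain[index - 1]["hash"]:
            return False
    return True


# Hypothetical sample-data batches stored as two linked blocks.
genesis = make_block("sample batch 1", "0" * 64)
second = make_block("sample batch 2", genesis["hash"])
chain = [genesis, second]
```

Altering the data of an earlier block changes the hash recomputed for it, so `verify_chain` reports the tampering.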
It is to be understood that the above-described embodiments are only some, not all, of the embodiments of the present application, and that the appended drawings illustrate preferred embodiments without limiting the scope of the application. The application may be embodied in many different forms; these embodiments are provided so that the disclosure of the application will be thoroughly understood. Although the present application has been described in detail with reference to the foregoing embodiments, those skilled in the art may still modify the technical solutions described in the foregoing embodiments or replace some of their technical features with equivalents. All equivalent structures made by using the contents of the specification and the drawings of the present application, whether applied directly or indirectly in other related technical fields, fall within the protection scope of the present application.

Claims (10)

1. A method for constructing a data dictionary is characterized by comprising the following steps:
acquiring a data dictionary generation request from a user side, wherein the data dictionary generation request comprises sample data selected by the user side;
performing word segmentation processing on the sample data to obtain initial word segmentation, and extracting initial word segmentation with part of speech as nouns from the initial word segmentation to be used as basic word segmentation;
based on the similarity values of any two basic participles, carrying out synonym combination on the basic participles, and screening out the basic participles with the association relationship to serve as association participles;
determining the dependency relationship of the associated participles, and filling the associated participles into a preset data dictionary according to the dependency relationship to obtain a basic data dictionary;
returning the basic data dictionary to the user side to obtain a modified basic data dictionary returned by the user side;
matching the basic data dictionary with the modified basic data dictionary to obtain a matching result;
and if the matching result is that the basic data dictionary and the modified basic data dictionary are successfully matched, taking the modified basic data dictionary as a target data dictionary.
2. The method according to claim 1, wherein the performing word segmentation processing on the sample data to obtain initial word segmentation, and extracting initial word segmentation whose part of speech is a noun from the initial word segmentation as basic word segmentation includes:
carrying out data cleaning on the sample data to obtain basic data;
performing word segmentation processing on the basic data by adopting a Jieba word segmentation mode to obtain the initial word segmentation;
and performing part-of-speech tagging on the initial participles to obtain tagged participles, and screening the tagged participles whose part of speech is a noun from the tagged participles to serve as the basic participles.
3. The method for constructing a data dictionary according to claim 1, wherein the performing synonym combination on the basic participles based on similarity values of any two of the basic participles, and screening out basic participles having an association relationship as associated participles comprises:
combining the basic participles pairwise to obtain basic participle combinations;
calculating the similarity value of two basic participles in each basic participle combination by a preset similarity calculation mode to obtain a target similarity value;
if the target similarity value is within a first preset interval, judging that two basic participles corresponding to the target similarity value are synonyms, and merging the two basic participles which are the synonyms;
and if the target similarity value is within a second preset interval, judging that the two basic participles corresponding to the target similarity value have an association relation, and taking the two basic participles with the association relation as association participles.
4. The method for constructing a data dictionary according to claim 1, wherein the determining the dependency relationship of the associated participles and filling the associated participles into a preset data dictionary according to the dependency relationship to obtain a basic data dictionary comprises:
in the associated participles, if the same basic participle has an associated relationship with other multiple basic participles, determining the subordinate relationship between the same basic participle and other multiple basic participles;
identifying data items of the preset data dictionary;
and filling the associated participles into data items corresponding to a preset data dictionary according to the dependency relationship to obtain a basic data dictionary.
5. The method for constructing a data dictionary according to claim 1, wherein before determining the dependency of the associated participles and filling the associated participles into a preset data dictionary according to the dependency to obtain a basic data dictionary, the method further comprises:
and creating the preset data dictionary according to a unified preset rule.
6. The method for constructing a data dictionary according to any one of claims 1 to 5, wherein the obtaining of the matching result by matching the basic data dictionary and the modified basic data dictionary comprises:
calculating an MD5 value of the basic data dictionary as a first matching value;
calculating the MD5 value of the modified basic data dictionary as a second matching value;
and performing character string matching on the first matching value and the second matching value in a character string matching mode to obtain a matching result.
7. The method for constructing a data dictionary according to claim 6, wherein the performing a string matching on the first matching value and the second matching value in a string matching manner to obtain a matching result comprises:
aligning one end of the character string of the first matching value and the second matching value;
matching the first matching value with the character strings of the second matching value one by one from one end;
if the first character is successfully matched, continuing to match the subsequent characters until all characters are matched;
if at least one character cannot be matched, the matching result is that the MD5 values of the basic data dictionary and the modified basic data dictionary fail to be matched;
and if all the characters are successfully matched, the matching result is that the MD5 values of the basic data dictionary and the modified basic data dictionary are successfully matched.
8. An apparatus for constructing a data dictionary, comprising:
the data dictionary generation request module is used for acquiring a data dictionary generation request from a user side, wherein the data dictionary generation request comprises sample data selected by the user side;
the basic word segmentation extraction module is used for carrying out word segmentation processing on the sample data to obtain initial word segmentation, and extracting initial word segmentation with part of speech as nouns from the initial word segmentation to serve as basic word segmentation;
the related participle screening module is used for carrying out synonym combination on the basic participles based on the similarity values of any two basic participles, screening out the basic participles with the related relation and taking the basic participles as related participles;
a basic data dictionary obtaining module, configured to determine a dependency relationship of the associated participles, and fill the associated participles into a preset data dictionary according to the dependency relationship, so as to obtain a basic data dictionary;
the basic data dictionary returning module is used for returning the basic data dictionary to the user side so as to obtain the modified basic data dictionary returned by the user side;
the matching result generation module is used for matching the basic data dictionary with the modified basic data dictionary to obtain a matching result;
and the target data dictionary determining module is used for taking the modified basic data dictionary as the target data dictionary if the matching result is that the basic data dictionary and the modified basic data dictionary are successfully matched.
9. A computer device comprising a memory in which a computer program is stored and a processor that implements the method of constructing a data dictionary according to any one of claims 1 to 7 when executing the computer program.
10. A computer-readable storage medium, characterized in that a computer program is stored thereon, which, when being executed by a processor, implements the method of constructing a data dictionary according to any one of claims 1 to 7.
CN202110737090.2A 2021-06-30 2021-06-30 Data dictionary construction method and device, computer equipment and storage medium Pending CN113177407A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110737090.2A CN113177407A (en) 2021-06-30 2021-06-30 Data dictionary construction method and device, computer equipment and storage medium

Publications (1)

Publication Number Publication Date
CN113177407A true CN113177407A (en) 2021-07-27

Family

ID=76927937

Country Status (1)

Country Link
CN (1) CN113177407A (en)

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101211352A (en) * 2006-12-28 2008-07-02 凌阳科技股份有限公司 Electronic dictionary data update system and its method
CN101782998A (en) * 2009-01-20 2010-07-21 复旦大学 Intelligent judging method for illegal on-line product information and system
CN104239367A (en) * 2013-06-21 2014-12-24 苏州精易会信息技术有限公司 Spreadsheet data management method based on B/S mode
CN103324611A (en) * 2013-07-03 2013-09-25 姚明东 Method of acquiring semantic relation of words in E-commerce field on the basis of progressive dimensionality reduction
US10614253B2 (en) * 2018-02-14 2020-04-07 Fortune Vieyra Systems and methods for state of data management
CN109542851A (en) * 2018-11-30 2019-03-29 北京金山云网络技术有限公司 File updating method, apparatus and system
CN112434506A (en) * 2020-11-25 2021-03-02 平安普惠企业管理有限公司 Electronic protocol signing processing method, device, computer equipment and medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
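SETH T. ROSS: "UNIX System Security Tools" (《UNIX系统安全工具》), 30 April 2000, China Machine Press (机械工业出版社) *
TU Min (涂敏) et al.: "Network Security and Management" (《网络安全与管理》), 28 February 2009, Jiangxi Universities and Colleges Press (江西高校出版社) *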

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113971210A (en) * 2021-12-27 2022-01-25 宇动源(北京)信息技术有限公司 Data dictionary generation method and device, electronic equipment and storage medium
CN113971210B (en) * 2021-12-27 2022-04-08 宇动源(北京)信息技术有限公司 Data dictionary generation method and device, electronic equipment and storage medium

Similar Documents

Publication Publication Date Title
AU2017408800B2 (en) Method and system of mining information, electronic device and readable storable medium
CN108959559B (en) Question and answer pair generation method and device
CN104008093A (en) Method and system for chinese name transliteration
WO2008103894A1 (en) Automated word-form transformation and part of speech tag assignment
CN113177407A (en) Data dictionary construction method and device, computer equipment and storage medium
EP3769240A1 (en) Machine translation locking using sequence-based lock/unlock classification
CN112182224A (en) Referee document abstract generation method and device, electronic equipment and readable storage medium
CN110019640B (en) Secret-related file checking method and device
US20080040352A1 (en) Method for creating a disambiguation database
CN111475700A (en) Data extraction method and related equipment
CN110837635A (en) Method, device, equipment and storage medium for equipment verification
US20160188569A1 (en) Generating a Table of Contents for Unformatted Text
WO2022105120A1 (en) Text detection method and apparatus from image, computer device and storage medium
CN110738056A (en) Method and apparatus for generating information
CN112650858A (en) Method and device for acquiring emergency assistance information, computer equipment and medium
CN112395391A (en) Concept graph construction method and device, computer equipment and storage medium
CN112613917A (en) Information pushing method, device and equipment based on user portrait and storage medium
CN112667208A (en) Translation error recognition method and device, computer equipment and readable storage medium
CN110569335B (en) Triple verification method and device based on artificial intelligence and storage medium
CN112559672B (en) Information detection method, electronic device and computer storage medium
CN113204613B (en) Address generation method, device, equipment and storage medium
CN108932326B (en) Instance extension method, device, equipment and medium
CN114091435A (en) Text content checking method and device, electronic equipment and storage medium
CN107644043B (en) Internet bank quick navigation setting method and system
CN111259262A (en) Information retrieval method, device, equipment and medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 2021-07-27)