CN115310462B - Metadata recognition translation method and system based on NLP technology - Google Patents
- Publication number
- CN115310462B CN202211237199.0A
- Authority
- CN
- China
- Prior art keywords
- output
- segmentation
- data
- translation
- preset
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/40—Processing or translation of natural language
- G06F40/58—Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/3332—Query translation
- G06F16/3337—Translation of the query language, e.g. Chinese to English
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/205—Parsing
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/237—Lexical tools
- G06F40/242—Dictionaries
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Computational Linguistics (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Artificial Intelligence (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- General Health & Medical Sciences (AREA)
- Databases & Information Systems (AREA)
- Data Mining & Analysis (AREA)
- Machine Translation (AREA)
Abstract
The application discloses a metadata recognition and translation method and system based on NLP technology. It relates mainly to the technical field of data recognition and translation and is intended to solve the problem that existing translation models cannot handle translation, misspellings and polysemous words well at the same time. The method comprises the following steps: detecting whether the input metadata is of an output text type, and determining that the input metadata of the output text type is a first output result; performing polysemous word processing on input metadata carrying sample data values to obtain a second output result; performing text segmentation on input metadata which does not carry sample data values to obtain a plurality of segmented data; identifying a preset text type corresponding to the segmentation data, and determining the segmentation output of the segmentation data; detecting whether the segmentation output is a polysemous word so as to perform polysemous word processing and obtain a third output result; and splicing the output results to complete the identification and translation of the input metadata. The method achieves translation together with the detection of spelling errors and polysemous words.
Description
Technical Field
The application relates to the technical field of metadata identification and translation, in particular to a metadata identification and translation method and system based on an NLP technology.
Background
With the advent of the big data era, the scope of data security risk has gradually spread from individuals and enterprises to industries and even countries. The premise of protecting data security is to correctly distinguish the data that must be kept confidential. Metadata identification, as the basis of data security control, identifies the surface meaning of translated metadata such as library names, table names and field names, and reflects the sensitivity, importance and compliance requirements of the data, so that data resources can then be classified, protected and managed using downstream technical means and protective measures.
At present, most enterprises and organizations use machine translation to identify and translate metadata such as library names, table names and field names. They label translations for the data of concern to their own business, combine these with open-source labeled data sets for model training, then deploy the trained translation service and return the translation results for downstream use. Metadata recognition and translation is therefore the foundation of data security products and plays a crucial role in the quality of downstream results and even of the final result.
Although machine translation models are relatively mature, they are mostly used in open domains and are trained on large labeled corpora. In practice, enterprise data is voluminous and complex and spans long periods, so accurate and complete labeling is difficult to achieve; permissions and business responsibilities also overlap between teams, so unambiguous translation results are hard to settle. In addition, over a long period databases often accumulate a great deal of data that does not follow naming standards, including pinyin, pinyin abbreviations and ad hoc names, and machine translation models do not handle such data, English shorthand or misspellings well.
Disclosure of Invention
In view of the above disadvantages in the prior art, the present invention provides a metadata recognition and translation method and system based on NLP technology to solve the above technical problems.
In a first aspect, the present application provides a metadata recognition translation method based on NLP technology, and the method includes: detecting whether the input metadata is of an output text type, and determining that the input metadata of the output text type is a first output result; detecting whether input metadata which is not of the output text type carries sample data values or not; performing polysemous word processing on input metadata carrying sample data values to obtain a second output result; performing text segmentation on input metadata which does not carry sample data values to obtain a plurality of segmented data; identifying a preset text type corresponding to the segmentation data, and determining segmentation output of the segmentation data based on a text translation algorithm corresponding to the preset text type; detecting whether the segmentation output is a polysemous word or not so as to perform polysemous word processing and obtain a third output result; and splicing the first output result, the second output result and the third output result to finish the identification and translation of the input metadata.
Further, performing text segmentation on the input metadata not carrying the sample data value to obtain a plurality of segmented data, specifically comprising: splitting input metadata which does not carry sample data values into first data which accords with an output text type and second data which does not accord with the output text type; adding a sequence key value in the first data and the second data; and sequentially splitting the second data based on a preset splitting rule to obtain a plurality of split data.
Further, before identifying a preset text type corresponding to the segmented data to determine a text translation algorithm corresponding to the segmented data based on the preset text type, the method further includes: constructing an algorithm module corresponding to a text translation algorithm, and acquiring training sample data to obtain a trained algorithm module; and carrying out model distillation processing on the trained algorithm module to obtain a text translation algorithm.
Further, determining the segmentation output of the segmentation data based on a text translation algorithm corresponding to the preset text type, specifically comprising: and inputting the segmentation data into a text translation algorithm so that the text translation algorithm determines segmentation output corresponding to the segmentation data from a preset dictionary library.
Further, the method further comprises: when the preset text type is English text and no segmentation output corresponding to the segmentation data exists in the preset dictionary library, querying, through an edit distance algorithm and a preset distance threshold, for a segmentation output whose edit distance from the segmentation data equals N; wherein the initial value of N is 1, and N is incremented by 1 after each query operation, until a segmentation output is found or N is greater than the preset distance threshold.
Further, the method further comprises: when N is greater than the preset distance threshold and no segmentation output has been found, acquiring, through a preset bidirectional maximum matching algorithm, the hidden words of the segmentation data that exist in the dictionary, and returning the hidden words as the segmentation output.
Further, detecting whether the segmentation output is a polysemous word so as to perform polysemous word processing and obtain a third output result specifically comprises: determining the similarity between the segmentation output and the context vocabulary through BERT Chinese pre-trained word vector representations and a cosine similarity algorithm, and taking the result with the highest similarity as the third output result; when the highest similarity is detected to be smaller than a preset similarity threshold, acquiring context information corresponding to the segmentation output through a trained reading comprehension model; and, based on the context information, selecting the third output result corresponding to the segmentation output on a word-selection, fill-in-the-blank basis.
Further, after detecting whether the segmentation output is a polysemous word so as to perform polysemous word processing and obtain a third output result, the method further includes: identifying the private data in the third output result through a preset private data processing algorithm; and encrypting the private data according to a preset private data processing rule; wherein the preset private data processing algorithm at least comprises: an NER named entity recognition model based on TinyBERT pre-training, an official mathematical algorithm and a regular expression matching method.
In a second aspect, the present application provides a metadata recognition and translation system based on NLP technology, the system including: a text filtering module, used for detecting whether the input metadata is of an output text type and determining that the input metadata of the output text type is a first output result; detecting whether input metadata which is not of the output text type carries sample data values; and performing polysemous word processing on input metadata carrying sample data values to obtain a second output result; a text splitting module, used for performing text segmentation on the input metadata which does not carry sample data values to obtain a plurality of segmented data, and identifying a preset text type corresponding to the segmentation data; a language identification module, used for determining the corresponding text translation algorithm based on the preset text type; a preset translation module, used for determining the segmentation output of the segmentation data, wherein the preset translation module at least includes: a pinyin abbreviation translation sub-module, an English dictionary translation sub-module and a pinyin translation sub-module; a polysemous word processing module, used for detecting whether the segmentation output is a polysemous word and performing polysemous word processing to obtain a third output result; and a result output module, used for splicing the first output result, the second output result and the third output result to complete the identification and translation of the input metadata.
Further, the system further comprises: an edit distance module and a maximum matching module; the edit distance module is used for querying, through an edit distance algorithm and a preset distance threshold, for a segmentation output whose edit distance from the segmentation data equals N when the preset text type is English text and no segmentation output corresponding to the segmentation data exists in the preset dictionary library; the initial value of N is 1, and N is incremented by 1 after each query operation, until a segmentation output is found or N is greater than the preset distance threshold; and the maximum matching module is used for acquiring, through a preset bidirectional maximum matching algorithm, the hidden words of the segmentation data that exist in the dictionary when N is greater than the preset distance threshold and no segmentation output has been found, and returning the hidden words as the segmentation output.
As can be appreciated by those skilled in the art, the present invention has at least the following beneficial effects:
The invention better solves the problem of database metadata identification by collecting, cleaning, matching and understanding the data to be identified, and it is not limited to English but also covers pinyin and pinyin abbreviations. The recognition efficiency is high: in verification, the translation time for 1000 pieces of data stays within 800 milliseconds, and the accuracy stays above 85%. The system also supports manual review and iterative correction, which improves its accuracy, flexibility and efficiency. The method meets the future needs of data security management and development, and can be continuously extended and quickly adapted to new industries.
Drawings
Some embodiments of the disclosure are described below with reference to the accompanying drawings, in which:
Fig. 1 is a flowchart of a metadata identification and translation method based on NLP technology according to an embodiment of the present application.
Fig. 2 is a schematic diagram of an internal structure of a metadata recognition translation system based on NLP technology according to an embodiment of the present application.
Detailed Description
It should be understood by those skilled in the art that the embodiments described below are only preferred embodiments of the present disclosure, and do not mean that the present disclosure can be implemented only by the preferred embodiments, which are merely for explaining the technical principles of the present disclosure and are not intended to limit the scope of the present disclosure. All other embodiments that can be derived by one of ordinary skill in the art from the preferred embodiments provided by the disclosure and that fall within the scope of the disclosure are intended to be encompassed by the present disclosure without any inventive step.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a … …" does not exclude the presence of another identical element in a process, method, article, or apparatus that comprises the element.
The technical solutions proposed in the embodiments of the present application are explained in detail below with reference to the accompanying drawings.
An embodiment of the present application provides a metadata identification and translation method based on an NLP technology, and as shown in fig. 1, the method provided by the embodiment of the present application mainly includes the following steps:
Step 110, detecting whether the input metadata is of an output text type, and determining that the input metadata of the output text type is a first output result; and detecting whether input metadata that is not of the output text type carries a sample data value.
It should be noted that the output text type may be a plain text type. The sample data value is a language classification data value.
Step 120, performing polysemous word processing on input metadata carrying sample data values to obtain a second output result; and performing text segmentation on the input metadata which does not carry the sample data value to obtain a plurality of segmented data.
Performing text segmentation on input metadata which does not carry sample data values to obtain a plurality of segmentation data, which specifically can be: splitting input metadata which does not carry sample data values into first data which accords with an output text type and second data which does not accord with the output text type; and sequentially splitting the second data based on a preset splitting rule to obtain a plurality of split data. It should be noted that the preset splitting rule may be any feasible method capable of performing semantic splitting.
As an example, for a character string containing English (second data), phrase splitting is performed first: the input phrase is split according to letter case, underscores or other special symbols. For instance, "yuangong_name" is split into yuangong and name; "UserName" is split into user and name; "jobName" is split into job and name; "startTime" is split into start and time; and "activity data" is split into "activity" and name, where the Chinese information (first data) is temporarily held back. After the full text has been translated, the parts are spliced together and the segmented data is returned.
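The splitting described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `split_metadata`, the regular expressions, and the `"keep"` / `"translate"` labels are assumptions made for this example, and the enumeration index stands in for the sequence key value that allows the parts to be spliced back in order.

```python
import re

# Assumed patterns: runs of CJK characters are treated as "first data" (kept as-is),
# everything else as "second data" (split on case changes, underscores, symbols).
_CJK = re.compile(r"[\u4e00-\u9fff]+")
_SEGMENT = re.compile(r"[\u4e00-\u9fff]+|[^\u4e00-\u9fff]+")
_TOKEN = re.compile(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+")


def split_metadata(text):
    """Return ordered (sequence_key, kind, parts) tuples for one metadata string."""
    segments = []
    for seq, match in enumerate(_SEGMENT.finditer(text)):
        chunk = match.group(0)
        if _CJK.fullmatch(chunk):
            # Chinese text already matches the output text type; hold it for splicing.
            segments.append((seq, "keep", [chunk]))
        else:
            # Split camelCase / snake_case into lowercase tokens for translation.
            tokens = [tok.lower() for tok in _TOKEN.findall(chunk)]
            segments.append((seq, "translate", tokens))
    return segments


if __name__ == "__main__":
    for sample in ("yuangong_name", "UserName", "jobName", "startTime"):
        print(sample, split_metadata(sample))
```

Keeping the sequence key with each segment means the translated tokens can later be re-joined with the held-back Chinese text in their original order.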
Step 130, identifying a preset text type corresponding to the segmentation data, and determining segmentation output of the segmentation data based on a text translation algorithm corresponding to the preset text type; and detecting whether the segmentation output is a polysemous word or not so as to perform polysemous word processing and obtain a third output result.
In order to make the algorithm have higher precision and more accurate result, before identifying the preset text type corresponding to the segmented data and determining the text translation algorithm corresponding to the segmented data based on the preset text type, the method may further include: constructing an algorithm module corresponding to a text translation algorithm, and acquiring training sample data to obtain a trained algorithm module; and carrying out model distillation processing on the trained algorithm module to obtain a text translation algorithm.
In addition, when the preset text type is English text and no segmentation output corresponding to the segmentation data exists in the preset dictionary library, a segmentation output whose edit distance from the segmentation data equals N can be queried through an edit distance algorithm and a preset distance threshold; the initial value of N is 1, and N is incremented by 1 after each query operation, until a segmentation output is found or N is greater than the preset distance threshold. When N is greater than the preset distance threshold and no segmentation output has been found, the hidden words of the segmentation data that exist in the dictionary are acquired through a preset bidirectional maximum matching algorithm and returned as the segmentation output. It should be noted that the preset distance threshold may be any feasible value.
In order to effectively obtain the third output result corresponding to the segmentation output, detecting whether the segmentation output is a polysemous word so as to perform polysemous word processing and obtain the third output result may specifically be: determining the similarity between the segmentation output and the context vocabulary through BERT Chinese pre-trained word vector representations and a cosine similarity algorithm, and taking the result with the highest similarity as the third output result; when the highest similarity is detected to be smaller than a preset similarity threshold, acquiring context information corresponding to the segmentation output through a trained reading comprehension model; and, based on the context information, selecting the third output result corresponding to the segmentation output on a word-selection, fill-in-the-blank basis. It should be noted that the preset similarity threshold may be any feasible value.
In addition, the present application can also identify private data and then de-identify it. Specifically, after detecting whether the segmentation output is a polysemous word so as to perform polysemous word processing and obtain a third output result, the method further includes: identifying the private data in the third output result through a preset private data processing algorithm; and encrypting the private data according to a preset private data processing rule; wherein the preset private data processing algorithm at least comprises: an NER named entity recognition model based on TinyBERT pre-training, an official mathematical algorithm and a regular expression matching method. It should be noted that the preset private data processing rule may specifically be:
(1) Data such as "names of people", "professions", "addresses" and "company names" can be recognized and classified as sensitive data through the NER named entity recognition model based on TinyBERT pre-training. Because the AI model is extensible, it can be extended later with additional training data as required, for example by adding special entity types such as "industry", "disease" and "medicine name".
(2) Unified social credit codes, business license numbers, organization codes, identity card numbers, bank card numbers and the like can be identified as sensitive data through official mathematical algorithms. These algorithms strictly follow the official rules, so the accuracy is high.
(3) Part of the data can be matched using regular expressions (adopting regular expression rules for which mature industry standards exist); telephone numbers, IP addresses, URL addresses and the like can be matched in this way. Due to the algorithmic nature of regular expressions, the matching rate is high.
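Rules (2) and (3) above can be illustrated with a short sketch. The text does not specify which checks are used, so the following are assumptions chosen for illustration: the weighted mod-11 check character of the 18-character resident identity number, the Luhn checksum commonly applied to bank card numbers, and a simple IPv4 pattern. All function names and labels are hypothetical, and the encryption step is not shown.

```python
import re

# Weights and check characters of the 18-character resident ID number (mod-11 scheme).
ID_WEIGHTS = (7, 9, 10, 5, 8, 4, 2, 1, 6, 3, 7, 9, 10, 5, 8, 4, 2)
ID_CHECK_CHARS = "10X98765432"

# Assumed IPv4 pattern; each octet is limited to 0-255.
IPV4 = re.compile(
    r"(?:(?:25[0-5]|2[0-4][0-9]|1?[0-9]?[0-9])\.){3}"
    r"(?:25[0-5]|2[0-4][0-9]|1?[0-9]?[0-9])"
)


def is_valid_resident_id(value):
    """Validate the check character of an 18-character resident ID number."""
    if not re.fullmatch(r"[0-9]{17}[0-9Xx]", value):
        return False
    total = sum(int(d) * w for d, w in zip(value[:17], ID_WEIGHTS))
    return ID_CHECK_CHARS[total % 11] == value[17].upper()


def passes_luhn(number):
    """Luhn checksum: double every second digit from the right, sum, test mod 10."""
    if not re.fullmatch(r"[0-9]+", number):
        return False
    total = 0
    for i, ch in enumerate(reversed(number)):
        digit = int(ch)
        if i % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0


def classify_sensitive(value):
    """Return a hypothetical sensitivity label for a sample value, or None."""
    if is_valid_resident_id(value):
        return "id_card"
    if IPV4.fullmatch(value):
        return "ip_address"
    if 12 <= len(value) <= 19 and passes_luhn(value):
        return "bank_card"
    return None
```

Checksum-based rules reject most random digit strings of the right length, which is one reason such checks tend to produce fewer false positives than length-only patterns.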
Step 140, splicing the first output result, the second output result and the third output result to complete the identification and translation of the input metadata.
In addition, fig. 2 is a metadata recognition translation system based on NLP technology according to an embodiment of the present application. As shown in fig. 2, the system provided in the embodiment of the present application mainly includes:
The text filtering module 210 may be any feasible module capable of filtering each character string entering the system: a pure Chinese character string is filtered out and returned as it is, and only character strings containing English are processed further. It is mainly used for detecting whether input metadata is of an output text type, and determining that the input metadata of the output text type is a first output result; detecting whether input metadata which is not of the output text type carries a sample data value; and performing polysemous word processing on the input metadata carrying the sample data value to obtain a second output result.
The text splitting module 220 is configured to perform text segmentation on the input metadata that does not carry sample data values, so as to obtain a plurality of segmented data; and identifying a preset text type corresponding to the segmentation data.
As an example, for a character string containing English, phrase splitting is performed first: the input phrase is split according to letter case, underscores or other special symbols. For instance, "yuangong_name" is split into yuangong and name; "UserName" is split into user and name; "jobName" is split into job and name; "startTime" is split into start and time; and "activity data" is split into "activity" and name, where the Chinese information is temporarily held back. After the full text has been translated, the parts are spliced together and the segmented data is returned.
The language identification module 230 is configured to identify a corresponding text translation algorithm based on a preset text type.
As an example, the TinyBERT pre-trained model can be used to train a language classification model that performs language identification. Model distillation is then applied to the trained model so that the essence learned by the original model is retained while the model size is reduced: compared with the original model, the size drops by more than half, from 50M to 20M, and the recognition speed of the distilled model improves by 30%. The language identification model classifies character strings mainly as English, pinyin or pinyin abbreviations, and translation is then carried out by a different downstream method for each result. The recognition accuracy of the current language identification model is more than 99%. For example, name can be identified as an English word, yuangong as pinyin, and cd as a pinyin abbreviation.
The preset translation module 240 is configured to determine the segmentation output of the segmentation data; the preset translation module 240 at least includes: a pinyin abbreviation translation sub-module, an English dictionary translation sub-module and a pinyin translation sub-module.
Among the sub-modules of the preset translation module 240, the English dictionary translation sub-module searches the dictionary library for results matching the words identified as English. The dictionary library is built in advance from a large amount of collected data, including open-source data from the network, English dictionaries and desensitized enterprise data; after cleaning and manual review, reasonable and common phrases are retained and a root library of minimum granularity is constructed. The dictionary library can provide translation results by direct mapping, such as name: "name" (two Chinese renderings); job: "work", "task". The pinyin translation sub-module handles character strings identified as pinyin: vertical-domain corpora are collected and cleaned, and a pinyin recognition model based on a hidden Markov model is trained to perform pinyin translation. For example, "yuangong" can be translated into "staff" and "shenfenzhenghao" into "identity card number", and the translation result is returned as the segmentation output. The pinyin abbreviation translation sub-module handles character strings identified as pinyin abbreviations: common and sensitive words are collected in advance to construct a pinyin abbreviation dictionary table, such as "sfz" -> "identity card" and "cd" -> "achievement, length, degree". When a pinyin abbreviation character string hits the dictionary table, the corresponding result is returned as the segmentation output.
When the preset translation module 240 cannot obtain a result by direct dictionary mapping, the edit distance module 250 can use the edit distance algorithm to query for a result whose edit distance is less than or equal to the preset distance threshold, and return the recognition result. The preset distance threshold is configurable; it can be set to 1 or 2 as required, and 1 is the suggested value. The edit distance covers three kinds of operation, namely insertion, deletion and substitution, and each operation adds 1 to the edit distance, until a segmentation output is found or the distance is greater than the preset distance threshold. For example, with a preset distance threshold of 1, a user who intends to enter name may misspell it as nama. Changing the last character "a" to "e" is the minimum operation that turns it into the correct result name; since only this one substitution is needed, the edit distance is 1, which is within the preset distance threshold, and "name" is considered the best matching result (segmentation output).
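The incremental edit distance query can be sketched as follows. This is a minimal sketch under stated assumptions: the dictionary is modelled as a plain collection of English words, the function names `edit_distance` and `fuzzy_lookup` are hypothetical, and ties at the same distance are broken alphabetically, which the text does not specify.

```python
def edit_distance(a, b):
    """Levenshtein distance using insertion, deletion and substitution."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute ca -> cb
        prev = curr
    return prev[-1]


def fuzzy_lookup(token, dictionary_words, max_distance=1):
    """Query with N = 1, 2, ... until a hit is found or N exceeds the threshold."""
    n = 1
    while n <= max_distance:
        hits = sorted(w for w in dictionary_words if edit_distance(token, w) == n)
        if hits:
            return hits[0]  # assumed tie-break: first word alphabetically
        n += 1
    return None  # caller falls back to bidirectional maximum matching


if __name__ == "__main__":
    print(fuzzy_lookup("nama", {"name", "job", "start", "time"}))
```

Starting at N = 1 and stopping at the first hit returns the closest dictionary word without scoring candidates that are further away than necessary.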
For data that the edit distance module 250 cannot match, the maximum matching module 260 applies the bidirectional maximum matching algorithm to find hidden words that exist in the dictionary and returns the recognition result. For example, when "jgldname" is encountered, the bidirectional maximum matching algorithm can match "name" in the dictionary base for translation. The system keeps the best matching result during bidirectional maximum matching. For example, where the words "up", "date", "update", and "team" can all be matched, the match [update] is judged, based on the preset vocabulary weights, to be clearly better than [up, date] and [up, team]. The preset vocabulary weights are values set by a person skilled in the art according to actual conditions.
A polysemous word processing module 270, configured to detect whether the segmentation output is a polysemous word, and to perform polysemous word processing to obtain a third output result.
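A minimal sketch of bidirectional maximum matching with vocabulary weights follows. It assumes greedy longest-match scans in both directions, treats unmatched characters as single-character pieces, and prefers the segmentation with the higher total weight (then fewer pieces). The weights, the scoring rule, and the sample input "updateam" are hypothetical choices for illustration.

```python
from typing import Dict, List, Set


def forward_max_match(text: str, vocab: Set[str], max_len: int) -> List[str]:
    """Scan left to right, taking the longest dictionary word at each step."""
    pieces, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            cand = text[i:i + size]
            if size == 1 or cand in vocab:  # fall back to one character
                pieces.append(cand)
                i += size
                break
    return pieces


def backward_max_match(text: str, vocab: Set[str], max_len: int) -> List[str]:
    """Scan right to left, taking the longest dictionary word at each step."""
    pieces, j = [], len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            cand = text[j - size:j]
            if size == 1 or cand in vocab:
                pieces.append(cand)
                j -= size
                break
    pieces.reverse()
    return pieces


def hidden_words(text: str, weights: Dict[str, float]) -> List[str]:
    """Return the dictionary words of the better of the two segmentations."""
    vocab = set(weights)
    max_len = max(len(w) for w in vocab)
    candidates = [forward_max_match(text, vocab, max_len),
                  backward_max_match(text, vocab, max_len)]

    def score(pieces: List[str]):
        total = sum(weights[p] for p in pieces if p in vocab)
        return (total, -len(pieces))  # higher weight first, then fewer pieces

    best = max(candidates, key=score)
    return [p for p in best if p in vocab]


# Hypothetical preset vocabulary weights.
WEIGHTS = {"up": 0.2, "date": 0.3, "update": 1.0, "team": 0.5, "name": 1.0}
```

Here `hidden_words("jgldname", WEIGHTS)` recovers "name", and for the hypothetical input "updateam" the forward scan's [update] outweighs the backward scan's [up, team].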
Illustratively, for polysemous words, the polysemous word processing module 270 represents the matched and translated vocabulary with BERT Chinese pre-trained word vectors, calculates the similarity between that vocabulary and the context vocabulary with a cosine similarity algorithm, and recalls the result that best fits the context. For example, "chengdu" may translate to "Chengdu" or "degree"; the similarity is calculated from the context information and the result with the highest similarity score is returned. If words such as "plan" or "town and country" appear in the context, the translation result "Chengdu" is returned.
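The context-based selection can be illustrated with a small sketch. It does not load BERT; instead, hand-written two-dimensional toy vectors stand in for the pre-trained word vectors, and the score of a candidate is assumed to be its mean cosine similarity to the context words. The vectors, the averaging rule, and the function names are all hypothetical.

```python
import math
from typing import Dict, List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def pick_sense(candidates: List[str],
               context_words: List[str],
               vectors: Dict[str, Sequence[float]]) -> str:
    """Return the candidate with the highest mean similarity to the context."""
    def score(sense: str) -> float:
        sims = [cosine(vectors[sense], vectors[w])
                for w in context_words if w in vectors]
        return sum(sims) / len(sims) if sims else 0.0

    return max(candidates, key=score)


# Toy vectors standing in for pre-trained embeddings (illustrative only).
TOY_VECTORS = {
    "Chengdu": [0.9, 0.1],
    "degree": [0.1, 0.9],
    "plan": [0.8, 0.3],
    "town and country": [0.95, 0.2],
}
```

With these toy vectors, a context of "plan" and "town and country" selects "Chengdu" over "degree", mirroring the example in the description.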
A result output module 280, configured to splice the first output result, the second output result, and the third output result to complete recognition and translation of the input metadata.
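One way to picture the splicing step, together with the sequence key values recited in claim 1, is the sketch below. It assumes that "Chinese information" can be approximated by the CJK Unified Ideographs range U+4E00 to U+9FFF, that the sequence key is simply the position of each run, and that a caller-supplied `translate` function handles the non-Chinese runs. These choices, the function names, and the sample strings are hypothetical.

```python
import re
from typing import Callable, Iterable, List, Tuple

# Assumed test for Chinese text: the basic CJK Unified Ideographs block.
_CJK = r"\u4e00-\u9fff"
_RUNS = re.compile(f"[{_CJK}]+|[^{_CJK}]+")
_IS_CHINESE = re.compile(f"[{_CJK}]")

Piece = Tuple[int, str, bool]  # (sequence key, text, is_chinese)


def split_with_keys(text: str) -> List[Piece]:
    """Split into alternating Chinese / non-Chinese runs with sequence keys."""
    return [(key, m.group(), bool(_IS_CHINESE.match(m.group())))
            for key, m in enumerate(_RUNS.finditer(text))]


def splice(pieces: Iterable[Piece], translate: Callable[[str], str]) -> str:
    """Translate non-Chinese runs and rejoin everything in key order."""
    ordered = sorted(pieces, key=lambda p: p[0])
    return "".join(run if is_chinese else translate(run)
                   for _, run, is_chinese in ordered)
```

Because every piece carries its key, the pieces can be processed in any order and still be spliced back into the original sequence.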
So far, the technical solutions of the present disclosure have been described in connection with the foregoing embodiments, but it is easily understood by those skilled in the art that the scope of the present disclosure is not limited to only these specific embodiments. The technical solutions in the above embodiments can be split and combined, and equivalent changes or substitutions can be made on related technical features by those skilled in the art without departing from the technical principles of the present disclosure, and any changes, equivalents, improvements, and the like made within the technical concept and/or technical principles of the present disclosure will fall within the protection scope of the present disclosure.
Claims (7)
1. A metadata recognition translation method based on NLP technology is characterized by comprising the following steps:
detecting whether the input metadata is of an output text type, and determining that the input metadata of the output text type is a first output result; detecting whether input metadata which is not of an output text type carries a sample data value;
performing polysemous word processing on input metadata carrying sample data values to obtain a second output result; performing text segmentation on input metadata which does not carry sample data values to obtain a plurality of segmented data;
wherein performing text segmentation on the input metadata which does not carry sample data values to obtain a plurality of segmentation data specifically comprises: splitting the input metadata which does not carry sample data values into first data which conforms to the output text type and second data which does not conform to the output text type; adding a sequence key value to the first data and the second data; and sequentially splitting the second data based on a preset splitting rule to obtain the plurality of segmentation data; wherein data conforming to the output text type is Chinese information, and data not conforming to the output text type is non-Chinese information;
identifying a preset text type corresponding to the segmentation data, and determining the segmentation output of the segmentation data based on a text translation algorithm corresponding to the preset text type; when the preset text type is English text and no segmentation output corresponding to the segmentation data exists in a preset dictionary library, querying, through an edit distance algorithm and a preset distance threshold, for a segmentation output whose edit distance from the segmentation data equals N; wherein the initial value of N is 1, and after each query operation N is incremented by 1, until the segmentation output is obtained by the query or N is greater than the preset distance threshold;
detecting whether the segmentation output is a polysemous word or not so as to perform polysemous word processing and obtain a third output result;
and splicing the first output result, the second output result and the third output result to finish the identification and translation of the input metadata.
2. The NLP technology-based metadata recognition translation method according to claim 1, wherein before recognizing the preset text type corresponding to the sliced data to determine the text translation algorithm corresponding to the sliced data based on the preset text type, the method further comprises:
constructing an algorithm module corresponding to a text translation algorithm, and acquiring training sample data to obtain a trained algorithm module;
and carrying out model distillation processing on the trained algorithm module to obtain a text translation algorithm.
3. The NLP technology-based metadata recognition and translation method according to claim 1, wherein determining the segmentation output of the segmented data based on the text translation algorithm corresponding to the preset text type specifically includes:
and inputting the segmentation data into a text translation algorithm so that the text translation algorithm determines the segmentation output corresponding to the segmentation data from a preset dictionary library.
4. The NLP technology-based metadata recognition translation method according to claim 1, further comprising:
and when the N is larger than a preset distance threshold value and no segmentation output is found, acquiring the hidden words of the segmentation data existing in the dictionary through a preset maximum bidirectional matching algorithm, and returning the hidden words as segmentation output.
5. The NLP technology-based metadata recognition translation method according to claim 1, wherein detecting whether the segmentation output is a polysemous word, so as to perform polysemous word processing, and obtain a third output result specifically includes:
determining the similarity between the segmentation output and the context vocabulary through BERT Chinese pre-trained word vector representation and a cosine similarity algorithm, and acquiring the result with the highest similarity as the third output result;
when the highest similarity is detected to be smaller than a preset similarity threshold, acquiring context information corresponding to the segmentation output through a trained reading comprehension accumulation model; and, based on the context information, selecting the third output result corresponding to the segmentation output by word-selection fill-in-the-blank.
6. The NLP technology-based metadata recognition translation method according to claim 1, wherein after detecting whether the segmentation output is a polysemous word or not for performing polysemous word processing to obtain a third output result, the method further comprises:
identifying the privacy data in the third output result through a preset privacy data processing algorithm; and encrypting the privacy data through a preset privacy data processing rule; wherein the preset privacy data processing algorithm at least comprises: an NER named entity recognition model based on TinyBERT pre-training, an official mathematical algorithm, and a regular expression matching method.
7. A metadata recognition translation system based on NLP technology, the system comprising:
the text filtering module is used for detecting whether the input metadata is of an output text type and determining that the input metadata of the output text type is a first output result; detecting whether input metadata which is not of the output text type carries sample data values or not; performing polysemous word processing on input metadata carrying sample data values to obtain a second output result;
the text splitting module is used for performing text splitting on the input metadata which does not carry the sample data value to obtain a plurality of split data; identifying a preset text type corresponding to the segmentation data;
a language identification module, used for identifying the language type so as to determine the text translation algorithm corresponding to the preset text type;
the preset translation module is used for determining the segmentation output of the segmentation data; wherein, the preset translation module at least comprises: a pinyin abbreviation translation sub-module, an English dictionary translation sub-module and a pinyin translation sub-module;
the polysemous word processing module is used for detecting whether the segmentation output is a polysemous word, and performing polysemous word processing to obtain a third output result;
the result output module is used for splicing the first output result, the second output result and the third output result to finish the identification and translation of the input metadata;
the system further comprises: an edit distance module and a maximum matching module;
the edit distance module is used for querying, through an edit distance algorithm and a preset distance threshold, for a segmentation output whose edit distance from the segmentation data equals N, when the preset text type is English text and no segmentation output corresponding to the segmentation data exists in a preset dictionary library; wherein the initial value of N is 1, and after each query operation N is incremented by 1, until the segmentation output is obtained by the query or N is greater than the preset distance threshold;
and the maximum matching module is used for, when N is greater than the preset distance threshold and no segmentation output has been found, acquiring hidden words of the segmentation data that exist in the dictionary through a preset maximum bidirectional matching algorithm, and returning the hidden words as the segmentation output.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211237199.0A CN115310462B (en) | 2022-10-11 | 2022-10-11 | Metadata recognition translation method and system based on NLP technology |
Publications (2)
Publication Number | Publication Date |
---|---|
CN115310462A (en) | 2022-11-08 |
CN115310462B (en) | 2023-03-24 |
Family
ID=83867770
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202211237199.0A | Metadata recognition translation method and system based on NLP technology (CN115310462B, Active) | 2022-10-11 | 2022-10-11 |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115310462B (en) |
Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2019229769A1 (en) * | 2018-05-28 | 2019-12-05 | Thottapilly Sanjeev | An auto-disambiguation bot engine for dynamic corpus selection per query |
Family Cites Families (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7478033B2 (en) * | 2004-03-16 | 2009-01-13 | Google Inc. | Systems and methods for translating Chinese pinyin to Chinese characters |
US20070106499A1 (en) * | 2005-08-09 | 2007-05-10 | Kathleen Dahlgren | Natural language search system |
US8731901B2 (en) * | 2009-12-02 | 2014-05-20 | Content Savvy, Inc. | Context aware back-transliteration and translation of names and common phrases using web resources |
CN107608973A (en) * | 2016-07-12 | 2018-01-19 | 华为技术有限公司 | A kind of interpretation method and device based on neutral net |
CN108563645B (en) * | 2018-04-24 | 2022-03-22 | 成都智信电子技术有限公司 | Metadata translation method and device of HIS (hardware-in-the-system) |
CN110210026B (en) * | 2019-05-29 | 2023-05-26 | 北京百度网讯科技有限公司 | Speech translation method, device, computer equipment and storage medium |
US12032923B2 (en) * | 2020-07-09 | 2024-07-09 | Samsung Electronics Co., Ltd. | Electronic device and method for translating language |
CN112487793A (en) * | 2020-11-27 | 2021-03-12 | 江苏省舜禹信息技术有限公司 | Pre-translation editing method and system for machine translation of patent text |
CN113723116B (en) * | 2021-08-25 | 2024-02-13 | 中国科学技术大学 | Text translation method and related device, electronic equipment and storage medium |
Legal Events
Code | Title | Description |
---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |
PE01 | Entry into force of the registration of the contract for pledge of patent right | Denomination of invention: A metadata recognition and translation method and system based on NLP technology. Granted publication date: 2023-03-24. Pledgee: Jinan Branch of Qingdao Bank Co.,Ltd. Pledgor: ZHONGFU INFORMATION Co.,Ltd. Registration number: Y2024980023941 |