WO2022142703A1 - Standardization processing method and apparatus for text, and electronic device and computer medium - Google Patents

Standardization processing method and apparatus for text, and electronic device and computer medium

Info

Publication number
WO2022142703A1
Authority
WO
WIPO (PCT)
Prior art keywords
text
information
historical
classification
original
Prior art date
Application number
PCT/CN2021/127971
Other languages
French (fr)
Chinese (zh)
Inventor
滕召荣
刘斌
郝东林
Original Assignee
医渡云(北京)技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 医渡云(北京)技术有限公司 filed Critical 医渡云(北京)技术有限公司
Publication of WO2022142703A1 publication Critical patent/WO2022142703A1/en

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H50/00 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H50/80 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for detecting, monitoring or modelling epidemics or pandemics, e.g. flu
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/237 Lexical tools
    • G06F40/242 Dictionaries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/237 Lexical tools
    • G06F40/247 Thesauruses; Synonyms
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking

Definitions

  • the present disclosure relates to the technical field of data processing, and in particular, to a method for standardizing text, an apparatus for standardizing text, an electronic device, and a computer-readable medium.
  • the purpose of the present disclosure is to provide a text normalization processing method, a text normalization processing device, an electronic device, and a computer-readable medium, so as to improve the efficiency and accuracy of text normalization at least to a certain extent.
  • a method for standardizing text comprising:
  • the original information text includes the original text to be processed
  • the standardized text corresponding to the original text is obtained according to the standard text components.
  • an apparatus for standardizing text comprising:
  • an original information text acquisition module configured to acquire original information text, the original information text including the original text to be processed;
  • an original information text matching module configured to perform matching on the original information text according to a pre-generated information text thesaurus to obtain a target text corresponding to the original text in the original information text;
  • a valid text component acquisition module configured to perform word segmentation processing on the target text, to obtain each valid text component contained in the target text;
  • a standard text component determination module configured to acquire a pre-generated text component rule set, and to use the valid text components that do not belong to the text component rule set among the valid text components as standard text components;
  • a standardized text generation module configured to obtain standardized text corresponding to the original text according to the standard text components.
  • an electronic device comprising: a processor; and a memory for storing executable instructions of the processor; wherein the processor is configured to execute the executable instructions to perform the text standardization processing method described in any one of the above.
  • a computer-readable medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements any one of the above-described text standardization processing methods.
  • FIG. 1 shows a schematic flowchart of Malay name normalization according to a related embodiment of the present disclosure
  • FIG. 2 shows a schematic flowchart of a text standardization processing method according to an exemplary embodiment of the present disclosure
  • FIG. 3 shows a schematic flowchart of a method for generating an information text thesaurus according to an exemplary embodiment of the present disclosure
  • FIG. 4 shows a schematic flowchart of obtaining multiple sets of similar information text sets according to an exemplary embodiment of the present disclosure
  • FIG. 5 shows a schematic flowchart of a method for generating a text component rule set according to an exemplary embodiment of the present disclosure
  • FIG. 6 shows a schematic flowchart of a method for standardizing text in a specific embodiment of the present disclosure
  • FIG. 7 shows a schematic flowchart of a method for generating an information text thesaurus according to a specific embodiment of the present disclosure
  • FIG. 8 shows a schematic flowchart of a method for generating a text component rule set according to an embodiment of the present disclosure
  • FIG. 9 shows a block diagram of a text normalization processing apparatus according to an exemplary embodiment of the present disclosure.
  • FIG. 10 shows a schematic structural diagram of a computer system suitable for implementing an electronic device of an embodiment of the present disclosure.
  • Example embodiments will now be described more fully with reference to the accompanying drawings.
  • Example embodiments can be embodied in various forms and should not be construed as limited to the examples set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of example embodiments to those skilled in the art.
  • the described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments.
  • numerous specific details are provided in order to give a thorough understanding of the embodiments of the present disclosure.
  • those skilled in the art will appreciate that the technical solutions of the present disclosure may be practiced without one or more of the specific details, or other methods, components, devices, steps, etc. may be employed.
  • well-known solutions have not been shown or described in detail to avoid obscuring aspects of the present disclosure.
  • Normalization refers to the standardization of data: after different data are processed by a normalization algorithm, they are converted into the same standard data. Data normalization is one direction of NLP (Natural Language Processing) technology; it refers to normalizing data by NLP technical means.
  • a typical application is to use a person's name, date of birth and gender to calculate a unique identifier for that person.
  • personal names in the Malay language system are unusual; their most notable feature is that a name keeps changing as the person ages.
  • for example, a marker of adulthood is added to a Malay name when the person comes of age; when a social title is obtained, a marker of that title is added; and when the person has travelled to a religious holy place, an identifier of that place is added to the name.
  • Such name changes make computing a unique identifier for a person very challenging. Therefore, it is necessary to standardize the name text.
  • a name in the Malay system is composed of title + duplicate name + first name + title + parent title + parent duplicate name + parent first name. The titles, duplicate name, parent title and parent duplicate name are all variable parts and may change over time. Normalizing a Malay name therefore means removing its variable parts by technical means, leaving only the fixed, immutable parts.
  • Step S102 Obtain the Malay name text.
  • Step S104 Name text preprocessing.
  • the preprocessing process includes cleaning special characters such as ")", "(" and "."; in addition, meaningless special words such as "unknown", "B/O" and "Baby of" also need to be removed.
  • Step S106 Name text segmentation.
  • Words can be split according to spaces.
  • Step S108 Obtain the Malay name and word frequency mapping table.
  • the Malay name word frequency mapping table is built by counting the number of occurrences of each name word in the historical text data and constructing a hash (HASH) mapping from each word to its count, where the count is the total number of occurrences of that name word.
  • the above-mentioned Malay name word frequency mapping table can be used as a basic dictionary for name unification.
  • a HASH map, also known as a hash map or HashMap, is a collection used to store key-value pairs. Each key-value pair is also called an Entry, and these entries are stored in an array, which forms the backbone of the HashMap.
  • Step S110 Perform name and word frequency mapping according to the Malay name and word frequency mapping table.
  • the word frequency data of a single name is obtained according to the Malay name word frequency mapping table.
  • Step S112. Build a minimum heap.
  • the min heap refers to a sorted complete binary tree, in which the data value of any non-terminal node is not greater than the value of its left and right child nodes.
  • a min-heap is usually used to find N minimum values.
  • Step S114 Take the 2 words with the smallest number of mappings.
  • Step S116 Synthesize normalized name text.
  • word merging is performed to obtain the normalized name text.
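  • The related-art flow in steps S108 to S116 can be sketched as follows. This is a minimal illustration, not the applicant's implementation: the function names, the lowercasing, and the choice to emit the kept words in their original order are all assumptions, and `heapq.nsmallest` stands in for the explicit minimum heap of step S112.

```python
import heapq
from collections import Counter


def build_frequency_table(historical_names):
    """Count how often each space-separated word occurs across historical names."""
    table = Counter()
    for name in historical_names:
        table.update(name.lower().split())
    return table


def normalize_fixed_count(name, table, keep=2):
    """Keep the `keep` rarest words of a name (assumed: output in original order)."""
    words = name.lower().split()
    # nsmallest is heap-based; ties on frequency are broken by word position.
    rarest = heapq.nsmallest(
        keep, enumerate(words), key=lambda iw: (table[iw[1]], iw[0])
    )
    return " ".join(word for _, word in sorted(rarest))
```

  • Because `keep` is fixed, a name whose stable part has three or four words is always truncated to two.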
  • the normalization method in the above related embodiment rests on the assumption that Malay names differ considerably from one another, which is reasonable from a common-sense point of view, but the method has the following problems:
  • a Malay name does not necessarily contain a fixed number of words after normalization. For example, some names normalize to 2 words, while others may normalize to 3 or 4 words. Taking only 2 words, or any fixed number of words, therefore makes the result inflexible and the normalization inaccurate.
  • the present exemplary embodiment first provides a method for standardizing text.
  • the standardization processing method of the above text may include the following steps:
  • Step S210 Obtain the original information text, where the original information text includes the original text to be processed.
  • Step S220 Match the original information text according to the pre-generated information text thesaurus to obtain the target text corresponding to the original text in the original information text.
  • Step S230 Perform word segmentation processing on the target text to obtain each effective text component contained in the target text.
  • Step S240 Obtain a pre-generated text component rule set, and use the valid text components that do not belong to the text component rule set in each valid text component as standard text components.
  • Step S250 Obtain standardized text corresponding to the original text according to the standard text components.
  • in the text standardization processing method of this exemplary embodiment, on the one hand, matching the original information text against a pre-generated information text thesaurus finds the synonyms of the original text in the original information text, so that correctly written, misspelled, abbreviated, reversed and jointly written variants of a text are all mined, improving the overall recall of the text and the accuracy of the standardization processing.
  • text rules can be discovered, manual participation in the processing process can be reduced, and processing efficiency can be improved.
  • the normalization processing method of text in the exemplary embodiment of the present disclosure performs normalization processing on the original text, which can greatly improve the computability and relevance of text data in the multi-source big data scenario. In the process, the efficiency of text data statistics and management can be further improved.
  • step S210 the original information text is obtained, and the original information text includes the original text to be processed.
  • the original information text refers to a complete text including the original text to be processed and some data information corresponding to the original text, where the original text to be processed is the text that needs to be standardized.
  • the original text to be processed may be name text or address text, etc.
  • the original information text may be the complete text of name, date of birth and gender that includes the name text, in which case the data information corresponding to the original text is the date of birth and gender.
  • step S220 the original information text is matched according to the pre-generated information text thesaurus to obtain the target text corresponding to the original text in the original information text.
  • because the data sources of the text are relatively complex and words may suffer from misspelling, abbreviation, joint writing and similar problems, an information text thesaurus (synonym dictionary) needs to be generated in advance to improve the accuracy and recall of text normalization. Text synonyms are found over the full data set, and an original text that has synonyms is converted into the corresponding target text.
  • the target text refers to the unified text into which each group of synonymous texts is converted.
  • the original information text is matched against the pre-generated information text thesaurus: if the thesaurus contains a target information text related to the original information text, the target text contained in that target information text is used as the target text corresponding to the original text; if it does not, the original text itself is used as the target text.
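  • The matching in step S220 amounts to a lookup with a fallback. The sketch below assumes, purely for illustration, that the thesaurus is a dictionary keyed by (name, date of birth, gender) and that each record is a dictionary with `name`, `birth_date` and `gender` fields; the disclosure does not prescribe a storage format.

```python
def match_target_text(record, thesaurus):
    """Return the unified target name for a record, or the original name.

    `thesaurus` (hypothetical layout) maps (name, birth_date, gender)
    tuples to the unified target name of their synonym group.
    """
    key = (record["name"].strip().lower(), record["birth_date"], record["gender"])
    # No related entry in the thesaurus: the original text is the target text.
    return thesaurus.get(key, record["name"])
```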
  • the method for generating the thesaurus of information text may specifically include the following steps:
  • Step S310 Obtain historical information text, historical text contained in the historical information text, and data information corresponding to the historical text.
  • the historical information text containing the historical text is obtained from the historical data, and the data information corresponding to the historical text in the historical information text is obtained.
  • the historical text may be, for example, historical name text, and the data information corresponding to the historical text may be, for example, gender and date of birth data corresponding to the historical name.
  • Step S320 According to the historical text and the data information corresponding to the historical text, classify the historical information text to obtain multiple sets of similar information texts.
  • the historical information text is classified to obtain multiple sets of similar information texts, which may specifically include the following steps:
  • Step S410 Obtain the first classification identifier of the historical information text according to the data information corresponding to the historical text.
  • the first classification identifier refers to the classification identifier used when classifying the historical information text for the first time.
  • the first classification identifier may be generated according to the gender and date of birth data corresponding to the historical name.
  • Step S420 Classify the historical information texts according to the first classification identifiers to obtain a plurality of first classification sets, wherein the first classification identifiers of the historical information texts in each of the first classification sets are the same.
  • Step S430 Obtain the second classification identifier of the historical information text according to the historical text, and classify the historical information text in each first classification set by a preset clustering algorithm according to the second classification identifier, to obtain a plurality of second classification sets.
  • the second classification is performed on the historical information text in each first classification set according to the second classification identifier.
  • the second classification identifier may be generated according to historical text, for example, the second classification identifier may be generated according to historical names.
  • the preset clustering algorithm may be, for example, K-Means (the K-means clustering algorithm). K-Means is one of the most commonly used clustering algorithms; it is simple, easy to understand and fast. The number of clusters must be specified before clustering.
  • the method for classifying the historical information texts in the first classification set by using a preset clustering algorithm may specifically be: determine the number of clusters corresponding to each first classification set according to the total number of historical information texts in that set; then, according to the second classification identifier, divide the historical information texts in each first classification set into a number of second classification sets equal to the number of clusters by the preset clustering algorithm.
  • the method for determining the number of clusters corresponding to each first classification set may be: if the total number of historical information texts in the first classification set is greater than the text quantity threshold, the number of clusters is determined from the total number of historical information texts and a preset ratio; if the total number is less than or equal to the text quantity threshold, a preset number of clusters is used as the number of clusters corresponding to the first classification set.
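  • The cluster-count rule can be sketched as a small function. The defaults below (threshold 2, ratio two-thirds, fallback of 1 cluster) are the values given in the specific embodiment; the strict "greater than" comparison follows that embodiment, and rounding down is an assumption, since the disclosure does not state how a fractional result is handled.

```python
import math
from fractions import Fraction


def cluster_count(total, threshold=2, ratio=Fraction(2, 3), default=1):
    """Number of clusters for a first classification set with `total` texts."""
    if total > threshold:
        # Assumed rounding: floor, with at least one cluster.
        return max(1, math.floor(total * ratio))
    return default
```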
  • Step S440 Obtain an aggregated identifier according to the first classification identifier and the second classification identifier, and reclassify the historical information texts in each of the second classification sets according to the aggregated identifiers to obtain a plurality of third classification sets.
  • a new aggregate identifier can be generated from the first classification identifier and the second classification identifier; the historical information texts in each second classification set are then classified a third time according to the aggregate identifier, so that historical information texts with the same aggregate identifier are aggregated together.
  • Step S450 For the historical information texts in each third classification set, calculate the cosine similarity between the historical texts they contain, and put the historical information texts whose cosine similarity is greater than the first similarity threshold into the same set of similar information texts.
  • After the second classification of the historical information texts, the historical information texts have been divided into as many categories as possible; cosine similarity is then calculated within each category to obtain each group of synonyms in the thesaurus. For example, if the cosine similarity of two historical texts is greater than 0.97, they are put into the same set of similar information texts.
  • cosine similarity evaluates the similarity of two vectors by calculating the cosine of the angle between them, and can be applied to the calculation of text similarity.
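  • A minimal sketch of step S450 follows. The disclosure does not say how a text is turned into a vector, so character-bigram counts are used here as an assumption; the greedy grouping against the first member of each group is likewise only one possible way to form the similar sets.

```python
import math
from collections import Counter


def bigram_vector(text):
    """Character-bigram counts of a lowercased text (assumed vectorization)."""
    s = text.lower()
    return Counter(s[i:i + 2] for i in range(len(s) - 1))


def cosine_similarity(a, b):
    """Cosine of the angle between the bigram vectors of two texts."""
    va, vb = bigram_vector(a), bigram_vector(b)
    dot = sum(va[g] * vb[g] for g in va.keys() & vb.keys())
    norm = math.sqrt(sum(c * c for c in va.values())) * math.sqrt(
        sum(c * c for c in vb.values())
    )
    return dot / norm if norm else 0.0


def group_similar(texts, threshold=0.97):
    """Put each text into the first group whose representative it resembles."""
    groups = []
    for text in texts:
        for group in groups:
            if cosine_similarity(text, group[0]) > threshold:
                group.append(text)
                break
        else:
            groups.append([text])
    return groups
```

  • With bigram vectors, a single dropped letter already lowers the score noticeably, so the 0.97 threshold from the text would need tuning for whichever vectorization is actually used.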
  • Step S330 Generate an information text thesaurus according to multiple sets of similar information text sets.
  • an information text thesaurus is generated according to multiple sets of similar information texts, which is used for the conversion of historical information text synonyms.
  • the text data can be marked first by deep learning, and then the corresponding synonymous text can be calculated by using the deep learning related algorithm, so as to achieve the same conversion effect.
  • step S230 word segmentation processing is performed on the target text to obtain each valid text component contained in the target text.
  • before word segmentation processing is performed on the target text, the target text may first be preprocessed. Specifically: invalid text components in the target text are filtered out, and word segmentation processing is performed on the filtered target text to obtain each valid text component contained in the target text.
  • the preprocessing process can include clearing special characters such as ")", "(" and "."; in addition, meaningless special words such as "unknown", "B/O" and "Baby of" also need to be cleared.
  • step S240 a pre-generated text component rule set is acquired, and a valid text component of each valid text component that does not belong to the text component rule set is used as a standard text component.
  • the text components that do not need to be normalized in the effective text components can be deleted, and only a part of the effective text components required for normalization are left.
  • for example, using the pre-generated text component rule set, the variable words in Malay names can be deleted so that only the fixed words remain; these are the words used for the final normalization, i.e. the standard text components.
  • the generation method of the text component rule set can specifically include the following steps:
  • Step S510 Obtain the historical text contained in the historical information text.
  • the historical text contained in the historical information text, such as the historical name text, is obtained.
  • Step S520 Perform word segmentation on the historical text to obtain each effective historical text component contained in the historical text.
  • Before word segmentation processing is performed on the historical text, the historical text can also be preprocessed to remove special characters and meaningless special words, so as to obtain each effective historical text component contained in the historical text.
  • Step S530 Calculate the cosine similarity between the valid historical text components and the text components in the text component rule set.
  • Step S540 If the cosine similarity between the valid historical text components and the text components in the text component rule set is greater than the second similarity threshold, add the valid historical text components to the text component rule set.
  • the valid historical text component can be marked and added to the text component rule set.
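  • Steps S510 to S540 can be sketched as below. As before, character-bigram cosine similarity is an assumed vectorization, the function names are hypothetical, and the manual confirmation mentioned in the specific embodiment is omitted. The default threshold of 0.95 is the value given for FIG. 8.

```python
import math
from collections import Counter


def _cosine(a, b):
    """Cosine similarity of character-bigram count vectors (assumed)."""
    va = Counter(a[i:i + 2] for i in range(len(a) - 1))
    vb = Counter(b[i:i + 2] for i in range(len(b) - 1))
    dot = sum(va[g] * vb[g] for g in va.keys() & vb.keys())
    norm = math.sqrt(sum(c * c for c in va.values()) * sum(c * c for c in vb.values()))
    return dot / norm if norm else 0.0


def extend_rule_set(historical_names, rule_set, threshold=0.95):
    """Add words that closely resemble an existing rule-set entry."""
    extended = set(rule_set)
    for name in historical_names:
        for word in name.lower().split():
            # Compare against the seed rules only, so additions do not cascade.
            if any(_cosine(word, rule) > threshold for rule in rule_set):
                extended.add(word)
    return extended
```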
  • step S250 the standardized text corresponding to the original text is obtained according to the standard text components.
  • any number of words can be adaptively normalized to represent the core normalized part of the original text, instead of artificially specifying the number of words.
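  • Steps S230 to S250 can be sketched as follows. The special characters and meaningless words are the ones listed in the text; the lowercasing, the whitespace tokenizer, and the contents of the rule set used in the usage note are illustrative assumptions.

```python
import re

# Meaningless special words named in the text, matched as whole tokens.
_INVALID = re.compile(r"(?<!\w)(?:unknown|b/o|baby of)(?!\w)")
# Special characters named in the text.
_SPECIAL = re.compile(r"[().]")


def valid_components(target_text):
    """Filter invalid components, then split the text on whitespace."""
    text = _INVALID.sub(" ", target_text.lower())
    return _SPECIAL.sub(" ", text).split()


def standardize(target_text, rule_set):
    """Drop components found in the rule set and merge the rest in order."""
    kept = [w for w in valid_components(target_text) if w not in rule_set]
    return " ".join(kept)
```

  • With a hypothetical rule set `{"haji", "bin"}`, `standardize("Haji Ahmad bin Ali (B/O Unknown).", {"haji", "bin"})` keeps two words, while a longer name keeps as many words as survive the filter, so the output length is not fixed in advance.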
  • FIG. 6 is a complete flow chart of text normalization processing in a specific embodiment of the present disclosure, which can be applied to the normalization of Malay name texts, and is an example of the above steps in this exemplary embodiment.
  • the specific steps of the flowchart are as follows:
  • Step S602. Obtain the Malay name, date of birth and gender text.
  • Step S604. Determine whether the name, date of birth and gender text is in the name, date of birth and gender thesaurus.
  • the name, date of birth and gender text is compared with the name, date of birth and gender thesaurus. If the text is in the thesaurus, go to step S606 and use the synonym; if not, go to step S608 and use the original name word.
  • Step S606. Convert the name text into synonymous name text.
  • Step S608. Name text preprocessing.
  • the preprocessing process can include clearing special characters such as ")", "(" and "."; in addition, meaningless special words such as "unknown", "B/O" and "Baby of" also need to be cleared.
  • Step S610 Name text segmentation.
  • Step S612. Obtain a name word list.
  • Step S614. Obtain a name rule set.
  • Step S616 Match the name word list with the name rule set.
  • Step S618 Determine whether the name word is in the name rule set.
  • Step S620 Obtain the reserved name word list.
  • Step S622. Obtain the normalized name text.
  • the words in the final reserved name word list are merged in order to obtain the normalized name text.
  • FIG. 7 is a complete flowchart of generating an information text thesaurus according to an embodiment of the present disclosure, and the information text thesaurus is the name, date of birth, and gender thesaurus in the above step S604.
  • the specific steps of the flow chart are as follows:
  • Step S702. Acquire full data.
  • Step S704. Generate a first category ID according to the date of birth and gender.
  • an ID is generated according to the gender and date of birth, that is, the first category ID.
  • Step S706 Aggregate the data according to the first classification ID.
  • the data is aggregated according to the first classification ID, that is, the same IDs are aggregated together.
  • Step S708. Classify the aggregated data according to the second classification ID.
  • the second category ID is generated according to the name.
  • the data aggregated by the first classification ID is then classified by name to obtain the second classification ID.
  • the classification algorithm used is the Kmeans algorithm.
  • the strategy for generating classification clusters is: when the number of names in the list is greater than 2, two-thirds of that number is used as the number of classification clusters; when the number of names is less than or equal to 2, the number of classification clusters is set to 1.
  • the purpose of this strategy is mainly to divide the data into multiple classes as much as possible, in order to reduce the number of computations as much as possible and improve the computational efficiency in the subsequent calculation of similarity.
  • Step S710 Generate an aggregate ID according to the first category ID and the second category ID.
  • a new aggregate ID, i.e. NID, is generated according to the first category ID and the second category ID.
  • Step S712. Aggregate the data according to the aggregation ID.
  • Step S714. Calculate the aggregated data according to the similarity of names.
  • Step S716 Determine whether the name similarity is greater than 0.97.
  • Step S718 Manually confirm similar data.
  • Step S720 Generate a thesaurus of name, date of birth and gender.
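  • The ID generation and aggregation in steps S704, S706, S710 and S712 can be sketched as below. The string formats of the IDs, the record field names and the helper names are hypothetical; the second category ID is taken as an input here because it comes from the clustering step.

```python
from collections import defaultdict


def first_category_id(record):
    """First category ID built from date of birth and gender (assumed format)."""
    return f"{record['birth_date']}|{record['gender']}"


def aggregate_id(first_id, second_id):
    """Aggregate ID (NID) combining both category IDs (assumed format)."""
    return f"{first_id}|{second_id}"


def aggregate(records, key_fn):
    """Group records so that records with the same ID end up together."""
    groups = defaultdict(list)
    for record in records:
        groups[key_fn(record)].append(record)
    return dict(groups)
```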
  • FIG. 8 is a complete flowchart of generating a text component rule set in an embodiment of the present disclosure, where the text component rule set is the name rule set in the above step S616. The specific steps of the flow chart are as follows:
  • Step S802. Obtain the Malay name text.
  • Step S804. Name text preprocessing.
  • the preprocessing of Malay names can include removing special characters such as ")", "(" and "."; in addition, meaningless special words such as "unknown", "B/O" and "Baby of" also need to be removed.
  • Step S806 Name text segmentation.
  • Step S808 Obtain a name rule set.
  • Step S810 Compare the similarity between the name text word segmentation and the name rule set.
  • Step S812. Determine whether the similarity is greater than 0.95.
  • If the similarity is greater than 0.95, the word is considered a possible rule set entry, and the process goes to step S814.
  • Step S816 Determine whether the requirements are met, and if so, add the word segmentation of the name text to the name rule set.
  • the present disclosure also provides a text standardization processing device.
  • the text standardization processing apparatus may include an original information text acquisition module 910, an original information text matching module 920, a valid text component acquisition module 930, a standard text component determination module 940 and a normalized text generation module 950, wherein:
  • the original information text acquisition module 910 is configured to obtain the original information text, the original information text including the original text to be processed;
  • the original information text matching module 920 is configured to perform matching on the original information text according to the pre-generated information text thesaurus to obtain the target text corresponding to the original text in the original information text;
  • the effective text component acquisition module 930 is configured to perform word segmentation processing on the target text to obtain each effective text component contained in the target text;
  • the standard text component determination module 940 is configured to acquire a pre-generated text component rule set, and to use the valid text components that do not belong to the text component rule set among the valid text components as standard text components;
  • the normalized text generation module 950 is configured to obtain normalized text corresponding to the original text according to the standard text components.
  • the original information text matching module 920 may include a first target text determination unit and a second target text determination unit, wherein:
  • the first target text determination unit is configured to, if there is a target information text related to the original information text in the information text thesaurus, use the target text contained in the target information text as the target text corresponding to the original text;
  • the second target text determination unit is configured to, if there is no target information text related to the original information text in the information text thesaurus, use the original text as the target text.
  • the valid text component acquisition module 930 may include an invalid component filtering unit and a target text word segmentation unit, wherein:
  • the invalid component filtering unit is configured to filter out invalid text components in the target text;
  • the target text word segmentation unit is configured to perform word segmentation processing on the filtered target text to obtain each valid text component contained in the target text.
  • the apparatus for standardizing text provided by the present disclosure may further include an information text thesaurus generating module, wherein:
  • the information text thesaurus generating module may include a historical information text acquisition unit, a historical information text classification unit, and a thesaurus generating unit.
  • the historical information text acquisition unit is configured to perform acquisition of historical information text, historical text contained in the historical information text, and data information corresponding to the historical text;
  • the historical information text classification unit is configured to perform classification of the historical information text according to the historical text and the data information corresponding to the historical text to obtain multiple sets of similar information texts;
  • the thesaurus generating unit is configured to generate the information text thesaurus from the multiple sets of similar information texts.
  • the historical information text classification unit may include a first classification identifier determination unit, a first classification set determination unit, a second classification set determination unit, a third classification set determination unit, and a cosine similarity calculation unit, wherein:
  • the first classification identifier determination unit is configured to obtain the first classification identifier of the historical information text according to the data information corresponding to the historical text;
  • the first classification set determination unit is configured to classify the historical information texts according to the first classification identifiers to obtain a plurality of first classification sets, wherein the first classification identifiers of the historical information texts in each first classification set are the same;
  • the second classification set determination unit is configured to obtain a second classification identifier of the historical information text according to the historical text, and to reclassify the historical information texts in each first classification set according to the second classification identifier by using a preset clustering algorithm, to obtain a plurality of second classification sets;
  • the third classification set determination unit is configured to obtain an aggregated identifier according to the first classification identifier and the second classification identifier, and to reclassify the historical information texts in each second classification set according to the aggregated identifier to obtain a plurality of third classification sets;
  • the cosine similarity calculation unit is configured to calculate, for the historical information texts in each third classification set, the cosine similarity between the historical texts they contain, and to put historical information texts whose cosine similarity is greater than a first similarity threshold into the same set of similar information texts.
  • the second classification set determination unit may include a cluster number determination unit and an information text division unit, wherein:
  • the cluster number determination unit is configured to determine the number of clusters corresponding to each first classification set according to the total number of historical information texts in each first classification set;
  • the information text division unit is configured to divide the historical information texts in each first classification set into a plurality of second classification sets corresponding to the number of clusters by using a preset clustering algorithm according to the second classification identifier.
  • the cluster number determination unit may include a first cluster number determination unit and a second cluster number determination unit, wherein:
  • the first cluster number determination unit is configured to, if the total number of historical information texts in the first classification set is greater than or equal to a text quantity threshold, determine the number of clusters corresponding to the first classification set according to the total number of historical information texts and a preset ratio;
  • the second cluster number determination unit is configured to obtain a preset number of clusters as the number of clusters corresponding to the first classification set if the total number of historical information texts in the first classification set is less than or equal to the text quantity threshold.
  • the apparatus for standardizing text provided by the present disclosure may further include a text component rule set generating module, wherein:
  • the text component rule set generation module may include a historical text acquisition unit, a valid text component acquisition unit, a cosine similarity calculation unit, and a rule set generation unit.
  • the historical text acquisition unit is configured to acquire the historical text contained in the historical information text;
  • the effective text component acquisition unit is configured to perform word segmentation processing on the historical text to obtain each effective historical text component contained in the historical text;
  • the cosine similarity calculation unit is configured to perform cosine similarity calculation between the valid historical text components and the text components in the text component rule set;
  • the rule set generation unit is configured to perform adding the valid historical text components to the text component rule set if the cosine similarity between the valid historical text components and the text components in the text component rule set is greater than a second similarity threshold.
  • FIG. 10 shows a schematic structural diagram of a computer system suitable for implementing an electronic device according to an embodiment of the present invention.
  • a computer system 1000 includes a central processing unit (CPU) 1001, which can perform various appropriate actions and processes according to a program stored in a read-only memory (ROM) 1002 or a program loaded from a storage section 1008 into a random access memory (RAM) 1003.
  • In the RAM 1003, various programs and data required for system operation are also stored.
  • the CPU 1001, the ROM 1002, and the RAM 1003 are connected to each other through a bus 1004.
  • An input/output (I/O) interface 1005 is also connected to the bus 1004.
  • the following components are connected to the I/O interface 1005: an input section 1006 including a keyboard, a mouse, etc.; an output section 1007 including a cathode ray tube (CRT), a liquid crystal display (LCD), etc., and a speaker, etc.; a storage section 1008 including a hard disk, etc.; and a communication section 1009 including a network interface card such as a LAN card, a modem, and the like. The communication section 1009 performs communication processing via a network such as the Internet.
  • a drive 1010 is also connected to the I/O interface 1005 as needed.
  • a removable medium 1011 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, etc., is mounted on the drive 1010 as needed so that a computer program read therefrom is installed into the storage section 1008 as needed.
  • embodiments of the present invention include a computer program product comprising a computer program carried on a computer-readable medium, the computer program containing program code for performing the method illustrated in the flowchart.
  • the computer program may be downloaded and installed from the network via the communication portion 1009, and/or installed from the removable medium 1011.
  • the computer-readable medium shown in the present disclosure may be a computer-readable signal medium or a computer-readable storage medium, or any combination of the above two.
  • the computer-readable storage medium can be, for example, but is not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus or device, or a combination of any of the above. More specific examples of computer-readable storage media may include, but are not limited to, electrical connections with one or more wires, portable computer disks, hard disks, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fibers, portable compact disc read-only memory (CD-ROM), optical storage devices, magnetic storage devices, or any suitable combination of the foregoing.
  • a computer-readable storage medium may be any tangible medium that contains or stores a program that can be used by or in conjunction with an instruction execution system, apparatus, or device.
  • a computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, carrying computer-readable program code therein. Such propagated data signals may take a variety of forms, including but not limited to electromagnetic signals, optical signals, or any suitable combination of the foregoing.
  • a computer-readable signal medium can also be any computer-readable medium other than a computer-readable storage medium that can transmit, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device.
  • Program code embodied on a computer readable medium may be transmitted using any suitable medium including, but not limited to, wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
  • each block in the flowchart or block diagrams may represent a module, segment, or portion of code that contains one or more executable instructions for implementing the specified logical functions.
  • the functions noted in the blocks may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved.
  • the present disclosure also provides a computer-readable medium.
  • the computer-readable medium may be included in the electronic device described in the above embodiments; it may also exist alone without being assembled into the electronic device.
  • the above-mentioned computer-readable medium carries one or more programs, and when the above-mentioned one or more programs are executed by an electronic device, causes the electronic device to implement the methods described in the above-mentioned embodiments.

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Theoretical Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Public Health (AREA)
  • Computational Linguistics (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Artificial Intelligence (AREA)
  • Medical Informatics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Biomedical Technology (AREA)
  • Pathology (AREA)
  • Epidemiology (AREA)
  • Primary Health Care (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Machine Translation (AREA)

Abstract

A standardization processing method and apparatus for text, and an electronic device and a computer-readable medium, which belong to the technical field of data processing. The method comprises: acquiring original information text, wherein the original information text comprises original text to be processed (S210); performing matching on the original information text according to a pre-generated information text synonym dictionary, so as to obtain target text corresponding to the original text in the original information text (S220); performing word segmentation processing on the target text to obtain effective text components included in the target text (S230); acquiring a pre-generated text component rule set, and taking an effective text component, which does not belong to the text component rule set, from among the effective text components as a standard text component (S240); and obtaining, according to the standard text component, standardized text corresponding to the original text (S250). By means of an information text synonym dictionary and a text component rule set, normalization processing is performed on original text to obtain standardized text, such that the efficiency and accuracy of text normalization can be improved.

Description

Standardization processing method and apparatus for text, electronic device, and computer medium
Cross-Reference to Related Applications
The present disclosure claims priority to Chinese patent application No. 202011594885.4, entitled "Standardization processing method and apparatus for text, electronic device, and computer medium" and filed on December 29, 2020, the entire contents of which are incorporated herein by reference.
Technical Field
The present disclosure relates to the technical field of data processing, and in particular to a method for standardizing text, an apparatus for standardizing text, an electronic device, and a computer-readable medium.
Background
Text such as names or addresses in foreign languages can be written in many ways, and a unified standard is hard to establish. As a result, normalization results are often inaccurate, and in many cases manual identification and processing are required, which is inefficient.
In view of this, there is an urgent need in the art for a text standardization processing method that can improve the efficiency and accuracy of text normalization.
It should be noted that the information disclosed in the Background section above is only intended to enhance understanding of the background of the present disclosure, and may therefore include information that does not constitute prior art already known to a person of ordinary skill in the art.
Summary
The purpose of the present disclosure is to provide a method for standardizing text, an apparatus for standardizing text, an electronic device, and a computer-readable medium, so as to improve the efficiency and accuracy of text normalization at least to a certain extent.
According to a first aspect of the present disclosure, a method for standardizing text is provided, comprising:
obtaining original information text, the original information text including original text to be processed;
matching the original information text according to a pre-generated information text thesaurus to obtain target text corresponding to the original text in the original information text;
performing word segmentation processing on the target text to obtain each valid text component contained in the target text;
obtaining a pre-generated text component rule set, and using, among the valid text components, the valid text components that do not belong to the text component rule set as standard text components;
obtaining standardized text corresponding to the original text according to the standard text components.
According to a second aspect of the present disclosure, an apparatus for standardizing text is provided, comprising:
an original information text acquisition module, configured to obtain original information text, the original information text including original text to be processed;
an original information text matching module, configured to match the original information text according to a pre-generated information text thesaurus to obtain target text corresponding to the original text in the original information text;
a valid text component acquisition module, configured to perform word segmentation processing on the target text to obtain each valid text component contained in the target text;
a standard text component determination module, configured to obtain a pre-generated text component rule set, and to use, among the valid text components, the valid text components that do not belong to the text component rule set as standard text components;
a standardized text generation module, configured to obtain standardized text corresponding to the original text according to the standard text components.
According to a third aspect of the present disclosure, an electronic device is provided, comprising: a processor; and a memory for storing executable instructions of the processor; wherein the processor is configured to perform, by executing the executable instructions, the method for standardizing text described in any one of the above.
According to a fourth aspect of the present disclosure, a computer-readable medium is provided, on which a computer program is stored; when the computer program is executed by a processor, the method for standardizing text described in any one of the above is implemented.
It is to be understood that the foregoing general description and the following detailed description are exemplary and explanatory only and do not limit the present disclosure.
Brief Description of the Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and, together with the description, serve to explain the principles of the present disclosure. Obviously, the drawings described below show only some embodiments of the present disclosure, and a person of ordinary skill in the art can obtain other drawings from them without creative effort.
FIG. 1 shows a schematic flowchart of Malay name normalization according to a related embodiment of the present disclosure;
FIG. 2 shows a schematic flowchart of a method for standardizing text according to an exemplary embodiment of the present disclosure;
FIG. 3 shows a schematic flowchart of a method for generating an information text thesaurus according to an exemplary embodiment of the present disclosure;
FIG. 4 shows a schematic flowchart of obtaining multiple sets of similar information texts according to an exemplary embodiment of the present disclosure;
FIG. 5 shows a schematic flowchart of a method for generating a text component rule set according to an exemplary embodiment of the present disclosure;
FIG. 6 shows a schematic flowchart of a method for standardizing text according to a specific embodiment of the present disclosure;
FIG. 7 shows a schematic flowchart of a method for generating an information text thesaurus according to a specific embodiment of the present disclosure;
FIG. 8 shows a schematic flowchart of a method for generating a text component rule set according to a specific embodiment of the present disclosure;
FIG. 9 shows a block diagram of an apparatus for standardizing text according to an exemplary embodiment of the present disclosure;
FIG. 10 shows a schematic structural diagram of a computer system suitable for implementing an electronic device according to an embodiment of the present disclosure.
Detailed Description
Example embodiments will now be described more fully with reference to the accompanying drawings. Example embodiments can, however, be implemented in various forms and should not be construed as limited to the examples set forth herein; rather, these embodiments are provided so that the present disclosure will be thorough and complete, and will fully convey the concept of the example embodiments to those skilled in the art. The described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. In the following description, numerous specific details are provided to give a thorough understanding of the embodiments of the present disclosure. However, those skilled in the art will appreciate that the technical solutions of the present disclosure may be practiced without one or more of the specific details, or that other methods, components, devices, steps, etc. may be employed. In other instances, well-known technical solutions are not shown or described in detail, to avoid obscuring aspects of the present disclosure.
Furthermore, the drawings are merely schematic illustrations of the present disclosure and are not necessarily drawn to scale. The same reference numerals in the drawings denote the same or similar parts, and their repeated description is therefore omitted. Some of the block diagrams shown in the drawings are functional entities that do not necessarily correspond to physically or logically separate entities. These functional entities may be implemented in software, in one or more hardware modules or integrated circuits, or in different networks and/or processor devices and/or microcontroller devices.
Normalization refers to a way of standardizing data: after different data are processed by normalization (some algorithm), they conform to the same standard. Data normalization, also called data standardization, is one direction of NLP (Natural Language Processing) technology and refers to the process of standardizing data by NLP techniques.
When a unique identifier for a person is computed across multi-source data, the most typical approach is to compute it from the person's name, date, and gender. The Malay naming system is a useful example because its names are rather special: their most notable feature is that a person's name keeps changing with age. For example, a marker of adulthood is added to a Malay name once the person comes of age; a title marker is added when the person obtains a certain social title; and after a pilgrimage to a religious holy place, a marker of the pilgrimage site is added to the name. Such name changes make computing a unique identifier for a person very challenging. The name text therefore needs to be standardized.
A name in the Malay system consists of the following parts: title + duplicate name + first name + naming prefix + father's title + father's duplicate name + father's first name. Of these, the title, duplicate name, naming prefix, father's title, and father's duplicate name are variable parts that may change over time. Normalizing a Malay name therefore means removing the variable parts of the name by technical means and keeping only the fixed, invariant parts.
In some related embodiments, taking the normalization of Malay name text as an example, normalization can be carried out according to the complete Malay name normalization flowchart shown in FIG. 1. The specific steps of the flowchart are as follows:
Step S102. Obtain the Malay name text.
Step S104. Preprocess the name text.
Before a Malay name is normalized, the name text first needs to be preprocessed. Preprocessing includes cleaning out special characters, such as ")", "(", and "."; in addition, meaningless special words such as "unknown", "B/O", and "Baby of" need to be removed.
Step S106. Segment the name text into words.
The words can be split on spaces.
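Steps S104 and S106 can be sketched roughly as follows. This is a minimal illustration rather than the implementation of the related embodiment: the noise lists contain only the examples quoted above, and the function names, the lowercasing, and the word-boundary matching are assumptions made for the sketch.

```python
import re
from typing import List

# Hypothetical noise lists, limited to the examples quoted in the text.
SPECIAL_CHARS = "()." + "\u3002"                # ")", "(", "." and the CJK full stop
NOISE_PHRASES = ["baby of", "b/o", "unknown"]   # longer phrases are removed first


def preprocess(name: str) -> str:
    """Step S104 (sketch): drop noise phrases and special characters."""
    text = name.lower()  # assumption: matching is case-insensitive
    for phrase in NOISE_PHRASES:
        # Remove the phrase only when it is not embedded in a longer word.
        pattern = r"(?<!\w)" + re.escape(phrase) + r"(?!\w)"
        text = re.sub(pattern, " ", text)
    for ch in SPECIAL_CHARS:
        text = text.replace(ch, " ")
    # Collapse the runs of whitespace left behind by the removals.
    return " ".join(text.split())


def tokenize(name: str) -> List[str]:
    """Step S106 (sketch): split the cleaned name on spaces."""
    return preprocess(name).split()
```

Noise phrases are removed before special characters so that "B/O" is still intact when it is matched.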
Step S108. Obtain the Malay name word count mapping table.
The Malay name word count mapping table is built by counting how many times each word appears in the Malay names in historical text data and constructing a hash mapping from words to counts according to the statistics, where a count is the total number of times the word occurs across all names. This mapping table can serve as the basic dictionary for name normalization.
A hash mapping, also called a hash map, hash table, or HashMap, is a collection for storing key-value pairs. Each key-value pair is also called an Entry, and the entries are stored in an array; this array is the HashMap.
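The word count mapping table of step S108 could be built along the following lines. The sketch uses Python's `collections.Counter` (a dictionary subclass, that is, a hash map); the function name and the lowercasing are assumptions, and the sample names in the usage note are invented for illustration.

```python
from collections import Counter
from typing import Iterable


def build_word_counts(historical_names: Iterable[str]) -> Counter:
    """Count how often each word occurs across all historical names."""
    counts: Counter = Counter()
    for name in historical_names:
        # Assumption: words are separated by spaces and compared case-insensitively.
        counts.update(name.lower().split())
    return counts
```

For the two hypothetical names "Siti Aminah binti Ahmad" and "Siti Nur binti Ali", the table would record a count of 2 for "siti" and "binti" and a count of 1 for each remaining word; a word that never occurred has a count of 0.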
Step S110. Map the name words to counts according to the Malay name word count mapping table.
For the words obtained by segmenting the name text, the word count data of the individual name is obtained from the Malay name word count mapping table.
Step S112. Build a min-heap.
A min-heap is built from the word count data. A min-heap is a complete binary tree in which the value of any non-leaf node is not greater than the values of its left and right child nodes. A min-heap is typically used to find the N smallest values.
Step S114. Take the 2 words with the smallest mapped counts.
Step S116. Compose the normalized name text.
Finally, the words are merged to obtain the normalized name text.
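Steps S110 to S116 can be sketched with Python's `heapq` module, which implements a min-heap on a list. The function name is hypothetical, and two details are assumptions not stated in the text: words missing from the table are given a count of 0, and the selected words are merged in their original order.

```python
import heapq
from typing import List, Mapping


def normalize_by_rarity(words: List[str],
                        word_counts: Mapping[str, int],
                        keep: int = 2) -> str:
    """Keep the `keep` least frequent words of a name (related-embodiment sketch)."""
    # S110: map each word to (count, position, word); the position breaks ties.
    entries = [(word_counts.get(w, 0), i, w) for i, w in enumerate(words)]
    # S112: turn the list into a min-heap ordered by count.
    heapq.heapify(entries)
    # S114: pop the entries with the smallest counts.
    rarest = [heapq.heappop(entries) for _ in range(min(keep, len(entries)))]
    # S116: merge the words, restoring their original order (assumption).
    rarest.sort(key=lambda entry: entry[1])
    return " ".join(word for _, _, word in rarest)
```

With hypothetical counts such as `{"haji": 900, "siti": 800, "aminah": 40, "binti": 1000, "rahmat": 25}`, the words `["haji", "siti", "aminah", "binti", "rahmat"]` reduce to "aminah rahmat". The fixed `keep` value is exactly the limitation discussed for this related embodiment.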
Normalizing a Malay name yields its core invariant part, so that in multi-source big data a person can be uniquely identified by the normalized name, date of birth, and gender.
The normalization method in the related embodiment above rests on the assumption that Malay names differ considerably from one another, which is also what common sense suggests. However, the method has the following problems:
On the one hand, a normalized Malay name does not necessarily contain a fixed number of words: some names normalize to 2 words, some to 3 words, and others to 4 words. Taking only 2 words, or any fixed number of words, as the normalized result is therefore inflexible and can produce inaccurate results.
On the other hand, correcting some name normalization errors may require adjusting the priority of certain words. If word priorities are adjusted manually, the subsequent merging of manually prioritized words with automatically constructed words is affected; moreover, the adjusted priorities may cause errors when other names are normalized. The applicability and generalization of the above scheme are therefore insufficient.
In view of the above problems, this exemplary embodiment first provides a method for standardizing text. Referring to FIG. 2, the method for standardizing text may include the following steps:
Step S210. Obtain original information text, the original information text including original text to be processed.
Step S220. Match the original information text according to a pre-generated information text thesaurus to obtain target text corresponding to the original text in the original information text.
Step S230. Perform word segmentation processing on the target text to obtain each valid text component contained in the target text.
Step S240. Obtain a pre-generated text component rule set, and use, among the valid text components, the valid text components that do not belong to the text component rule set as standard text components.
Step S250. Obtain standardized text corresponding to the original text according to the standard text components.
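Steps S210 to S250 can be sketched end to end as follows. This is a simplified illustration under several assumptions: the original information text is modeled as a (name, birth date, gender) tuple, the thesaurus as a dictionary from such tuples to target text with exact-match lookup standing in for the matching of step S220, and the rule set as a plain set of variable components. All names and sample data are hypothetical.

```python
from typing import Dict, Set, Tuple

InfoText = Tuple[str, str, str]  # (original text, birth date, gender): an assumed layout


def standardize(original_info: InfoText,
                thesaurus: Dict[InfoText, str],
                rule_set: Set[str]) -> str:
    # S210: the original text to be processed is part of the information text.
    original_text = original_info[0]
    # S220: use the target text from the thesaurus if there is a match,
    # otherwise fall back to the original text.
    target_text = thesaurus.get(original_info, original_text)
    # S230: word segmentation (assumption: split on spaces).
    components = target_text.split()
    # S240: components that are NOT in the rule set are the standard components.
    standard = [c for c in components if c not in rule_set]
    # S250: join the standard components into the standardized text.
    return " ".join(standard)
```

For example, with `thesaurus = {("Hj Siti Aminah", "1980-01-01", "F"): "Haji Siti Aminah"}` and `rule_set = {"Haji", "bin"}`, the information text `("Hj Siti Aminah", "1980-01-01", "F")` standardizes to "Siti Aminah". Unlike the fixed two-word selection of the related embodiment, the number of words kept here depends only on which components fall outside the rule set.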
本公开示例实施方式的文本的标准化处理方法中,一方面,通过预先生成的信息文本同义词典对原始信息文本进行匹配,可以对原始信息文本中的原始文本的同义词进行发现,从而能够对错写、缩写、反写、联写的文本进行挖掘,提升文本的整体召回率与标准化处理的准确率。另一方面,通过预先生成的文本成分规则集合对原始文本中的各个有效文本成分进行匹配,可以对文本规则进行发现,减少处理过程中的人工参与,进而提高处理效率。最后,本公开示例实施方式中的文本的标准化处理方法对原始文本进行归一化处理,能够大幅度地提升在多源大数据场景下文本数据的可计算性与关联性,在后续文本的使用过程中,可以进一步提高文本数据统计和管理的效率。In the text standardization processing method according to the exemplary embodiment of the present disclosure, on the one hand, by matching the original information text through a pre-generated information text thesaurus, the synonyms of the original text in the original information text can be found, so that the correct and wrong Written, abbreviated, reversed, and co-written texts are mined to improve the overall recall rate of the text and the accuracy of standardized processing. On the other hand, by matching each valid text component in the original text through a pre-generated text component rule set, text rules can be discovered, manual participation in the processing process can be reduced, and processing efficiency can be improved. Finally, the normalization processing method of text in the exemplary embodiment of the present disclosure performs normalization processing on the original text, which can greatly improve the computability and relevance of text data in the multi-source big data scenario. In the process, the efficiency of text data statistics and management can be further improved.
The above steps of this exemplary embodiment are described in more detail below with reference to FIGS. 3 to 5.
In step S210, the original information text is obtained; the original information text includes the original text to be processed.
In this exemplary embodiment, the original information text is the complete text that contains the original text to be processed together with data information corresponding to that original text, where the original text to be processed is the text that needs to be standardized. For example, the original text to be processed may be a name text or an address text. Taking a name text as an example, the original information text may be the complete text containing the name, date of birth, and gender, in which case the data information corresponding to the original text is the date of birth and the gender.
In step S220, the original information text is matched against the pre-generated information-text thesaurus to obtain the target text corresponding to the original text contained in the original information text.
In this exemplary embodiment, because text data comes from complex sources and words may be misspelled, abbreviated, or written together, an information-text thesaurus is generated in advance in order to improve the accuracy and recall of text normalization: synonym discovery is run over the full data set, and each original text that has synonyms is converted to the corresponding target text. Here, the target text is the single unified text to which every member of a group of synonymous texts is converted.
In this exemplary embodiment, when the original information text is matched against the pre-generated information-text thesaurus, if the thesaurus contains a target information text related to the original information text, the target text contained in that target information text is used as the target text corresponding to the original text; if the thesaurus contains no such target information text, the original text itself is used as the target text.
For example, the name text is compared against a pre-generated name/gender/date-of-birth thesaurus. If the thesaurus contains a target name text synonymous with the name text, the name text is converted to that target name text; otherwise, the original name text is used directly in the subsequent steps.
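The lookup-with-fallback behavior of step S220 can be sketched as a dictionary lookup. This is only an illustrative sketch: the key layout (name, date of birth, gender), the function name `resolve_target_text`, and the sample entry are assumptions and do not come from the disclosure.

```python
from typing import Dict, Tuple

# Hypothetical key layout: (name, date of birth, gender) -> unified target name.
InfoKey = Tuple[str, str, str]


def resolve_target_text(
    name: str, birth_date: str, gender: str, thesaurus: Dict[InfoKey, str]
) -> str:
    """Return the target name if the record is in the thesaurus, else the original name."""
    return thesaurus.get((name, birth_date, gender), name)


# Illustrative entry only; not taken from the source.
sample_thesaurus: Dict[InfoKey, str] = {
    ("MOHD ALI", "1990-01-01", "M"): "MOHAMMAD ALI",
}

if __name__ == "__main__":
    print(resolve_target_text("MOHD ALI", "1990-01-01", "M", sample_thesaurus))
    print(resolve_target_text("SITI AMINAH", "1985-05-05", "F", sample_thesaurus))
```

Keying on the complete information text rather than on the name alone mirrors the description above, in which the date of birth and gender accompany the name during matching.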
In this exemplary embodiment, as shown in FIG. 3, the method for generating the information-text thesaurus may include the following steps:
Step S310. Obtain historical information texts, the historical text contained in each historical information text, and the data information corresponding to the historical text.
First, historical information texts containing historical texts are obtained from historical data, together with the data information corresponding to the historical text in each historical information text. The historical text may be, for example, a historical name text, and the corresponding data information may be, for example, the gender and date of birth associated with that historical name.
Step S320. Classify the historical information texts according to the historical texts and their corresponding data information to obtain multiple sets of similar information texts.
In this exemplary embodiment, as shown in FIG. 4, classifying the historical information texts according to the historical texts and their corresponding data information to obtain multiple sets of similar information texts may include the following steps:
Step S410. Obtain a first classification identifier for each historical information text from the data information corresponding to its historical text.
The first classification identifier is the identifier used when the historical information texts are classified for the first time. For example, the first classification identifier may be generated from the gender and date of birth associated with the historical name.
Step S420. Classify the historical information texts according to the first classification identifier to obtain a plurality of first classification sets, where all historical information texts in a given first classification set share the same first classification identifier.
The historical information texts are aggregated for the first time according to the first classification identifier, so that historical information texts with the same first classification identifier are grouped together.
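Steps S410 and S420 amount to grouping records by a key derived from gender and date of birth. The following minimal sketch assumes a `Record` tuple and an identifier format of `gender|birth_date`; both are hypothetical choices made for illustration.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple


class Record(NamedTuple):
    name: str
    birth_date: str
    gender: str


def first_class_id(record: Record) -> str:
    # Assumed identifier format: gender and date of birth joined by "|".
    return f"{record.gender}|{record.birth_date}"


def group_by_first_id(records: List[Record]) -> Dict[str, List[Record]]:
    """Group records that share the same first classification identifier."""
    groups: Dict[str, List[Record]] = defaultdict(list)
    for record in records:
        groups[first_class_id(record)].append(record)
    return dict(groups)
```

Only records that land in the same group can later be compared as possible synonyms, so two similar names with different dates of birth or genders are never merged.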
Step S430. Obtain a second classification identifier for each historical information text from its historical text, and, according to the second classification identifier, use a preset clustering algorithm to classify the historical information texts within each first classification set again, obtaining a plurality of second classification sets.
After the first classification, the historical information texts in each first classification set are classified a second time according to the second classification identifier. The second classification identifier may be generated from the historical text, for example from the historical name.
The second classification may use a preset clustering algorithm such as K-Means (the K-means clustering algorithm). K-Means is one of the most commonly used clustering algorithms; it is simple, easy to understand, and fast, but the number of clusters must be specified before clustering.
In this exemplary embodiment, classifying the historical information texts in the first classification sets with the preset clustering algorithm may proceed as follows: determine the number of clusters for each first classification set according to the total number of historical information texts in that set; then, according to the second classification identifier, use the preset clustering algorithm to divide the historical information texts in each first classification set into a number of second classification sets equal to that number of clusters.
The number of clusters for each first classification set may be determined as follows: if the total number of historical information texts in the first classification set is greater than or equal to a text-count threshold, the number of clusters is determined from that total and a preset ratio; if the total is less than the text-count threshold, a preset number of clusters is used.
For example, when a first classification set contains 3 or more historical information texts, the integer value of two-thirds of that number may be used as the number of clusters; when it contains fewer than 3, the number of clusters may simply be set to 1.
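The cluster-count rule in the example above can be written as a small function. The disclosure says only that the integer value of two-thirds is taken; rounding down is an assumption here, as are the function and parameter names.

```python
def cluster_count(
    total: int,
    threshold: int = 3,
    numerator: int = 2,
    denominator: int = 3,
    default: int = 1,
) -> int:
    """Number of clusters for a first classification set holding `total` texts.

    Assumption: "the integer value of two-thirds" is interpreted as rounding down.
    Integer arithmetic avoids floating-point surprises.
    """
    if total >= threshold:
        return max(1, total * numerator // denominator)
    return default


if __name__ == "__main__":
    for n in (1, 2, 3, 6, 10):
        print(n, cluster_count(n))
```

With these defaults a set of 3 texts is split into 2 clusters and a set of 10 into 6, while sets of 1 or 2 texts stay in a single cluster.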
Step S440. Obtain an aggregate identifier from the first classification identifier and the second classification identifier, and classify the historical information texts in each second classification set again according to the aggregate identifier, obtaining a plurality of third classification sets.
After the second classification, a new aggregate identifier can be generated from the first classification identifier and the second classification identifier. The historical information texts in the second classification sets are then classified a third time according to the aggregate identifier, so that historical information texts with the same aggregate identifier are grouped together.
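One plausible reason for the aggregate identifier is that a clustering run inside each first classification set produces labels that are only locally meaningful (for example 0, 1, 2 in every set), so the second identifier alone cannot distinguish records across sets. The sketch below combines the two identifiers and regroups; the `#` separator and all names are assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Each item: (first classification ID, cluster label within that first set, name).
LabeledRecord = Tuple[str, int, str]


def aggregate_id(first_id: str, cluster_label: int) -> str:
    # Assumed format; any unambiguous combination of the two identifiers would do.
    return f"{first_id}#{cluster_label}"


def group_by_aggregate_id(records: List[LabeledRecord]) -> Dict[str, List[str]]:
    """Group names whose first ID and cluster label are both identical."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for first_id, cluster_label, name in records:
        groups[aggregate_id(first_id, cluster_label)].append(name)
    return dict(groups)
```

In this sketch, two records with cluster label 0 but different first identifiers end up in different third classification sets.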
Step S450. For the historical information texts in each third classification set, compute the pairwise cosine similarity between the historical texts they contain, and place historical information texts whose cosine similarity is greater than a first similarity threshold into the same set of similar information texts.
After the classification passes above, the historical information texts have been divided into as many classes as possible. The cosine similarity is then computed pairwise between the historical texts contained in the historical information texts of each third classification set, yielding the groups of synonyms that make up the thesaurus. For example, if the cosine similarity of two historical texts is greater than 0.97, they are placed in the same set of similar information texts.
Cosine similarity evaluates how similar two vectors are by computing the cosine of the angle between them, and can be applied to computing text similarity.
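Step S450 can be sketched as follows. The cosine similarity of two vectors is their dot product divided by the product of their lengths. The disclosure does not say how a text is turned into a vector, so this sketch assumes character-bigram counts; it also assumes that sets are merged transitively (if A matches B and B matches C, all three share a set). All function names are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations
from typing import Dict, List


def bigram_vector(text: str) -> Counter:
    # Assumed vectorization: counts of character bigrams of the padded, lowercased text.
    padded = f" {text.lower()} "
    return Counter(padded[i:i + 2] for i in range(len(padded) - 1))


def cosine_similarity(a: Counter, b: Counter) -> float:
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def group_similar(names: List[str], threshold: float = 0.97) -> List[List[str]]:
    """Return sets of names linked by pairwise similarity above the threshold."""
    parent = list(range(len(names)))

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    vectors = [bigram_vector(name) for name in names]
    for i, j in combinations(range(len(names)), 2):
        if cosine_similarity(vectors[i], vectors[j]) > threshold:
            parent[find(j)] = find(i)

    groups: Dict[int, List[str]] = {}
    for index, name in enumerate(names):
        groups.setdefault(find(index), []).append(name)
    # Only sets with at least two members describe synonyms.
    return [members for members in groups.values() if len(members) > 1]
```

Because the pairwise comparison is quadratic in the size of a set, running it only inside each third classification set keeps the number of comparisons small.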
Step S330. Generate the information-text thesaurus from the multiple sets of similar information texts.
Finally, the information-text thesaurus is generated from the multiple sets of similar information texts and is used to convert synonymous information texts.
Alternatively, a deep-learning approach may be used: the text data is first annotated, and a deep-learning algorithm is then used to compute the corresponding synonymous texts, achieving the same conversion effect.
In step S230, word segmentation is performed on the target text to obtain the valid text components it contains.
In this exemplary embodiment, the target text may be preprocessed before word segmentation. Specifically, invalid text components in the target text are filtered out, and word segmentation is then performed on the filtered target text to obtain the valid text components it contains.
Preprocessing may include removing special characters such as ")", "(", and "。". In addition, meaningless special words such as "unknown", "B/O", and "Baby of" also need to be removed.
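The preprocessing and segmentation described above might look like the following sketch. The character and word lists come from the examples in the text, with the ASCII period added as an assumed counterpart of "。"; whitespace splitting is an assumed segmentation strategy, and the function names are hypothetical.

```python
import re
from typing import List

# Noise words from the examples above, matched case-insensitively as whole words.
NOISE_PATTERN = re.compile(r"\b(?:baby\s+of|b/o|unknown)\b", re.IGNORECASE)
# Special characters from the examples above; "." is an assumed ASCII counterpart of "。".
SPECIAL_PATTERN = re.compile(r"[()。.]")


def preprocess(text: str) -> str:
    """Remove noise words and special characters, then collapse whitespace."""
    text = NOISE_PATTERN.sub(" ", text)
    text = SPECIAL_PATTERN.sub(" ", text)
    return " ".join(text.split())


def tokenize(text: str) -> List[str]:
    # Assumption: names are segmented on whitespace.
    return preprocess(text).split()


if __name__ == "__main__":
    print(tokenize("B/O Siti (Aminah) binti Ali."))
```

Noise words are removed before the special characters so that multi-word markers such as "Baby of" are matched against the untouched input.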
In step S240, a pre-generated text component rule set is obtained, and those valid text components that do not belong to the text component rule set are taken as the standard text components.
With the pre-generated text component rule set, the valid text components that are not needed for normalization can be deleted, leaving only the components that normalization requires. Taking the normalization of Malay names as an example, the rule set makes it possible to delete the variable words in a Malay name and keep only the fixed words, which serve as the words used in the final normalization, that is, the standard text components.
In this exemplary embodiment, as shown in FIG. 5, the method for generating the text component rule set may include the following steps:
Step S510. Obtain the historical texts contained in the historical information texts.
First, the historical texts contained in the historical information texts, for example historical name texts, are obtained.
Step S520. Perform word segmentation on the historical texts to obtain the valid historical text components they contain.
Before the historical text is segmented, it may likewise be preprocessed to remove special characters and meaningless special words, yielding the valid historical text components contained in the historical text.
Step S530. Compute the cosine similarity between each valid historical text component and the text components in the text component rule set.
The cosine similarity is then computed between each valid historical text component and the text components already in the text component rule set.
Step S540. If the cosine similarity between a valid historical text component and a text component in the text component rule set is greater than a second similarity threshold, add the valid historical text component to the text component rule set.
For example, when the cosine similarity between a valid historical text component and any text component in the rule set is greater than 0.95, the valid historical text component can be annotated and added to the text component rule set.
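Steps S530 and S540 can be sketched as a scan that flags tokens resembling an existing rule-set entry. The vectorization is again an assumption (character-bigram counts), and similarity values depend heavily on that choice, so the 0.95 default below is only the example value from the text. The function names and sample words are illustrative.

```python
import math
from collections import Counter
from typing import Iterable, List, Set


def _bigrams(text: str) -> Counter:
    # Assumed vectorization: character-bigram counts of the padded, lowercased token.
    padded = f" {text.lower()} "
    return Counter(padded[i:i + 2] for i in range(len(padded) - 1))


def _cosine(a: Counter, b: Counter) -> float:
    dot = sum(count * b[gram] for gram, count in a.items())
    norm = math.sqrt(sum(c * c for c in a.values())) * math.sqrt(
        sum(c * c for c in b.values())
    )
    return dot / norm if norm else 0.0


def find_rule_candidates(
    tokens: Iterable[str], rule_set: Set[str], threshold: float = 0.95
) -> List[str]:
    """Return tokens not yet in the rule set that resemble at least one entry.

    Assumption: rule-set entries are stored in lowercase.
    """
    rule_vectors = [_bigrams(rule) for rule in rule_set]
    candidates = []
    for token in tokens:
        if token.lower() in rule_set:
            continue  # already covered by the rule set
        vector = _bigrams(token)
        if any(_cosine(vector, rule) > threshold for rule in rule_vectors):
            candidates.append(token)
    return candidates
```

In the complete flow described later, such candidates are annotated manually before being added, so a function of this kind would only propose additions rather than commit them.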
In step S250, the standardized text corresponding to the original text is obtained from the standard text components.
Finally, the retained standard text components are merged in their original order to obtain the standardized text corresponding to the original text.
With the text standardization method of this exemplary embodiment, the core normalized part of the original text can be represented adaptively by however many words remain, rather than by a manually specified number of words.
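Steps S240 and S250 together reduce to filtering tokens against the rule set and joining what remains in order. The sketch below assumes a lowercase rule set and a single space as the joiner; the sample rule words are illustrative and not taken from the disclosure.

```python
from typing import Iterable, Set


def standardize(tokens: Iterable[str], rule_set: Set[str]) -> str:
    """Drop tokens found in the rule set and join the rest in their original order."""
    kept = [token for token in tokens if token.lower() not in rule_set]
    return " ".join(kept)


# Illustrative rule words only (variable parts of a name such as titles).
sample_rules: Set[str] = {"haji", "datuk", "bin", "binti"}

if __name__ == "__main__":
    print(standardize(["Haji", "Ahmad", "bin", "Ali"], sample_rules))
```

The number of words in the result is not fixed in advance: it is whatever survives the filter, which is the adaptive behavior described above.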
FIG. 6 is a complete flowchart of text standardization in a specific embodiment of the present disclosure. It can be applied to the normalization of Malay name texts and illustrates the above steps of this exemplary embodiment. The steps of the flowchart are as follows:
Step S602. Obtain the Malay name, date of birth, and gender text.
Step S604. Determine whether the name, date of birth, and gender text is in the name/date-of-birth/gender thesaurus.
The name, date of birth, and gender are compared against the name/date-of-birth/gender thesaurus. If the text is in the thesaurus, proceed to step S606 and use the synonymous name; otherwise, proceed to step S608 and use the original name.
Step S606. Convert the name text to the synonymous name text.
Step S608. Preprocess the name text.
Preprocessing may include removing special characters such as ")", "(", and "。". In addition, meaningless special words such as "unknown", "B/O", and "Baby of" also need to be removed.
Step S610. Segment the name text into words.
Step S612. Obtain the name word list.
Step S614. Obtain the name rule set.
Step S616. Match the name word list against the name rule set.
Step S618. Determine whether each name word is in the name rule set.
The name word list is matched against the name rule set. If a name word is not in the name rule set, it is retained and the flow proceeds to step S620; if it is in the name rule set, it is discarded.
Step S620. Obtain the list of retained name words.
Step S622. Obtain the normalized, standardized name text.
The final list of retained name words is merged in order to obtain the normalized, standardized name text.
Because name data comes from complex sources, Malay names have a special structure, and English words suffer from a series of problems such as misspelling, abbreviation, and run-together writing, name synonym discovery needs to be run over the full data set in order to improve the accuracy and recall of name normalization. FIG. 7 is a complete flowchart of generating the information-text thesaurus in a specific embodiment of the present disclosure; this thesaurus is the name/date-of-birth/gender thesaurus used in step S604 above. The steps of the flowchart are as follows:
Step S702. Obtain the full data set.
Step S704. Generate a first classification ID from the date of birth and gender.
For the records in the multi-source full data set that contain a name, date of birth, and gender, an ID is generated from the gender and date of birth; this is the first classification ID.
Step S706. Aggregate the data by the first classification ID.
The data is aggregated by the first classification ID, that is, records with the same ID are grouped together.
Step S708. Classify the aggregated data according to a second classification ID, where the second classification ID is generated from the name.
The data aggregated by the first classification ID is further classified by name to obtain the second classification ID. The classification algorithm used is K-Means. The strategy for choosing the number of clusters is as follows: when the name list contains more than 2 names, the integer value of two-thirds of the list size is used as the number of clusters; when it contains 2 or fewer names, the number of clusters is set to 1. The purpose of this strategy is to split the data into as many classes as possible, so that the subsequent similarity computation requires as few comparisons as possible and runs more efficiently.
Step S710. Generate an aggregate ID from the first classification ID and the second classification ID.
A new aggregate ID, the NID, is generated from the first classification ID and the second classification ID.
Step S712. Aggregate the data by the aggregate ID.
The data is aggregated by NID, so that records with the same NID are grouped together.
Step S714. Compute name similarity over the aggregated data.
Step S716. Determine whether the name similarity is greater than 0.97.
If the name similarity is greater than 0.97, the records are added to the similar data handled in step S718.
Step S718. Manually confirm the similar data.
Special names can be handled by manual intervention without affecting the normalization of other cases.
Step S720. Generate the name/date-of-birth/gender thesaurus.
A Malay name is composed of title + repeated name + first name + prefixed name + father's title + father's repeated name + father's first name. Of these, the title, repeated name, prefixed name, father's title, and father's repeated name are variable parts that may change over time. The feature-word types that need to be collected and extracted are therefore titles (including the father's title within the name), repeated names, and prefixed names, which together form the name rule set. The rule set must be discovered before normalization. FIG. 8 is a complete flowchart of generating the text component rule set in a specific embodiment of the present disclosure; this rule set is the name rule set used in step S616 above. The steps of the flowchart are as follows:
Step S802. Obtain the Malay name text.
Step S804. Preprocess the name text.
Preprocessing a Malay name may include removing special characters such as ")", "(", and "。". In addition, meaningless special words such as "unknown", "B/O", and "Baby of" also need to be removed.
Step S806. Segment the name text into words.
The name text is segmented to obtain the name word list.
Step S808. Obtain the name rule set.
Step S810. Compare the similarity between the name words and the name rule set.
Step S812. Determine whether the similarity is greater than 0.95.
If the similarity is greater than 0.95, the word is considered a candidate for the rule set, and the flow proceeds to step S814.
Step S814. Manual annotation.
The candidates for the rule set are annotated manually.
Step S816. Determine whether the candidate meets the requirements; if so, add the name word to the name rule set.
It should be noted that although the steps of the method of the present disclosure are depicted in the drawings in a particular order, this neither requires nor implies that the steps must be performed in that order, or that all of the illustrated steps must be performed, to achieve the desired result. Additionally or alternatively, some steps may be omitted, several steps may be combined into one, and/or one step may be split into several.
Further, the present disclosure also provides a text standardization apparatus. Referring to FIG. 9, the text standardization apparatus may include an original information text acquisition module 910, an original information text matching module 920, a valid text component acquisition module 930, a standard text component determination module 940, and a standardized text generation module 950. Wherein:
The original information text acquisition module 910 is configured to obtain the original information text, which includes the original text to be processed;
The original information text matching module 920 is configured to match the original information text against the pre-generated information-text thesaurus to obtain the target text corresponding to the original text contained in the original information text;
The valid text component acquisition module 930 is configured to perform word segmentation on the target text to obtain the valid text components it contains;
The standard text component determination module 940 is configured to obtain the pre-generated text component rule set and take those valid text components that do not belong to the text component rule set as the standard text components;
The standardized text generation module 950 is configured to obtain the standardized text corresponding to the original text from the standard text components.
In some exemplary embodiments of the present disclosure, the original information text matching module 920 may include a first target text determination unit and a second target text determination unit. Wherein:
The first target text determination unit is configured to, if the information-text thesaurus contains a target information text related to the original information text, use the target text contained in that target information text as the target text corresponding to the original text;
The second target text determination unit is configured to, if the information-text thesaurus contains no target information text related to the original information text, use the original text as the target text.
In some exemplary embodiments of the present disclosure, the valid text component acquisition module 930 may include an invalid component filtering unit and a target text segmentation unit. Wherein:
The invalid component filtering unit is configured to filter out the invalid text components in the target text;
The target text segmentation unit is configured to perform word segmentation on the filtered target text to obtain the valid text components it contains.
In some exemplary embodiments of the present disclosure, the text standardization apparatus provided by the present disclosure may further include an information-text thesaurus generation module. Wherein:
The information-text thesaurus generation module may include a historical information text acquisition unit, a historical information text classification unit, and a thesaurus generation unit.
The historical information text acquisition unit is configured to obtain historical information texts, the historical text contained in each historical information text, and the data information corresponding to the historical text;
The historical information text classification unit is configured to classify the historical information texts according to the historical texts and their corresponding data information to obtain multiple sets of similar information texts;
The thesaurus generation unit is configured to generate the information-text thesaurus from the multiple sets of similar information texts.
In some exemplary embodiments of the present disclosure, the historical information text classification unit may include a first classification identifier determination unit, a first classification set determination unit, a second classification set determination unit, a third classification set determination unit, and a cosine similarity calculation unit, where:
The first classification identifier determination unit is configured to obtain a first classification identifier of each historical information text according to the data information corresponding to the historical text;
The first classification set determination unit is configured to classify the historical information texts according to the first classification identifier to obtain a plurality of first classification sets, where the historical information texts in each first classification set have the same first classification identifier;
The second classification set determination unit is configured to obtain a second classification identifier of each historical information text according to the historical text, and to reclassify the historical information texts in each first classification set according to the second classification identifier using a preset clustering algorithm, to obtain a plurality of second classification sets;
The third classification set determination unit is configured to obtain an aggregate identifier according to the first classification identifier and the second classification identifier, and to reclassify the historical information texts in each second classification set according to the aggregate identifier, to obtain a plurality of third classification sets;
The cosine similarity calculation unit is configured to calculate, for the historical information texts in each third classification set, the pairwise cosine similarity between the historical texts contained in those historical information texts, and to place historical information texts whose cosine similarity is greater than a first similarity threshold in the same similar information text set.
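As a non-limiting illustration, the final grouping step performed by the cosine similarity calculation unit might be sketched in Python as follows. The disclosure does not specify how historical texts are vectorized or how pairwise matches are merged; the character-bigram vectors, the union-find merging (which treats similarity as transitive), and all function names are assumptions made for this sketch only.

```python
from __future__ import annotations

from collections import Counter
from math import sqrt


def char_bigrams(text: str) -> Counter:
    """Count overlapping character bigrams (a hypothetical vectorization)."""
    return Counter(text[i:i + 2] for i in range(len(text) - 1))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def group_similar(texts: list[str], threshold: float) -> list[list[str]]:
    """Put texts whose pairwise similarity exceeds `threshold` in one set."""
    parent = list(range(len(texts)))

    def find(i: int) -> int:
        # Follow parent links to the representative, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    vectors = [char_bigrams(t) for t in texts]
    for i in range(len(texts)):
        for j in range(i + 1, len(texts)):
            if cosine_similarity(vectors[i], vectors[j]) > threshold:
                parent[find(j)] = find(i)  # merge the two groups

    groups: dict[int, list[str]] = {}
    for i, text in enumerate(texts):
        groups.setdefault(find(i), []).append(text)
    return list(groups.values())
```

In this sketch, `group_similar` would be called once per third classification set, so that only texts that already share the same aggregate identifier are compared pairwise.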
In some exemplary embodiments of the present disclosure, the second classification set determination unit may include a cluster number determination unit and an information text division unit, where:
The cluster number determination unit is configured to determine the number of clusters corresponding to each first classification set according to the total number of historical information texts in that first classification set;
The information text division unit is configured to divide, according to the second classification identifier and using the preset clustering algorithm, the historical information texts in each first classification set into a plurality of second classification sets corresponding to the number of clusters.
In some exemplary embodiments of the present disclosure, the cluster number determination unit may include a first cluster number determination unit and a second cluster number determination unit, where:
The first cluster number determination unit is configured to determine, if the total number of historical information texts in a first classification set is greater than or equal to a text quantity threshold, the number of clusters corresponding to that first classification set according to the total number of historical information texts and a preset ratio;
The second cluster number determination unit is configured to acquire, if the total number of historical information texts in a first classification set is less than or equal to the text quantity threshold, a preset number of clusters as the number of clusters corresponding to that first classification set.
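The cluster number rule above might be sketched as follows. The disclosure states only that a text quantity threshold, a preset ratio, and a preset number of clusters exist; the default values, the reading of "total and preset ratio" as a product, and the cap at the set size are assumptions of this sketch. Both conditions as written include equality, so the sketch resolves the tie in favor of the first branch, which is also an assumption.

```python
import math


def cluster_count(total_texts: int, text_threshold: int = 100,
                  ratio: float = 0.1, default_clusters: int = 5) -> int:
    """Choose the number of clusters for one first classification set.

    All default values are hypothetical placeholders.
    """
    if total_texts >= text_threshold:
        # Assumed reading: number of clusters = total * preset ratio.
        return max(1, math.ceil(total_texts * ratio))
    # Small sets use the preset count, capped so that the number of
    # clusters never exceeds the number of texts (an added safeguard).
    return max(1, min(default_clusters, total_texts))
```

The returned value would then be passed to the preset clustering algorithm as its cluster count when dividing that first classification set into second classification sets.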
In some exemplary embodiments of the present disclosure, the text standardization processing apparatus provided by the present disclosure may further include a text component rule set generation module, where:
The text component rule set generation module may include a historical text acquisition unit, a valid text component acquisition unit, a cosine similarity calculation unit, and a rule set generation unit.
The historical text acquisition unit is configured to acquire the historical texts contained in the historical information texts;
The valid text component acquisition unit is configured to perform word segmentation on the historical texts to obtain the valid historical text components contained in the historical texts;
The cosine similarity calculation unit is configured to calculate the cosine similarity between each valid historical text component and the text components in the text component rule set;
The rule set generation unit is configured to add a valid historical text component to the text component rule set if the cosine similarity between that valid historical text component and a text component in the text component rule set is greater than a second similarity threshold.
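By way of a hedged example, the rule set generation described above might look like the following Python sketch. Whitespace splitting stands in for the unspecified word segmentation, character counts stand in for the unspecified vectorization, and the seed rule set and all names are hypothetical.

```python
from collections import Counter
from math import sqrt


def _vec(token: str) -> Counter:
    # Hypothetical vectorization: character counts of the component.
    return Counter(token)


def _cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def extend_rule_set(historical_texts, seed_rules, threshold):
    """Grow the rule set with components similar to existing entries."""
    rules = set(seed_rules)
    for text in historical_texts:
        for component in text.split():  # assumed segmentation
            if component in rules:
                continue
            # Add the component if it is close enough to any existing rule.
            if any(_cosine(_vec(component), _vec(r)) > threshold
                   for r in rules):
                rules.add(component)
    return rules
```

For instance, with the seed set `{"suspected"}` and a threshold of 0.9, the component `"suspect"` would be added while `"pneumonia"` and `"fever"` would not. Because the set grows as texts are processed, the result of this sketch can depend on processing order.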
The specific details of each module/unit in the above text standardization processing apparatus have been described in detail in the corresponding method embodiments and are not repeated here.
FIG. 10 shows a schematic structural diagram of a computer system of an electronic device suitable for implementing embodiments of the present invention.
It should be noted that the computer system 1000 of the electronic device shown in FIG. 10 is only an example and should not impose any limitation on the functions and scope of use of the embodiments of the present invention.
As shown in FIG. 10, the computer system 1000 includes a central processing unit (CPU) 1001, which can perform various appropriate actions and processes according to a program stored in a read-only memory (ROM) 1002 or a program loaded from a storage section 1008 into a random access memory (RAM) 1003. The RAM 1003 also stores various programs and data required for system operation. The CPU 1001, the ROM 1002, and the RAM 1003 are connected to one another through a bus 1004. An input/output (I/O) interface 1005 is also connected to the bus 1004.
The following components are connected to the I/O interface 1005: an input section 1006 including a keyboard, a mouse, and the like; an output section 1007 including a cathode ray tube (CRT) or liquid crystal display (LCD), a speaker, and the like; a storage section 1008 including a hard disk and the like; and a communication section 1009 including a network interface card such as a LAN card or a modem. The communication section 1009 performs communication processing via a network such as the Internet. A drive 1010 is also connected to the I/O interface 1005 as needed. A removable medium 1011, such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory, is mounted on the drive 1010 as needed, so that a computer program read from it can be installed into the storage section 1008 as needed.
In particular, according to embodiments of the present invention, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present invention include a computer program product comprising a computer program carried on a computer-readable medium, the computer program containing program code for performing the methods illustrated in the flowcharts. In such embodiments, the computer program may be downloaded and installed from a network via the communication section 1009, and/or installed from the removable medium 1011. When the computer program is executed by the central processing unit (CPU) 1001, the various functions defined in the system of the present disclosure are performed.
It should be noted that the computer-readable medium described in the present disclosure may be a computer-readable signal medium, a computer-readable storage medium, or any combination of the two. A computer-readable storage medium may be, for example but without limitation, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the above. More specific examples of computer-readable storage media include, but are not limited to: an electrical connection with one or more wires, a portable computer disk, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above. In the present disclosure, a computer-readable storage medium may be any tangible medium that contains or stores a program that can be used by or in conjunction with an instruction execution system, apparatus, or device. A computer-readable signal medium, by contrast, may include a data signal propagated in baseband or as part of a carrier wave and carrying computer-readable program code. Such a propagated data signal may take a variety of forms, including but not limited to an electromagnetic signal, an optical signal, or any suitable combination of the above. A computer-readable signal medium may also be any computer-readable medium other than a computer-readable storage medium that can send, propagate, or transport a program for use by or in conjunction with an instruction execution system, apparatus, or device. Program code contained on a computer-readable medium may be transmitted using any suitable medium, including but not limited to wireless, wireline, optical fiber cable, RF, or any suitable combination of the above.
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various embodiments of the present disclosure. In this regard, each block in a flowchart or block diagram may represent a module, a program segment, or a portion of code that contains one or more executable instructions for implementing the specified logical function. It should also be noted that, in some alternative implementations, the functions noted in the blocks may occur in an order different from that noted in the figures. For example, two blocks shown in succession may in fact be executed substantially in parallel, or may sometimes be executed in the reverse order, depending on the functionality involved. It should also be noted that each block of the block diagrams or flowcharts, and combinations of blocks in the block diagrams or flowcharts, can be implemented by a dedicated hardware-based system that performs the specified functions or operations, or by a combination of dedicated hardware and computer instructions.
As another aspect, the present disclosure also provides a computer-readable medium. The computer-readable medium may be included in the electronic device described in the above embodiments, or it may exist separately without being assembled into the electronic device. The computer-readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to implement the methods described in the above embodiments.
It should be noted that although several modules of the device for performing actions are mentioned in the detailed description above, this division is not mandatory. Indeed, according to embodiments of the present disclosure, the features and functions of two or more modules described above may be embodied in one module. Conversely, the features and functions of one module described above may be further divided among and embodied in multiple modules.
Other embodiments of the present disclosure will readily occur to those skilled in the art upon consideration of the specification and practice of the invention disclosed herein. The present disclosure is intended to cover any variations, uses, or adaptations of the present disclosure that follow its general principles and include common general knowledge or customary technical means in the art that are not disclosed in the present disclosure.
It should be understood that the present disclosure is not limited to the precise structures described above and illustrated in the accompanying drawings, and that various modifications and changes may be made without departing from its scope. The scope of the present disclosure is limited only by the appended claims.

Claims (11)

  1. A text standardization processing method, comprising:
    acquiring an original information text, the original information text including an original text to be processed;
    matching the original information text against a pre-generated information text thesaurus to obtain a target text corresponding to the original text in the original information text;
    performing word segmentation on the target text to obtain the valid text components contained in the target text;
    acquiring a pre-generated text component rule set, and taking those valid text components that do not belong to the text component rule set as standard text components; and
    obtaining a standardized text corresponding to the original text according to the standard text components.
  2. The text standardization processing method according to claim 1, wherein the matching the original information text against a pre-generated information text thesaurus to obtain a target text corresponding to the original text in the original information text comprises:
    if a target information text related to the original information text exists in the information text thesaurus, taking the target text contained in the target information text as the target text corresponding to the original text; and
    if no target information text related to the original information text exists in the information text thesaurus, taking the original text as the target text.
  3. The text standardization processing method according to claim 1, wherein the performing word segmentation on the target text to obtain the valid text components contained in the target text comprises:
    filtering out invalid text components from the target text; and
    performing word segmentation on the filtered target text to obtain the valid text components contained in the target text.
  4. The text standardization processing method according to claim 1, wherein a method for generating the information text thesaurus comprises:
    acquiring historical information texts, the historical texts contained in the historical information texts, and the data information corresponding to the historical texts;
    classifying the historical information texts according to the historical texts and the data information corresponding to the historical texts, to obtain multiple sets of similar information texts; and
    generating the information text thesaurus from the multiple sets of similar information texts.
  5. The text standardization processing method according to claim 4, wherein the classifying the historical information texts according to the historical texts and the data information corresponding to the historical texts, to obtain multiple sets of similar information texts, comprises:
    obtaining a first classification identifier of each historical information text according to the data information corresponding to the historical text;
    classifying the historical information texts according to the first classification identifier to obtain a plurality of first classification sets, wherein the historical information texts in each first classification set have the same first classification identifier;
    obtaining a second classification identifier of each historical information text according to the historical text, and reclassifying the historical information texts in each first classification set according to the second classification identifier using a preset clustering algorithm, to obtain a plurality of second classification sets;
    obtaining an aggregate identifier according to the first classification identifier and the second classification identifier, and reclassifying the historical information texts in each second classification set according to the aggregate identifier, to obtain a plurality of third classification sets; and
    for the historical information texts in each third classification set, calculating the pairwise cosine similarity between the historical texts contained in the historical information texts, and placing the historical information texts whose cosine similarity is greater than a first similarity threshold in the same similar information text set.
  6. The text standardization processing method according to claim 5, wherein the reclassifying the historical information texts in each first classification set according to the second classification identifier using a preset clustering algorithm, to obtain a plurality of second classification sets, comprises:
    determining the number of clusters corresponding to each first classification set according to the total number of historical information texts in that first classification set; and
    dividing, according to the second classification identifier and using the preset clustering algorithm, the historical information texts in each first classification set into a plurality of second classification sets corresponding to the number of clusters.
  7. The text standardization processing method according to claim 6, wherein the determining the number of clusters corresponding to each first classification set according to the total number of historical information texts in that first classification set comprises:
    if the total number of historical information texts in the first classification set is greater than or equal to a text quantity threshold, determining the number of clusters corresponding to the first classification set according to the total number of historical information texts and a preset ratio; and
    if the total number of historical information texts in the first classification set is less than or equal to the text quantity threshold, acquiring a preset number of clusters as the number of clusters corresponding to the first classification set.
  8. The text standardization processing method according to claim 1, wherein a method for generating the text component rule set comprises:
    acquiring the historical texts contained in historical information texts;
    performing word segmentation on the historical texts to obtain the valid historical text components contained in the historical texts;
    calculating the cosine similarity between each valid historical text component and the text components in the text component rule set; and
    if the cosine similarity between a valid historical text component and a text component in the text component rule set is greater than a second similarity threshold, adding the valid historical text component to the text component rule set.
  9. A text standardization processing apparatus, comprising:
    an original information text acquisition module, configured to acquire an original information text, the original information text including an original text to be processed;
    an original information text matching module, configured to match the original information text against a pre-generated information text thesaurus to obtain a target text corresponding to the original text in the original information text;
    a valid text component acquisition module, configured to perform word segmentation on the target text to obtain the valid text components contained in the target text;
    a standard text component determination module, configured to acquire a pre-generated text component rule set and to take those valid text components that do not belong to the text component rule set as standard text components; and
    a standardized text generation module, configured to obtain a standardized text corresponding to the original text according to the standard text components.
  10. An electronic device, comprising:
    a processor; and
    a memory for storing one or more programs which, when executed by the one or more processors, cause the one or more processors to implement the text standardization processing method according to any one of claims 1 to 8.
  11. A computer-readable medium having a computer program stored thereon, wherein the program, when executed by a processor, implements the text standardization processing method according to any one of claims 1 to 8.
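
For orientation only, the method steps recited in claims 1 to 3 might be sketched in Python as follows. This is an illustrative sketch under stated assumptions and does not limit the claims: the original information text is modeled as a hypothetical `(context, original_text)` pair, the thesaurus as a dictionary keyed by that pair, invalid components as a set of characters, and word segmentation as whitespace splitting. None of these representations is specified by the disclosure.

```python
from __future__ import annotations


def standardize(original_info: tuple[str, str],
                thesaurus: dict[tuple[str, str], str],
                rule_set: set[str],
                invalid_chars: str = "?()") -> str:
    """Return a standardized text for one original information text."""
    context, original_text = original_info

    # Thesaurus match; fall back to the original text when no related
    # entry exists (the two branches of claim 2).
    target_text = thesaurus.get((context, original_text), original_text)

    # Filter invalid components before segmentation (claim 3). Treating
    # single characters as the invalid components is an assumption.
    cleaned = "".join(ch for ch in target_text if ch not in invalid_chars)

    # Assumed segmentation: split on whitespace.
    components = cleaned.split()

    # Components not in the rule set are kept as standard components.
    standard = [c for c in components if c not in rule_set]
    return " ".join(standard)
```

With a hypothetical thesaurus entry mapping `("cardiology", "HTN")` to `"suspected hypertension"` and a rule set `{"suspected"}`, the sketch would return `"hypertension"`. An unmatched input such as `("cardiology", "suspected fever?")` would fall back to its own text and yield `"fever"`.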
PCT/CN2021/127971 2020-12-29 2021-11-01 Standardization processing method and apparatus for text, and electronic device and computer medium WO2022142703A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202011594885.4 2020-12-29
CN202011594885.4A CN112700881B (en) 2020-12-29 2020-12-29 Text standardization processing method and device, electronic equipment and computer medium

Publications (1)

Publication Number Publication Date
WO2022142703A1 true WO2022142703A1 (en) 2022-07-07

Family

ID=75511901

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/127971 WO2022142703A1 (en) 2020-12-29 2021-11-01 Standardization processing method and apparatus for text, and electronic device and computer medium

Country Status (2)

Country Link
CN (2) CN114613516B (en)
WO (1) WO2022142703A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116306638A (en) * 2023-05-22 2023-06-23 上海维智卓新信息科技有限公司 POI data matching method, electronic equipment and storage medium

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114613516B (en) * 2020-12-29 2022-12-06 医渡云(北京)技术有限公司 Text standardization processing method and device, electronic equipment and computer medium
CN114596182B (en) * 2022-03-09 2023-05-16 王淑娟 Government affair management method and system based on big data

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20040024760A1 (en) * 2002-07-31 2004-02-05 Phonetic Research Ltd. System, method and computer program product for matching textual strings using language-biased normalisation, phonetic representation and correlation functions
US20050055372A1 (en) * 2003-09-04 2005-03-10 Microsoft Corporation Matching media file metadata to standardized metadata
US20140214778A1 (en) * 2006-02-17 2014-07-31 Google Inc. Entity Normalization Via Name Normalization
CN110991168A (en) * 2019-12-05 2020-04-10 京东方科技集团股份有限公司 Synonym mining method, synonym mining device, and storage medium
CN111881680A (en) * 2020-08-04 2020-11-03 医渡云(北京)技术有限公司 Text standardization processing method and device, electronic equipment and computer medium
CN112700881A (en) * 2020-12-29 2021-04-23 医渡云(北京)技术有限公司 Text standardization processing method and device, electronic equipment and computer medium

Family Cites Families (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10089302B2 (en) * 2013-02-26 2018-10-02 International Business Machines Corporation Native-script and cross-script chinese name matching
KR101482430B1 (en) * 2013-08-13 2015-01-15 포항공과대학교 산학협력단 Method for correcting error of preposition and apparatus for performing the same
US8831969B1 (en) * 2013-10-02 2014-09-09 Linkedin Corporation System and method for determining users working for the same employers in a social network
CN105095204B (en) * 2014-04-17 2018-12-14 阿里巴巴集团控股有限公司 The acquisition methods and device of synonym
CN107729309B (en) * 2016-08-11 2022-11-08 中兴通讯股份有限公司 Deep learning-based Chinese semantic analysis method and device
CN106446025B (en) * 2016-08-30 2019-10-11 东软集团股份有限公司 A kind of method and apparatus of standardized text information
CN110909226B (en) * 2019-11-28 2023-06-06 达而观信息科技(上海)有限公司 Financial document information processing method and device, electronic equipment and storage medium
CN111046632B (en) * 2019-11-29 2023-11-10 智器云南京信息科技有限公司 Data extraction and conversion method, system, storage medium and electronic equipment
CN111813399B (en) * 2020-07-23 2022-05-31 平安医疗健康管理股份有限公司 Machine learning-based auditing rule processing method and device and computer equipment

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116306638A (en) * 2023-05-22 2023-06-23 上海维智卓新信息科技有限公司 POI data matching method, electronic equipment and storage medium
CN116306638B (en) * 2023-05-22 2023-08-11 上海维智卓新信息科技有限公司 POI data matching method, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN114613516A (en) 2022-06-10
CN112700881A (en) 2021-04-23
CN112700881B (en) 2022-04-08
CN114613516B (en) 2022-12-06

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21913447

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21913447

Country of ref document: EP

Kind code of ref document: A1