WO2022142703A1

WO2022142703A1 - Standardization processing method and apparatus for text, and electronic device and computer medium

Info

Publication number: WO2022142703A1
Application number: PCT/CN2021/127971
Authority: WO
Inventors: 滕召荣; 刘斌; 郝东林
Original assignee: 医渡云（北京）技术有限公司
Priority date: 2020-12-29
Filing date: 2021-11-01
Publication date: 2022-07-07
Also published as: CN114613516A; CN112700881A; CN112700881B; CN114613516B

Abstract

A standardization processing method and apparatus for text, and an electronic device and a computer-readable medium, which belong to the technical field of data processing. The method comprises: acquiring original information text, wherein the original information text comprises original text to be processed (S210); performing matching on the original information text according to a pre-generated information text synonym dictionary, so as to obtain target text corresponding to the original text in the original information text (S220); performing word segmentation processing on the target text to obtain effective text components included in the target text (S230); acquiring a pre-generated text component rule set, and taking an effective text component, which does not belong to the text component rule set, from among the effective text components as a standard text component (S240); and obtaining, according to the standard text component, standardized text corresponding to the original text (S250). By means of an information text synonym dictionary and a text component rule set, normalization processing is performed on original text to obtain standardized text, such that the efficiency and accuracy of text normalization can be improved.

Description

Standardized processing method, device, electronic device and computer medium for text

CROSS-REFERENCE TO RELATED APPLICATIONS

This disclosure claims priority to Chinese patent application No. 202011594885.4, titled "Method, Apparatus, Electronic Equipment and Computer Medium for Standardized Processing of Text" and filed on December 29, 2020, the entire content of which is incorporated herein by reference.

Technical field

The present disclosure relates to the technical field of data processing, and in particular, to a method for standardizing text, an apparatus for standardizing text, an electronic device, and a computer-readable medium.

Background

Due to the variety of texts such as names or addresses in foreign languages, it is difficult to have a unified standard. Therefore, the results obtained by normalization are often inaccurate. In many cases, manual identification and processing are required, which is inefficient.

In view of this, there is an urgent need in the art for a text normalization processing method that can improve the efficiency and accuracy of text normalization.

It should be noted that the information disclosed in the above Background section is only for enhancement of understanding of the background of the present disclosure, and therefore may contain information that does not form the prior art that is already known to a person of ordinary skill in the art.

SUMMARY OF THE INVENTION

The purpose of the present disclosure is to provide a text normalization processing method, a text normalization processing device, an electronic device, and a computer-readable medium, so as to improve the efficiency and accuracy of text normalization at least to a certain extent.

According to a first aspect of the present disclosure, a method for standardizing text is provided, comprising:

Obtaining original information text, wherein the original information text includes original text to be processed;

Matching the original information text against a pre-generated information text thesaurus to obtain target text corresponding to the original text in the original information text;

Performing word segmentation on the target text to obtain each valid text component contained in the target text;

Obtaining a pre-generated text component rule set, and using, as standard text components, those valid text components that do not belong to the text component rule set;

Obtaining standardized text corresponding to the original text according to the standard text components.

According to a second aspect of the present disclosure, there is provided an apparatus for standardizing text, comprising:

an original information text acquisition module, configured to acquire original information text, wherein the original information text includes original text to be processed;

an original information text matching module, configured to match the original information text against a pre-generated information text thesaurus to obtain target text corresponding to the original text in the original information text;

a valid text component acquisition module, configured to perform word segmentation on the target text to obtain each valid text component contained in the target text;

a standard text component determination module, configured to acquire a pre-generated text component rule set and to use, as standard text components, those valid text components that do not belong to the text component rule set;

a standardized text generation module, configured to obtain standardized text corresponding to the original text according to the standard text components.

According to a third aspect of the present disclosure, there is provided an electronic device comprising: a processor; and a memory for storing executable instructions of the processor; wherein the processor is configured to execute the executable instructions to perform the text standardization processing method described in any one of the above.

According to a fourth aspect of the present disclosure, there is provided a computer-readable medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements any one of the above-described text standardization processing methods.

It is to be understood that the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the present disclosure.

Description of drawings

The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the disclosure and together with the description serve to explain the principles of the disclosure. Obviously, the drawings in the following description are only some embodiments of the present disclosure, and for those of ordinary skill in the art, other drawings can also be obtained from these drawings without creative effort.

FIG. 1 shows a schematic flowchart of Malay name normalization according to a related embodiment of the present disclosure;

FIG. 2 shows a schematic flowchart of a text standardization processing method according to an exemplary embodiment of the present disclosure;

FIG. 3 shows a schematic flowchart of a method for generating an information text thesaurus according to an exemplary embodiment of the present disclosure;

FIG. 4 shows a schematic flowchart of obtaining multiple sets of similar information text sets according to an exemplary embodiment of the present disclosure;

FIG. 5 shows a schematic flowchart of a method for generating a text component rule set according to an exemplary embodiment of the present disclosure;

FIG. 6 shows a schematic flowchart of a method for standardizing text in a specific embodiment of the present disclosure;

FIG. 7 shows a schematic flowchart of a method for generating an information text thesaurus according to a specific embodiment of the present disclosure;

FIG. 8 shows a schematic flowchart of a method for generating a text component rule set according to an embodiment of the present disclosure;

FIG. 9 shows a block diagram of a text normalization processing apparatus according to an exemplary embodiment of the present disclosure;

FIG. 10 shows a schematic structural diagram of a computer system suitable for implementing an electronic device of an embodiment of the present disclosure.

Detailed description

Example embodiments will now be described more fully with reference to the accompanying drawings. Example embodiments, however, can be embodied in various forms and should not be construed as limited to the examples set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of example embodiments to those skilled in the art. The described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. In the following description, numerous specific details are provided in order to give a thorough understanding of the embodiments of the present disclosure. However, those skilled in the art will appreciate that the technical solutions of the present disclosure may be practiced without one or more of the specific details, or other methods, components, devices, steps, etc. may be employed. In other instances, well-known solutions have not been shown or described in detail to avoid obscuring aspects of the present disclosure.

Furthermore, the drawings are merely schematic illustrations of the present disclosure and are not necessarily drawn to scale. The same reference numerals in the drawings denote the same or similar parts, and thus their repeated descriptions will be omitted. Some of the block diagrams shown in the figures are functional entities that do not necessarily correspond to physically or logically separate entities. These functional entities may be implemented in software, or in one or more hardware modules or integrated circuits, or in different networks and/or processor devices and/or microcontroller devices.

Normalization refers to the standardization of data: after different data are processed by a normalization algorithm, they are converted into the same standard form. Data normalization is a branch of NLP (Natural Language Processing) technology, and refers to normalizing data by NLP technical means.

When computing a unique identifier for a person across multi-source data, the most typical approach is to use the name, date of birth and gender. Taking the Malay language system as an example, personal names in the Malay language system are quite special, and their most notable feature is that they keep changing as the person ages. For example, a marker of adulthood is added to a Malay name when the person becomes an adult; when a certain social title is obtained, a marker of the title is added; and when the person has travelled to a religious holy place, an identifier of that place is added to the name. Such name changes make it very difficult to compute a unique identifier for a person. Therefore, it is necessary to standardize the name text.

A name in the Malay system is composed of title + duplicate name + first name + title + parent title + parent duplicate name + parent first name, of which the title, duplicate name, parent title and parent duplicate name are all variable parts that may change over time. Normalizing a Malay name therefore means removing its variable parts by technical means and leaving only the fixed, immutable parts.

In some related embodiments, taking the normalization of Malay name text as an example, normalization can be realized by the complete flowchart of Malay name normalization shown in FIG. 1, and the specific steps of the flowchart are as follows:

Step S102. Obtain the Malay name text.

Step S104. Name text preprocessing.

When normalizing Malay names, the name text needs to be preprocessed first. Preprocessing includes removing special characters, such as ")", "(", "." and other symbols, and also removing meaningless special words, such as "unknown", "B/O" and "Baby of".

Step S106. Name text segmentation.

Words can be split according to spaces.

Step S108. Obtain the Malay name and word frequency mapping table.

The Malay name word frequency mapping table is built by counting the number of occurrences of each Malay name word in the historical text data and constructing a hash (HASH) mapping from word to count according to the statistical result, where the count is the total number of occurrences of that name word. The Malay name word frequency mapping table can be used as a basic dictionary for name unification.

A hash map, also known as a hash table or HashMap, is a collection used to store key-value pairs. Each key-value pair is also called an Entry, and these Entry objects are stored in an array that backs the HashMap.

Step S110. Perform name and word frequency mapping according to the Malay name and word frequency mapping table.

For the words after the word segmentation of the name text, the word frequency data of a single name is obtained according to the Malay name word frequency mapping table.

Step S112. Build a minimum heap.

Build a min-heap from the word frequency data. A min-heap is a complete binary tree in which the value of any non-leaf node is not greater than the values of its left and right child nodes. A min-heap is usually used to find the N smallest values.

Step S114. Take the 2 words with the lowest mapped frequency.

Step S116. Synthesize normalized name text.

Finally, word merging is performed to obtain the normalized name text.

By normalizing Malay names to obtain the core invariant part of Malay names, people can be uniquely identified in multi-source big data through the normalized names, dates of birth, and genders.
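The related-embodiment flow of steps S106 to S116 can be sketched as follows. This is a minimal illustration rather than the implementation described in the disclosure: the function names, the sample names, the lowercasing, and the choice to merge the kept words in their original order are all assumptions, and the preprocessing of step S104 is omitted.

```python
import heapq
from collections import Counter

def build_frequency_table(historical_names):
    """Count how often each word occurs across historical names (step S108)."""
    table = Counter()
    for name in historical_names:
        table.update(name.lower().split())
    return table

def normalize_by_frequency(name, table, keep=2):
    """Keep the `keep` least frequent words of a name (steps S106 to S116)."""
    words = name.lower().split()
    # Heap entries are (frequency, position); the position breaks ties.
    heap = [(table.get(word, 0), i) for i, word in enumerate(words)]
    heapq.heapify(heap)
    rarest = [heapq.heappop(heap) for _ in range(min(keep, len(heap)))]
    # Assumption: the kept words are merged in their original order.
    positions = sorted(i for _, i in rarest)
    return " ".join(words[i] for i in positions)

# Hypothetical historical data, for illustration only.
history = [
    "haji ahmad bin ali",
    "haji osman bin ali",
    "siti binti osman",
    "haji ismail bin osman",
]
table = build_frequency_table(history)
print(normalize_by_frequency("haji ahmad bin ismail", table))
```

Because `keep` is fixed, a name whose invariant part has three or four words is truncated, which is the inflexibility discussed next.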

The normalization method in the above-mentioned related embodiments rests on the assumption that Malay names differ considerably from one another, which is reasonable from a common-sense standpoint. However, the method has the following problems:

On the one hand, a normalized Malay name does not necessarily have a fixed number of words. For example, some names have 2 words after normalization, some have 3, and some may have 4, so a result obtained by always taking 2 words, or any other fixed number of words, is inflexible and can be inaccurate.

On the other hand, in order to address some cases of name normalization errors, it may be necessary to adjust the priority of some words. If the priority of a word is adjusted manually, it will affect the subsequent merging of the manually adjusted priority word with the automatically constructed word; in addition, the adjusted word priority may cause errors in the normalization process of some names. Therefore, the applicability or generalization of the above scheme is not enough.

Based on the above problems, the present exemplary embodiment first provides a method for standardizing text. Referring to Fig. 2, the standardization processing method of the above text may include the following steps:

Step S210. Obtain the original information text, where the original information text includes the original text to be processed.

Step S220. Match the original information text according to the pre-generated information text thesaurus to obtain the target text corresponding to the original text in the original information text.

Step S230. Perform word segmentation processing on the target text to obtain each effective text component contained in the target text.

Step S240. Obtain a pre-generated text component rule set, and use the valid text components that do not belong to the text component rule set in each valid text component as standard text components.

Step S250. Obtain standardized text corresponding to the original text according to the standard text components.

In the text standardization processing method according to the exemplary embodiment of the present disclosure, on the one hand, matching the original information text against a pre-generated information text thesaurus finds the synonyms of the original text in the original information text, so that correctly written, misspelled, abbreviated, reversed and jointly written variants of a text are all discovered, which improves the overall recall of the text and the accuracy of the standardization processing. On the other hand, matching each valid text component of the original text against a pre-generated text component rule set allows text rules to be discovered, reduces manual participation in the processing, and improves processing efficiency. Finally, standardizing the original text in this way can greatly improve the computability and linkability of text data in multi-source big data scenarios, and can further improve the efficiency of text data statistics and management.

Hereinafter, the above steps of this exemplary embodiment will be described in more detail with reference to FIGS. 3 to 5.

In step S210, the original information text is obtained, and the original information text includes the original text to be processed.

In this example implementation, the original information text refers to a complete text that includes the original text to be processed and some data information corresponding to the original text, where the original text to be processed is the text that needs to be standardized. For example, the original text to be processed may be name text or address text. Taking name text as an example, the original information text may be the complete name, date of birth and gender text that includes the name text, and the data information corresponding to the original text is the date of birth and gender.

In step S220, the original information text is matched according to the pre-generated information text thesaurus to obtain the target text corresponding to the original text in the original information text.

In this example implementation, since the data sources of the text are relatively complex, and words may suffer from problems such as misspelling, abbreviation and joint writing, in order to improve the accuracy and recall of text normalization it is necessary to generate an information text thesaurus in advance by discovering text synonyms over the full data set, and to convert original text that has synonyms into the corresponding target text. Here, the target text is the unified text into which every member of a group of synonymous texts is converted.

In this example implementation, when the original information text is matched against the pre-generated information text thesaurus, if the thesaurus contains target information text related to the original information text, the target text in that target information text is used as the target text corresponding to the original text; if the thesaurus contains no target information text related to the original information text, the original text itself is used as the target text.

For example, the name text is compared with the pre-generated name, gender and birthday thesaurus. If the thesaurus contains a target name text that is synonymous with the name text, the name text is converted to the target name text; if not, the original name text is used directly in the subsequent steps.

In this exemplary implementation, as shown in FIG. 3, the method for generating the information text thesaurus may specifically include the following steps:
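The lookup-with-fallback behaviour of step S220 can be sketched as below. The representation of the thesaurus as a dictionary keyed by a (name, birth date, gender) tuple, the function name `to_target_text` and the sample entry are assumptions made for illustration; the disclosure does not specify a storage format.

```python
def to_target_text(record, thesaurus):
    """Return the target name for a (name, birth_date, gender) record.

    Falls back to the original name when the thesaurus has no entry.
    """
    name, _birth_date, _gender = record
    return thesaurus.get(record, name)

# Hypothetical thesaurus: each synonymous record maps to one unified target name.
thesaurus = {
    ("muhamad ali", "1990-01-01", "M"): "muhammad ali",
}

print(to_target_text(("muhamad ali", "1990-01-01", "M"), thesaurus))
print(to_target_text(("siti aminah", "1985-05-05", "F"), thesaurus))
```

Keying on the whole record rather than on the name alone keeps two different people who share a misspelled name from being merged.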

Step S310. Obtain historical information text, historical text contained in the historical information text, and data information corresponding to the historical text.

First, the historical information text containing the historical text is obtained from the historical data, and the data information corresponding to the historical text in the historical information text is obtained. The historical text may be, for example, historical name text, and the data information corresponding to the historical text may be, for example, gender and date of birth data corresponding to the historical name.

Step S320. According to the historical text and the data information corresponding to the historical text, classify the historical information text to obtain multiple sets of similar information texts.

In this example implementation, as shown in FIG. 4, classifying the historical information text according to the historical text and its corresponding data information to obtain multiple sets of similar information texts may specifically include the following steps:

Step S410. Obtain the first classification identifier of the historical information text according to the data information corresponding to the historical text.

The first classification identifier refers to the classification identifier used when classifying the historical information text for the first time. For example, the first classification identifier may be generated according to the gender and date of birth data corresponding to the historical name.

Step S420. Classify the historical information texts according to the first classification identifiers to obtain a plurality of first classification sets, wherein the first classification identifiers of the historical information texts in each of the first classification sets are the same.

Perform the first aggregation classification on the historical information texts according to the first classification identifiers, and aggregate the historical information texts with the same first classification identifiers.
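Steps S410 and S420 amount to a group-by on an identifier derived from the data information. The sketch below assumes records are (name, birth date, gender) tuples and that the identifier is a simple string concatenation; the disclosure states only that the identifier may be generated from gender and date of birth, so the exact encoding is hypothetical.

```python
from collections import defaultdict

def first_classification_id(record):
    """Build the first classification identifier from gender and date of birth."""
    _name, birth_date, gender = record
    return f"{gender}|{birth_date}"

def group_by_first_id(records):
    """Aggregate records that share the same first classification identifier."""
    groups = defaultdict(list)
    for record in records:
        groups[first_classification_id(record)].append(record)
    return dict(groups)

# Hypothetical records, for illustration only.
records = [
    ("ahmad ali", "1990-01-01", "M"),
    ("ahmed ali", "1990-01-01", "M"),
    ("siti aminah", "1985-05-05", "F"),
]
groups = group_by_first_id(records)
```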

Step S430. Obtain the second classification identifier of the historical information text according to the historical text, and classify the historical information text in each first classification set by a preset clustering algorithm according to the second classification identifier to obtain a plurality of second classification sets.

After the first classification of the historical information text, the second classification is performed on the historical information text in each first classification set according to the second classification identifier. The second classification identifier may be generated according to historical text, for example, the second classification identifier may be generated according to historical names.

For the second classification of the historical information text, a preset clustering algorithm such as K-Means (K-means clustering algorithm) can be used. K-Means is one of the most commonly used clustering algorithms. Its main advantages are that it is simple, easy to understand, and fast. The number of clusters must be specified before clustering.

In this example implementation, the historical information texts in the first classification sets may be classified by the preset clustering algorithm as follows: determine the number of clusters corresponding to each first classification set according to the total number of historical information texts in that set; then, according to the second classification identifier, divide the historical information texts in each first classification set into that number of second classification sets by the preset clustering algorithm.

The number of clusters corresponding to each first classification set may be determined as follows: if the total number of historical information texts in the first classification set is greater than or equal to a text quantity threshold, the number of clusters corresponding to that first classification set is determined from the total number of historical information texts and a preset ratio; if the total number of historical information texts in the first classification set is less than the text quantity threshold, a preset number of clusters is used as the number of clusters corresponding to that first classification set.

For example, when the number of historical information texts in the first classification set is greater than or equal to 3, two-thirds of the number of historical information texts, taken as an integer, can be used as the number of clusters; when the number of historical information texts in the first classification set is less than 3, the number of clusters can be set directly to 1.
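The cluster-count rule can be written as a small function. The threshold of 3, the ratio of two-thirds and the default of 1 come from the example above; rounding down to obtain the integer is an assumption, since the text does not say how the integer is taken, and the function name is hypothetical.

```python
import math
from fractions import Fraction

def cluster_count(total, threshold=3, ratio=Fraction(2, 3), default=1):
    """Number of clusters for a first classification set of `total` texts."""
    if total >= threshold:
        # Assumption: round down, but never return fewer than one cluster.
        return max(1, math.floor(total * ratio))
    return default

for n in (1, 2, 3, 10):
    print(n, cluster_count(n))
```

Using `Fraction` avoids floating-point surprises when the product should be an exact integer, as it is for a set of 3 texts.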

Step S440. Obtain an aggregated identifier according to the first classification identifier and the second classification identifier, and reclassify the historical information texts in each of the second classification sets according to the aggregated identifiers to obtain a plurality of third classification sets.

After the historical information text is classified for the second time, a new aggregated identifier can be generated from the first classification identifier and the second classification identifier. The historical information texts in each second classification set are then classified for the third time according to the aggregated identifier, so that historical information texts with the same aggregated identifier are aggregated together.

Step S450. For the historical information texts in each third classification set, calculate the cosine similarity between the historical texts contained in the historical information texts, and put historical information texts whose cosine similarity is greater than the first similarity threshold into the same set of similar information texts.

After the preceding classification steps, the historical information texts have been divided into as many categories as possible. Cosine similarity is then calculated within each category to obtain each group of synonyms in the thesaurus. For example, if the cosine similarity of two historical texts is greater than 0.97, they are put into the same set of similar information texts.

Cosine similarity evaluates the similarity of two vectors by calculating the cosine of the angle between them, and can be applied to the calculation of text similarity.
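A minimal sketch of step S440 follows, assuming the second classification identifier is the cluster label assigned in step S430 and that the aggregated identifier is formed by joining the two identifiers with a separator. Both the encoding and the function names are hypothetical.

```python
from collections import defaultdict

def aggregate_id(first_id, second_id):
    """Combine both classification identifiers into one aggregated identifier."""
    return f"{first_id}#{second_id}"

def third_classification(labelled_texts):
    """Group (text, first_id, second_id) triples by aggregated identifier."""
    sets = defaultdict(list)
    for text, first_id, second_id in labelled_texts:
        sets[aggregate_id(first_id, second_id)].append(text)
    return dict(sets)

# Hypothetical labelled texts, for illustration only.
labelled = [
    ("ahmad ali", "M|1990-01-01", 0),
    ("ahmed ali", "M|1990-01-01", 0),
    ("osman ismail", "M|1990-01-01", 1),
]
third_sets = third_classification(labelled)
```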
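The similarity grouping of step S450 can be illustrated as below. The disclosure does not state how a text is turned into a vector, so the character-bigram counts used here are an assumption, as are the function names and the greedy strategy of comparing each text with the first member of each existing set. Only the 0.97 threshold comes from the text.

```python
import math
from collections import Counter

def bigram_vector(text):
    """Represent a text as counts of its character bigrams (an assumption)."""
    padded = f" {text.lower()} "
    return Counter(padded[i:i + 2] for i in range(len(padded) - 1))

def cosine_similarity(a, b):
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def similar_sets(texts, threshold=0.97):
    """Greedy grouping: a text joins the first set whose seed it resembles."""
    sets = []
    for text in texts:
        vector = bigram_vector(text)
        for group in sets:
            if cosine_similarity(vector, bigram_vector(group[0])) > threshold:
                group.append(text)
                break
        else:
            sets.append([text])
    return sets
```

With short names, a single-character difference moves the bigram cosine well below 0.97, so the threshold and the vectorization would need to be tuned together in practice.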

Step S330. Generate an information text thesaurus according to multiple sets of similar information text sets.

Finally, an information text thesaurus is generated according to multiple sets of similar information texts, which is used for the conversion of historical information text synonyms.
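One way to turn the sets of similar information texts into a thesaurus is sketched below. The disclosure does not say how the unified target text of each set is chosen, so picking the most frequent spelling, with ties broken alphabetically, is purely an assumption. For brevity the sets here hold only the text, on the understanding that members of one set already share date of birth and gender.

```python
from collections import Counter

def build_thesaurus(similar_sets):
    """Map every text in a similar set to one unified target text."""
    thesaurus = {}
    for group in similar_sets:
        counts = Counter(group)
        # Assumption: the most frequent spelling wins; ties go to the
        # alphabetically first spelling.
        target = min(counts, key=lambda text: (-counts[text], text))
        for text in counts:
            thesaurus[text] = target
    return thesaurus

# Hypothetical similar sets, for illustration only.
sets = [["muhammad ali", "muhamad ali", "muhammad ali"], ["siti aminah"]]
print(build_thesaurus(sets))
```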

In addition, the text data can be marked first by deep learning, and then the corresponding synonymous text can be calculated by using the deep learning related algorithm, so as to achieve the same conversion effect.

In step S230, word segmentation processing is performed on the target text to obtain each valid text component contained in the target text.

In this example implementation, the target text may be preprocessed before word segmentation is performed on it. Specifically, invalid text components in the target text are filtered out, and word segmentation is then performed on the filtered target text to obtain each valid text component contained in the target text.

Preprocessing can include removing special characters, such as ")", "(", "." and other symbols, and also removing meaningless special words, such as "unknown", "B/O" and "Baby of".
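Preprocessing and segmentation for step S230 can be sketched as follows. The special characters and noise words are the ones listed above; replacing special characters with spaces, matching noise words case-insensitively, and splitting on whitespace (as in the related embodiment) are assumptions, and the names `SPECIAL_CHARS`, `NOISE_PHRASES` and `valid_components` are hypothetical.

```python
import re

# Special characters from the description, each replaced by a space.
SPECIAL_CHARS = str.maketrans({")": " ", "(": " ", ".": " "})

# Meaningless special words from the description.
NOISE_PHRASES = ["unknown", "B/O", "Baby of"]

# Match a noise phrase only when it is delimited by whitespace or string ends.
_NOISE_RE = re.compile(
    r"(?<!\S)(?:" + "|".join(re.escape(p) for p in NOISE_PHRASES) + r")(?!\S)",
    re.IGNORECASE,
)

def valid_components(target_text):
    """Filter invalid components, then split the text on whitespace."""
    cleaned = target_text.translate(SPECIAL_CHARS)
    cleaned = _NOISE_RE.sub(" ", cleaned)
    return cleaned.split()

print(valid_components("Baby of Siti (Aminah)."))
```

Special characters are stripped before the noise words so that a phrase wrapped in parentheses is still recognised.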

In step S240, a pre-generated text component rule set is acquired, and a valid text component of each valid text component that does not belong to the text component rule set is used as a standard text component.

Through the pre-generated text component rule set, the text components that are not needed for normalization can be deleted from the valid text components, leaving only the valid text components required for normalization. Taking the normalization of Malay names as an example, the pre-generated text component rule set allows the variable words in a Malay name to be deleted so that only the fixed words remain. These remaining words are the ones used in the final normalization, i.e. the standard text components.

In this example embodiment, as shown in Figure 5, the generation method of the text component rule set can specifically include the following steps:

Step S510. Obtain the historical text contained in the historical information text.

First, the contained historical text in the historical information text, such as the historical name text, is obtained.

Step S520. Perform word segmentation on the historical text to obtain each effective historical text component contained in the historical text.

Before word segmentation is performed on the historical text, the historical text can also be preprocessed to remove special characters and meaningless special words, so as to obtain each valid historical text component contained in the historical text.

Step S530. Calculate the cosine similarity between the valid historical text components and the text components in the text component rule set.

Then the cosine similarity is calculated between the effective historical text components and the existing text components in the text component rule set.

Step S540. If the cosine similarity between the valid historical text components and the text components in the text component rule set is greater than the second similarity threshold, add the valid historical text components to the text component rule set.

For example, when the cosine similarity between the valid historical text component and any text component in the text component rule set is greater than 0.95, the valid historical text component can be marked and added to the text component rule set.

In step S250, the standardized text corresponding to the original text is obtained according to the standard text components.

Finally, the retained standard text components are merged in their original order to obtain the standardized text corresponding to the original text.

With the text normalization processing method in this example embodiment, any number of words can be adaptively normalized to represent the core normalized part of the original text, instead of artificially specifying the number of words.
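Steps S240 and S250 reduce to filtering against the rule set and joining what remains, as the sketch below shows. The rule-set entries and sample names are hypothetical, and the case-insensitive comparison and space-separated merge are assumptions.

```python
def standardize(components, rule_set):
    """Drop components found in the rule set and merge the rest in order."""
    kept = [c for c in components if c.lower() not in rule_set]
    return " ".join(kept)

# Hypothetical rule set of variable name words, for illustration only.
rules = {"haji", "bin", "binti"}

print(standardize(["Haji", "Ahmad", "bin", "Ali"], rules))
print(standardize(["Siti", "Nur", "Aminah", "binti", "Osman"], rules))
```

The two calls keep two and four words respectively, which illustrates that the length of the result follows from the input rather than from a preset word count.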

FIG. 6 shows a complete flowchart of text normalization processing in a specific embodiment of the present disclosure, which can be applied to the normalization of Malay name texts and is an example of the above steps in this exemplary embodiment. The specific steps of the flowchart are as follows:

Step S602. Obtain the Malay name, date of birth and gender text.

Step S604. Determine whether the name, date of birth and gender text is in the name, date of birth and gender thesaurus.

The name, date of birth and gender text is compared with the name, date of birth and gender thesaurus. If the text is in the thesaurus, go to step S606 and use the synonym; if not, go to step S608 and use the original name word.

Step S606. Convert the name text into synonymous name text.

Step S608. Name text preprocessing.

Step S610. Name text segmentation.

Step S612. Obtain a name word list.

Step S614. Obtain a name rule set.

Step S616. Match the name word list with the name rule set.

Step S618. Determine whether the name word is in the name rule set.

Match the name word segmentation list with the name rule set. If the name word is not in the name rule set, keep it and go to step S620; if the name word is in the name rule set, discard the name word.

Step S620. Obtain the reserved name word list.

Step S622. Obtain the normalized name text.

The final retained name word list is merged in order to obtain the normalized name text.
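Steps S602 to S622 can be illustrated with the following minimal Python sketch. It is not the implementation of the present disclosure: the function name normalize_name, the record field names, the dictionary-based thesaurus, the upper-casing and whitespace segmentation, and the example rule words are all assumptions made for illustration.

```python
import re

def normalize_name(record, thesaurus, rule_set):
    """Sketch of steps S602-S622 for one record (hypothetical data layout)."""
    # S602: the record carries the name, date of birth and gender text
    key = (record["name"], record["birth_date"], record["gender"])
    # S604-S606: use the synonymous name if the record is in the thesaurus,
    # otherwise keep the original name
    name = thesaurus.get(key, record["name"])
    # S608: preprocessing (assumed here: drop brackets and periods, upper-case)
    name = re.sub(r"[().]", " ", name).upper()
    # S610-S612: segmentation into a name word list (assumed: whitespace split)
    words = name.split()
    # S614-S620: keep only the words that are not in the name rule set
    kept = [w for w in words if w not in rule_set]
    # S622: merge the retained words in their original order
    return " ".join(kept)

# Hypothetical thesaurus entry and rule words, for illustration only
thesaurus = {("MOHD ALI BIN ABU", "1990-01-01", "M"): "MUHAMMAD ALI BIN ABU"}
rule_set = {"BIN", "BINTI", "MOHD", "MUHAMMAD"}
record = {"name": "MOHD ALI BIN ABU", "birth_date": "1990-01-01", "gender": "M"}
print(normalize_name(record, thesaurus, rule_set))
```

Because the number of retained words depends only on which words fall outside the rule set, no word count has to be fixed in advance.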

Because the sources of name data are complex and Malay names have special characteristics, and because of problems such as misspellings, abbreviations, and English words written together without spaces, name synonym discovery needs to be performed on the full data in order to improve the accuracy and recall of name normalization. FIG. 7 is a complete flowchart of generating an information text thesaurus according to an embodiment of the present disclosure; the information text thesaurus is the name, date of birth and gender thesaurus in step S604 above. The specific steps of the flowchart are as follows:

Step S702. Acquire full data.

Step S704. Generate a first classification ID according to the date of birth and gender.

For the name, date of birth and gender data in the full data from multiple sources, an ID, namely the first classification ID, is generated according to the gender and date of birth.

Step S706. Aggregate the data according to the first classification ID.

The data is aggregated according to the first classification ID, that is, the same IDs are aggregated together.

Step S708. Classify the aggregated data according to the second classification ID, wherein the second classification ID is generated according to the name.

The data aggregated by the first classification ID is then classified by name to obtain the second classification ID. The classification algorithm used is the K-means algorithm. The number of clusters is determined as follows: when the number of names in the name list is greater than 2, two-thirds of that number is used as the number of clusters; when the number of names in the name list is less than or equal to 2, the number of clusters is set to 1. The purpose of this strategy is to divide the data into as many classes as possible, so as to reduce the number of computations in the subsequent similarity calculation and improve computational efficiency.
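The cluster-number strategy of step S708 can be written as a small function. This is only a sketch: the function name cluster_count is hypothetical, and rounding two-thirds down to an integer is an assumption, since the text does not state how a fractional result is handled.

```python
def cluster_count(num_names: int) -> int:
    """Number of K-means clusters for one first-classification group."""
    if num_names > 2:
        # two-thirds of the name count; rounding down is an assumption
        return max(1, (2 * num_names) // 3)
    # two names or fewer: a single cluster
    return 1

for n in (1, 2, 3, 10):
    print(n, cluster_count(n))
```

The returned value would then be supplied as the number of clusters to whichever K-means implementation is used.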

Step S710. Generate an aggregate ID according to the first classification ID and the second classification ID.

A new aggregate ID, i.e., NID, is generated according to the first classification ID and the second classification ID.

Step S712. Aggregate the data according to the aggregate ID.

The data is aggregated according to the NID, that is, data with the same NID is aggregated together.
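Steps S704 to S712 can be sketched as follows. The sketch assumes that the second classification ID is a cluster label already produced by the clustering of step S708, and that the IDs are formed by simple string concatenation; neither the ID format nor the names first_classification_id and aggregate_by_nid come from the present disclosure.

```python
from collections import defaultdict

def first_classification_id(birth_date, gender):
    # S704: ID derived from date of birth and gender (format is an assumption)
    return f"{birth_date}|{gender}"

def aggregate_by_nid(records, cluster_labels):
    """Group names by NID = first classification ID + second classification ID."""
    groups = defaultdict(list)
    for record, label in zip(records, cluster_labels):
        first_id = first_classification_id(record["birth_date"], record["gender"])
        # S710: combine both IDs into the aggregate ID (NID)
        nid = f"{first_id}#{label}"
        # S712: records with the same NID are aggregated together
        groups[nid].append(record["name"])
    return dict(groups)

records = [
    {"name": "ALI ABU", "birth_date": "1990-01-01", "gender": "M"},
    {"name": "ALI ABOO", "birth_date": "1990-01-01", "gender": "M"},
    {"name": "TAN WEI", "birth_date": "1990-01-01", "gender": "M"},
]
print(aggregate_by_nid(records, cluster_labels=[0, 0, 1]))
```

Only names that share an NID need to be compared in the similarity calculation of step S714, which is what keeps the number of comparisons small.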

Step S714. Calculate the name similarity within the aggregated data.

Step S716. Determine whether the name similarity is greater than 0.97.

If the name similarity is greater than 0.97, the similar data is passed to step S718.

Step S718. Manually confirm similar data.

For some special name cases, manual intervention can be performed, and it will not affect the normalization processing of other cases.

Step S720. Generate a thesaurus of name, date of birth and gender.
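The similarity test of steps S714 and S716 can be sketched as below. The present disclosure specifies cosine similarity and the 0.97 threshold but, in this passage, not how a name is turned into a vector; the character-bigram counts used here, and the function names, are assumptions chosen only to make the sketch self-contained.

```python
import math
from collections import Counter
from itertools import combinations

def char_bigrams(text):
    # Assumed vectorization: counts of adjacent character pairs
    return Counter(text[i:i + 2] for i in range(len(text) - 1))

def cosine_similarity(a, b):
    va, vb = char_bigrams(a), char_bigrams(b)
    dot = sum(va[g] * vb[g] for g in va)
    norm = (math.sqrt(sum(c * c for c in va.values()))
            * math.sqrt(sum(c * c for c in vb.values())))
    return dot / norm if norm else 0.0

def similar_pairs(names, threshold=0.97):
    """S714-S716: pairs within one NID group whose similarity exceeds the threshold."""
    return [(a, b) for a, b in combinations(names, 2)
            if cosine_similarity(a, b) > threshold]

group = ["ALI AHMAD", "ALI AHMAD", "SITI"]
print(similar_pairs(group))
```

The pairs returned by similar_pairs correspond to the similar data that is handed to manual confirmation in step S718 before the thesaurus is generated in step S720.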

A Malay name is composed of title + duplicate name + first name + title + parent title + parent duplicate name + parent first name, of which the titles, the duplicate name, the parent title and the parent duplicate name are all variable parts that may change over time. Therefore, it is necessary to collect the extracted feature word types, such as titles (including parent titles), duplicate names and title names, to form a name rule set. The rule set needs to be discovered before normalization. FIG. 8 is a complete flowchart of generating a text component rule set in an embodiment of the present disclosure, where the text component rule set is the name rule set in step S616 above. The specific steps of the flowchart are as follows:

Step S802. Obtain the Malay name text.

Step S804. Name text preprocessing.

Preprocessing of Malay names can include removing special characters such as ")", "(" and "."; in addition, meaningless special words such as "unknown", "B/O" and "Baby of" also need to be removed.

Step S806. Name text segmentation.

Tokenize the name text to get a list of name words.
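Steps S804 and S806 can be sketched as follows. The character and word lists are the examples given above; the regular expressions, the case-insensitive matching and the whitespace-based segmentation are assumptions, and the function names are hypothetical.

```python
import re

# Special characters named in the text: ")", "(" and "."
SPECIAL_CHARS = re.compile(r"[().]")
# Meaningless words named in the text; case-insensitive matching is an assumption
MEANINGLESS = re.compile(r"\b(?:unknown|baby of|B/O)\b", re.IGNORECASE)

def preprocess(name):
    """S804: remove meaningless words, then special characters."""
    name = MEANINGLESS.sub(" ", name)
    return SPECIAL_CHARS.sub(" ", name)

def tokenize(name):
    """S806: split the preprocessed name into a name word list."""
    return preprocess(name).split()

print(tokenize("B/O Aminah (Binti) Ali."))
```

The resulting name word list is the input to the similarity comparison of step S810.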

Step S808. Obtain a name rule set.

Step S810. Compare the similarity between the name text word segmentation and the name rule set.

Step S812. Determine whether the similarity is greater than 0.95.

If the similarity is greater than 0.95, the name word is regarded as a possible rule set entry, and the process goes to step S814.

Step S814. Manual annotation.

The possible rule set entries are annotated manually.

Step S816. Determine whether the requirements are met, and if so, add the name word to the name rule set.
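Steps S808 to S816 can be sketched as the loop below. The similarity measure and the manual annotation are passed in as callables so that the sketch does not commit to a particular implementation; the function name discover_rule_words, its parameters, and the stand-in similarity used in the example are all hypothetical.

```python
def discover_rule_words(words, rule_set, similarity, confirm, threshold=0.95):
    """Return a copy of rule_set extended with manually confirmed candidates."""
    updated = set(rule_set)
    for word in words:
        if word in updated:
            continue  # already a rule word
        # S810-S812: compare the name word with every existing rule word
        if any(similarity(word, rule) > threshold for rule in rule_set):
            # S814-S816: manual annotation decides whether the word is added
            if confirm(word):
                updated.add(word)
    return updated

# Stand-in similarity for illustration only (case-insensitive equality),
# not the cosine similarity described in the text
same_ignoring_case = lambda a, b: 1.0 if a.lower() == b.lower() else 0.0
print(discover_rule_words(["bin", "Ali"], {"BIN"}, same_ignoring_case, lambda w: True))
```

In practice the confirm callable would be replaced by the manual annotation step, so that no word enters the rule set without review.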

It should be noted that although the various steps of the methods of the present disclosure are depicted in the figures in a particular order, this does not require or imply that the steps must be performed in that particular order, or that all illustrated steps must be performed to achieve the desired result. Additionally or alternatively, certain steps may be omitted, multiple steps may be combined into one step for execution, and/or one step may be decomposed into multiple steps for execution, and the like.

Further, the present disclosure also provides a text standardization processing apparatus. Referring to FIG. 9, the text standardization processing apparatus may include an original information text acquisition module 910, an original information text matching module 920, a valid text component acquisition module 930, a standard text component determination module 940 and a normalized text generation module 950. Specifically:

The original information text acquisition module 910 is configured to acquire the original information text, wherein the original information text includes the original text to be processed;

The original information text matching module 920 is configured to perform matching on the original information text according to the pre-generated information text thesaurus to obtain the target text corresponding to the original text in the original information text;

The valid text component acquisition module 930 is configured to perform word segmentation processing on the target text to obtain each valid text component contained in the target text;

The standard text component determination module 940 is configured to acquire a pre-generated text component rule set, and to use the valid text components that do not belong to the text component rule set as standard text components;

The normalized text generation module 950 is configured to obtain normalized text corresponding to the original text according to the standard text components.

In some exemplary embodiments of the present disclosure, the original information text matching module 920 may include a first target text determination unit and a second target text determination unit. Specifically:

The first target text determination unit is configured to, if there is a target information text related to the original information text in the information text thesaurus, use the target text contained in the target information text as the target text corresponding to the original text;

The second target text determination unit is configured to, if there is no target information text related to the original information text in the information text thesaurus, use the original text as the target text.

In some exemplary embodiments of the present disclosure, the valid text component acquisition module 930 may include an invalid component filtering unit and a target text word segmentation unit. Specifically:

The invalid component filtering unit is configured to filter out the invalid text components in the target text;

The target text word segmentation unit is configured to perform word segmentation processing on the filtered target text to obtain each valid text component contained in the target text.

In some exemplary embodiments of the present disclosure, the text standardization processing apparatus provided by the present disclosure may further include an information text thesaurus generating module. Specifically:

The information text thesaurus generating module may include a historical information text acquisition unit, a historical information text classification unit, and a thesaurus generating unit.

The historical information text acquisition unit is configured to acquire historical information text, the historical text contained in the historical information text, and the data information corresponding to the historical text;

The historical information text classification unit is configured to classify the historical information text according to the historical text and the data information corresponding to the historical text to obtain multiple sets of similar information texts;

The thesaurus generating unit is configured to generate the information text thesaurus from the multiple sets of similar information texts.

In some exemplary embodiments of the present disclosure, the historical information text classification unit may include a first classification identification determination unit, a first classification set determination unit, a second classification set determination unit, a third classification set determination unit, and a cosine similarity calculation unit. Specifically:

The first classification identification determining unit is configured to obtain the first classification identification of the historical information text according to the data information corresponding to the historical text;

The first classification set determining unit is configured to classify the historical information texts according to the first classification identifiers to obtain a plurality of first classification sets, wherein the first classification identifiers of the historical information texts in each first classification set are the same;

The second classification set determining unit is configured to obtain a second classification identification of the historical information text according to the historical text, and to classify the historical information texts in each first classification set again according to the second classification identification through a preset clustering algorithm, to obtain multiple second classification sets;

The third classification set determining unit is configured to obtain an aggregated identifier according to the first classification identifier and the second classification identifier, and to reclassify the historical information texts in each of the second classification sets according to the aggregated identifier to obtain a plurality of third classification sets;

The cosine similarity calculation unit is configured to, for each historical information text in the third classification set, calculate the cosine similarity between the historical texts contained in the historical information texts, and to put the historical information texts whose cosine similarity is greater than the first similarity threshold into the same set of similar information texts.

In some exemplary embodiments of the present disclosure, the second classification set determination unit may include a cluster number determination unit and an information text division unit. Specifically:

The cluster number determination unit is configured to determine the number of clusters corresponding to each first classification set according to the total number of historical information texts in each first classification set;

The information text dividing unit is configured to divide the historical information texts in each first classification set into a plurality of second classification sets corresponding to the number of clusters by using a preset clustering algorithm according to the second classification identification.

In some exemplary embodiments of the present disclosure, the cluster number determination unit may include a first cluster number determination unit and a second cluster number determination unit. Specifically:

The first cluster number determination unit is configured to, if the total number of historical information texts in the first classification set is greater than or equal to the text quantity threshold, determine the number of clusters corresponding to the first classification set according to the total number of historical information texts and the preset ratio;

The second cluster number determination unit is configured to obtain a preset number of clusters as the number of clusters corresponding to the first classification set if the total number of historical information texts in the first classification set is less than or equal to the text quantity threshold.

In some exemplary embodiments of the present disclosure, the text standardization processing apparatus provided by the present disclosure may further include a text component rule set generating module. Specifically:

The text component rule set generation module may include a historical text acquisition unit, a valid text component acquisition unit, a cosine similarity calculation unit, and a rule set generation unit.

The historical text acquisition unit is configured to acquire the historical text contained in the historical information text;

The valid text component acquisition unit is configured to perform word segmentation processing on the historical text to obtain each valid historical text component contained in the historical text;

The cosine similarity calculation unit is configured to calculate the cosine similarity between the valid historical text components and the text components in the text component rule set;

The rule set generation unit is configured to add the valid historical text components to the text component rule set if the cosine similarity between the valid historical text components and the text components in the text component rule set is greater than a second similarity threshold.

The specific details of each module/unit in the above text standardization processing apparatus have been described in detail in the corresponding method embodiments, and will not be repeated here.

FIG. 10 shows a schematic structural diagram of a computer system suitable for implementing an electronic device according to an embodiment of the present invention.

It should be noted that the computer system 1000 of the electronic device shown in FIG. 10 is only an example, and should not impose any limitations on the functions and scope of use of the embodiments of the present invention.

As shown in FIG. 10, a computer system 1000 includes a central processing unit (CPU) 1001, which can perform various appropriate actions and processes according to a program stored in a read-only memory (ROM) 1002 or a program loaded from a storage section 1008 into a random access memory (RAM) 1003. The RAM 1003 also stores various programs and data required for system operation. The CPU 1001, the ROM 1002, and the RAM 1003 are connected to each other through a bus 1004. An input/output (I/O) interface 1005 is also connected to the bus 1004.

The following components are connected to the I/O interface 1005: an input section 1006 including a keyboard, a mouse, etc.; an output section 1007 including a cathode ray tube (CRT), a liquid crystal display (LCD), etc., and a speaker, etc.; a storage section 1008 including a hard disk, etc.; and a communication section 1009 including a network interface card such as a LAN card, a modem, and the like. The communication section 1009 performs communication processing via a network such as the Internet. A drive 1010 is also connected to the I/O interface 1005 as needed. A removable medium 1011, such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, etc., is mounted on the drive 1010 as needed so that a computer program read therefrom is installed into the storage section 1008 as needed.

In particular, the processes described above with reference to the flowcharts may be implemented as computer software programs according to embodiments of the present invention. For example, embodiments of the present invention include a computer program product comprising a computer program carried on a computer-readable medium, the computer program containing program code for performing the method illustrated in the flowchart. In such an embodiment, the computer program may be downloaded and installed from the network via the communication portion 1009, and/or installed from the removable medium 1011. When the computer program is executed by the central processing unit (CPU) 1001, various functions defined in the system of the present disclosure are executed.

It should be noted that the computer-readable medium shown in the present disclosure may be a computer-readable signal medium or a computer-readable storage medium, or any combination of the two. The computer-readable storage medium can be, for example, but is not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus or device, or a combination of any of the above. More specific examples of computer-readable storage media may include, but are not limited to, electrical connections with one or more wires, portable computer disks, hard disks, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disc read-only memory (CD-ROM), optical storage devices, magnetic storage devices, or any suitable combination of the foregoing. In this disclosure, a computer-readable storage medium may be any tangible medium that contains or stores a program that can be used by or in conjunction with an instruction execution system, apparatus, or device. In the present disclosure, a computer-readable signal medium, by contrast, may include a data signal propagated in baseband or as part of a carrier wave, carrying computer-readable program code therein. Such propagated data signals may take a variety of forms, including but not limited to electromagnetic signals, optical signals, or any suitable combination of the foregoing. A computer-readable signal medium can also be any computer-readable medium other than a computer-readable storage medium that can transmit, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device. Program code embodied on a computer-readable medium may be transmitted using any suitable medium including, but not limited to, wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.

The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowcharts or block diagrams may represent a module, segment, or portion of code that contains one or more executable instructions for implementing the specified logical functions. It should also be noted that, in some alternative implementations, the functions noted in the blocks may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It is also noted that each block of the block diagrams or flowchart illustrations, and combinations of blocks in the block diagrams or flowchart illustrations, can be implemented by special-purpose hardware-based systems that perform the specified functions or operations, or by a combination of dedicated hardware and computer instructions.

As another aspect, the present disclosure also provides a computer-readable medium. The computer-readable medium may be included in the electronic device described in the above embodiments; it may also exist alone without being assembled into the electronic device. The above-mentioned computer-readable medium carries one or more programs, and when the one or more programs are executed by an electronic device, they cause the electronic device to implement the methods described in the above-mentioned embodiments.

It should be noted that although several modules or units of the apparatus for performing actions are mentioned in the above detailed description, this division is not mandatory. Indeed, in accordance with embodiments of the present disclosure, the features and functions of two or more modules described above may be embodied in one module. Conversely, the features and functions of one module described above can be further divided into multiple modules to be embodied.

Other embodiments of the present disclosure will readily occur to those skilled in the art upon consideration of the specification and practice of the invention disclosed herein. This disclosure is intended to cover any variations, uses, or adaptations of this disclosure that follow the general principles of this disclosure and include common general knowledge or techniques in the technical field not disclosed by this disclosure.

It is to be understood that the present disclosure is not limited to the precise structures described above and illustrated in the accompanying drawings, and that various modifications and changes may be made without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.

Claims

1. A text standardization processing method, comprising:

acquiring original information text, wherein the original information text includes original text to be processed;

matching the original information text according to a pre-generated information text thesaurus to obtain target text corresponding to the original text in the original information text;

performing word segmentation processing on the target text to obtain each valid text component contained in the target text;

acquiring a pre-generated text component rule set, and using the valid text components that do not belong to the text component rule set among the valid text components as standard text components; and

obtaining standardized text corresponding to the original text according to the standard text components.

2. The text standardization processing method according to claim 1, wherein matching the original information text according to the pre-generated information text thesaurus to obtain the target text corresponding to the original text in the original information text comprises:

if there is a target information text related to the original information text in the information text thesaurus, using the target text contained in the target information text as the target text corresponding to the original text; and

if there is no target information text related to the original information text in the information text thesaurus, using the original text as the target text.

3. The text standardization processing method according to claim 1, wherein performing word segmentation processing on the target text to obtain each valid text component contained in the target text comprises:

filtering out the invalid text components in the target text; and

performing word segmentation processing on the filtered target text to obtain each valid text component contained in the target text.

4. The text standardization processing method according to claim 1, wherein the method for generating the information text thesaurus comprises:

acquiring historical information text, the historical text contained in the historical information text, and the data information corresponding to the historical text;

classifying the historical information text according to the historical text and the data information corresponding to the historical text to obtain multiple sets of similar information texts; and

generating the information text thesaurus according to the multiple sets of similar information texts.

5. The text standardization processing method according to claim 4, wherein classifying the historical information text according to the historical text and the data information corresponding to the historical text to obtain multiple sets of similar information texts comprises:

obtaining a first classification identifier of the historical information text according to the data information corresponding to the historical text;

classifying the historical information text according to the first classification identifier to obtain a plurality of first classification sets, wherein the first classification identifiers of the historical information texts in each of the first classification sets are the same;

obtaining a second classification identifier of the historical information text according to the historical text, and reclassifying the historical information texts in each of the first classification sets by a preset clustering algorithm according to the second classification identifier, to obtain a plurality of second classification sets;

obtaining an aggregate identifier according to the first classification identifier and the second classification identifier, and reclassifying the historical information texts in each of the second classification sets according to the aggregate identifier to obtain a plurality of third classification sets; and

for each of the historical information texts in the third classification set, calculating the cosine similarity between the historical texts contained in the historical information texts, and putting the historical information texts whose cosine similarity is greater than a first similarity threshold into the same set of similar information texts.

6. The text standardization processing method according to claim 5, wherein reclassifying the historical information texts in each of the first classification sets by a preset clustering algorithm according to the second classification identifier to obtain a plurality of second classification sets comprises:

determining the number of clusters corresponding to each of the first classification sets according to the total number of the historical information texts in each of the first classification sets; and

dividing, according to the second classification identifier, the historical information texts in each of the first classification sets into a plurality of second classification sets corresponding to the number of clusters by a preset clustering algorithm.

7. The text standardization processing method according to claim 6, wherein determining the number of clusters corresponding to each of the first classification sets according to the total number of the historical information texts in each of the first classification sets comprises:

if the total number of the historical information texts in the first classification set is greater than or equal to the text quantity threshold, determining the number of clusters corresponding to the first classification set according to the total number of the historical information texts and a preset ratio; and

if the total number of the historical information texts in the first classification set is less than or equal to the text quantity threshold, acquiring a preset number of clusters as the number of clusters corresponding to the first classification set.

8. The text standardization processing method according to claim 1, wherein the method for generating the text component rule set comprises:

acquiring the historical text contained in the historical information text;

performing word segmentation processing on the historical text to obtain each valid historical text component contained in the historical text;

calculating the cosine similarity between the valid historical text components and the text components in the text component rule set; and

if the cosine similarity between the valid historical text components and the text components in the text component rule set is greater than a second similarity threshold, adding the valid historical text components to the text component rule set.

9. A text standardization processing apparatus, comprising:

an original information text acquisition module, configured to acquire original information text, wherein the original information text includes original text to be processed;

an original information text matching module, configured to perform matching on the original information text according to a pre-generated information text thesaurus to obtain target text corresponding to the original text in the original information text;

a valid text component acquisition module, configured to perform word segmentation processing on the target text to obtain each valid text component contained in the target text;

a standard text component determination module, configured to acquire a pre-generated text component rule set, and to use the valid text components that do not belong to the text component rule set among the valid text components as standard text components; and

a standardized text generation module, configured to obtain standardized text corresponding to the original text according to the standard text components.

10. An electronic device, comprising:

one or more processors; and

a memory for storing one or more programs which, when executed by the one or more processors, cause the one or more processors to implement the text standardization processing method according to any one of claims 1 to 8.

11. A computer-readable medium on which a computer program is stored, wherein the program, when executed by a processor, implements the text standardization processing method according to any one of claims 1 to 8.