WO2017071370A1

WO2017071370A1 - Label processing method and device

Info

Publication number: WO2017071370A1
Application number: PCT/CN2016/094417
Authority: WO
Inventors: 张传武; 梅峰; 李晓明; 邢加和
Original assignee: 华为技术有限公司
Priority date: 2015-10-30
Filing date: 2016-08-10
Publication date: 2017-05-04

Abstract

A label processing method and device are provided. The method comprises: acquiring a to-be-processed label (S101); selecting, from a similar word database that includes similar word sets, the similar word set to which the to-be-processed label belongs (S102); and using the representative word of that similar word set as the substitute label of the to-be-processed label (S103). Because each similar word set in the similar word database includes member words and a representative word selected from those member words, the method and device replace the to-be-processed label with the representative word of its similar word set, so that different but similar labels end up with the same substitute label. Therefore, by configuring and maintaining the similar word sets in the similar word database, uniform labels can be used to represent the same category or content.

Description

Label processing method and device

The present application claims priority to Chinese Patent Application No. 201510727878.X, filed on October 30, 2015 and entitled "A Label Processing Method and Apparatus", the entire contents of which are incorporated herein by reference.

Technical field

The present application relates to the field of electronic information, and in particular, to a label processing method and apparatus.

Background

A tag (label) is used to mark the category or content of a target. It is a way of organizing content and a special kind of metadata: a summary of the labeling party's subjective impression of a resource. Users use tags to describe and classify resources so that the resources are easier to search and share.

However, even the same category or content can be expressed with a variety of words. For example, "fashion" and "fashionable" express the same meaning. As a result, the labels currently output by a label system (whether generated automatically by the system or defined by users) may be similar words; that is, the label system may use different but similar labels to represent the same category or content. For example, the label that a label system outputs for a garment is sometimes "fashion" and sometimes "fashionable".

It can be seen that how to use a uniform label to express the same category or content has become an urgent problem to be solved.

Summary of the invention

The present application provides a label processing method and apparatus, with the aim of solving the problem of how to use the same label to express the same category or content.

In order to achieve the above object, the present application provides the following technical solutions:

A first aspect of the present application provides a label processing method, including:

acquiring a label to be processed;

selecting, from a similar vocabulary, the similar word set to which the to-be-processed label belongs, where the similar vocabulary includes similar word sets, each similar word set includes member words and a representative word, and the representative word is selected from the member words; and

using the representative word of the similar word set to which the to-be-processed label belongs as the substitute label of the to-be-processed label.

Based on the first aspect, in a first implementation manner of the first aspect, the specific process of selecting the representative word from the member words includes:

counting the number of times each member word has historically been used as a label to be processed;

calculating the inverse document frequency (IDF) value of each member word;

for each member word, calculating the product of the number of times it has historically been used as a label to be processed and its IDF value; and

using the member word with the largest product as the representative word.

Based on the first aspect or the first implementation manner of the first aspect, in a second implementation manner of the first aspect, the selecting, from the similar vocabulary, of the similar word set to which the to-be-processed label belongs includes:

searching the similar vocabulary for the to-be-processed label;

if the similar vocabulary includes the to-be-processed label, using the similar word set that includes the to-be-processed label as the similar word set to which the to-be-processed label belongs; and

if the similar vocabulary does not include the to-be-processed label, searching the similar vocabulary for a similar word of the to-be-processed label, and using the similar word set that includes the similar word of the to-be-processed label as the similar word set to which the to-be-processed label belongs.

Based on the second implementation manner of the first aspect, in a third implementation manner of the first aspect, the searching of the similar vocabulary for a similar word of the to-be-processed label includes:

finding first-type related words in a corpus, the first-type related words being words that appear together with the to-be-processed label in a corpus with a frequency greater than a first threshold;

finding second-type related words in the corpus, the second-type related words being words that appear together with the first-type related words in a corpus with a frequency greater than a second threshold; and

counting the frequency with which each second-type related word appears together with the to-be-processed label in a corpus, and using the second-type related words whose counted frequency is less than a third threshold as the similar words of the to-be-processed label.

Based on the first aspect or the first implementation manner of the first aspect, in a fourth implementation manner of the first aspect, after the searching of the similar vocabulary for a similar word of the to-be-processed label, the method further includes:

adding the to-be-processed label to the similar word set that includes the similar word of the to-be-processed label.

Based on the first aspect or the first implementation manner of the first aspect, in a fifth implementation manner of the first aspect, the method further includes:

if the similar word set to which the to-be-processed label belongs does not exist in the similar vocabulary, adding the to-be-processed label to the similar vocabulary, where the to-be-processed label is both the member word and the representative word of a new similar word set in the similar vocabulary.

A second aspect of the present application provides a label processing apparatus, including:

an obtaining module, configured to obtain a label to be processed;

a processing module, configured to select, from a similar vocabulary, the similar word set to which the to-be-processed label belongs, where the similar vocabulary includes similar word sets, each similar word set includes member words and a representative word, and the representative word is selected from the member words; and

a substitution module, configured to use the representative word of the similar word set to which the to-be-processed label belongs as the substitute label of the to-be-processed label.

Based on the second aspect, in a first implementation manner of the second aspect, the apparatus further includes:

a determining module, configured to: count the number of times each member word has historically been used as a label to be processed; calculate the inverse document frequency (IDF) value of each member word; for each member word, calculate the product of the number of times it has historically been used as a label to be processed and its IDF value; and use the member word with the largest product as the representative word.

Based on the second aspect or the first implementation manner of the second aspect, in a second implementation manner of the second aspect, the processing module being configured to select, from a similar vocabulary, the similar word set to which the to-be-processed label belongs includes:

the processing module is specifically configured to: search the similar vocabulary for the to-be-processed label; if the similar vocabulary includes the to-be-processed label, use the similar word set that includes the to-be-processed label as the similar word set to which the to-be-processed label belongs; and if the similar vocabulary does not include the to-be-processed label, search the similar vocabulary for a similar word of the to-be-processed label, and use the similar word set that includes the similar word of the to-be-processed label as the similar word set to which the to-be-processed label belongs.

Based on the second implementation manner of the second aspect, in a third implementation manner of the second aspect, the processing module being configured to search the similar vocabulary for a similar word of the to-be-processed label includes:

the processing module is specifically configured to: find first-type related words in a corpus, the first-type related words being words that appear together with the to-be-processed label in a corpus with a frequency greater than a first threshold; find second-type related words in the corpus, the second-type related words being words that appear together with the first-type related words in a corpus with a frequency greater than a second threshold; and count the frequency with which each second-type related word appears together with the to-be-processed label in a corpus, and use the second-type related words whose counted frequency is less than a third threshold as the similar words of the to-be-processed label.

Based on the second aspect or the first implementation of the second aspect, in a fourth implementation manner of the second aspect, the processing module is further configured to:

after searching the similar vocabulary for a similar word of the to-be-processed label, add the to-be-processed label to the similar word set that includes the similar word of the to-be-processed label.

Based on the second aspect or the first implementation of the second aspect, in a fifth implementation manner of the second aspect, the processing module is further configured to:

if the similar word set to which the to-be-processed label belongs does not exist in the similar vocabulary, add the to-be-processed label to the similar vocabulary, where the to-be-processed label is both the member word and the representative word of a new similar word set in the similar vocabulary.

The label processing method and apparatus of the present application acquire a label to be processed, select the similar word set to which the label belongs from a similar vocabulary that includes similar word sets, and use the representative word of that similar word set as the substitute label of the label to be processed. Because every similar word set in the similar vocabulary includes member words and a representative word selected from those member words, the method and apparatus replace the label to be processed with the representative word of a similar word set. For different but similar labels, the final substitute label is therefore the same (namely, the representative word of the similar word set). Thus, by configuring and maintaining the similar word sets in the similar vocabulary, the same category or content can be expressed using a uniform label.

Description of the drawings

In order to illustrate the embodiments of the present application or the technical solutions in the prior art more clearly, the drawings used in the description of the embodiments or the prior art are briefly described below. Obviously, the drawings in the following description show only some embodiments of the present application, and those of ordinary skill in the art can obtain other drawings from these drawings without creative effort.

FIG. 1 is a flowchart of a label processing method according to an embodiment of the present application;

FIG. 2 is a flowchart of another label processing method according to an embodiment of the present application;

FIG. 3 is a flowchart of a specific process for searching a similar vocabulary for similar words of a label to be processed according to an embodiment of the present application;

FIG. 4 is a flowchart of a specific process for selecting a representative word from member words according to an embodiment of the present application;

FIG. 5 is a schematic structural diagram of a label processing apparatus according to an embodiment of the present disclosure;

FIG. 6 is a schematic structural diagram of a label processing device according to an embodiment of the present disclosure.

Detailed description

The embodiments of the present application can be applied to scenarios in which a label is set or output, for example, a server outputting labels for an Internet site, or an operator marking users with behavior preference labels.

The execution body of the label processing method described in this embodiment may be a server that sets or outputs a label.

The technical solutions in the embodiments of the present application are described clearly and completely below with reference to the drawings in the embodiments of the present application. Obviously, the described embodiments are only some, not all, of the embodiments of the present application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present application without creative effort fall within the protection scope of the present application.

A label processing method disclosed in the embodiment of the present application, as shown in FIG. 1 , includes the following steps:

S101: Acquire a label to be processed.

The tag to be processed can be a tag set by the user or a tag generated by the tag system.

S102: Determine whether the similar word set to which the label to be processed belongs can be selected from a similar vocabulary that includes similar word sets; if yes, execute S103; if no, execute S104.

In this embodiment, the similar word set to which the label to be processed belongs is a similar word set that includes the label to be processed or a similar word of the label to be processed.

Each similar word set in the similar vocabulary includes member words and a representative word, and the representative word of a set is selected from the member words of that set.

For example, Table 1 shows part of a similar vocabulary, including four similar word sets numbered 0001 to 0004, each of which includes member words and a representative word. In the similar word set numbered 0001, "Model" (模特), "Madou" (麻豆) and "model" are all member words, and "Model" (模特) is the representative word.

Table 1

Group number	Member word 1	Member word 2	Member word 3	…	Representative word
0001	模特 (model)	麻豆 (madou)	model	…	模特 (model)
0002	笑话 (joke)	冷笑话 (cold joke)		…	笑话 (joke)
0003	时尚 (fashion)	时髦 (fashionable)	前卫 (avant-garde)	…	时尚 (fashion)
0004	地瓜 (sweet potato)	红薯 (sweet potato)	番薯 (sweet potato)	…	红薯 (sweet potato)
…	…	…	…	…	…

Specifically, one of the member words of a set may be arbitrarily selected as the representative word of the set, or the representative word may be selected from the member words of the set according to other rules. Specific rules are described in the following embodiments.

In this embodiment, similar words are words with similar meanings, that is, synonyms; for their specific definition, refer to the prior art.
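As a minimal sketch of how a similar vocabulary such as Table 1 might be held in memory, the following Python fragment stores each similar word set with its member words and representative word and builds a word-to-set index. The names `SimilarWordSet`, `TABLE_1`, and `build_index` are illustrative assumptions and do not come from the application.

```python
from dataclasses import dataclass


@dataclass
class SimilarWordSet:
    """One similar word set: member words plus one representative word."""
    group_id: str
    members: list[str]
    representative: str

    def __post_init__(self) -> None:
        # The representative word is selected from the member words.
        if self.representative not in self.members:
            raise ValueError("representative must be one of the member words")


# Rows mirroring Table 1 (only the member words listed there are included).
TABLE_1 = [
    SimilarWordSet("0001", ["模特", "麻豆", "model"], "模特"),
    SimilarWordSet("0002", ["笑话", "冷笑话"], "笑话"),
    SimilarWordSet("0003", ["时尚", "时髦", "前卫"], "时尚"),
    SimilarWordSet("0004", ["地瓜", "红薯", "番薯"], "红薯"),
]


def build_index(sets: list[SimilarWordSet]) -> dict[str, SimilarWordSet]:
    """Map every member word to the set containing it (hypothetical helper)."""
    index: dict[str, SimilarWordSet] = {}
    for similar_set in sets:
        for word in similar_set.members:
            index[word] = similar_set
    return index
```

With such an index, checking whether a label already belongs to a set is a single dictionary lookup, for example `build_index(TABLE_1)["麻豆"].representative`.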

S103: The representative word of the similar word set to which the tag to be processed belongs is used as a substitute tag of the tag to be processed.

For example, if the label to be processed is "Madou", then because the similar word set numbered 0001 includes "Madou", the representative word "Model" of set 0001 is used as the substitute label.

S104: If the similar word set to which the label to be processed belongs does not exist in the similar vocabulary, add the label to be processed to the similar vocabulary, where the label to be processed is both the member word and the representative word of a new similar word set in the similar vocabulary.

That is to say, if neither the label to be processed nor any of its similar words is found in the similar vocabulary, the similar vocabulary contains neither the label nor a word with a similar meaning. In this case, the label to be processed is added to the similar vocabulary as a new word and becomes a member word of a new similar word set. Because the new similar word set contains no other words, the label to be processed also serves as its representative word.

For example, if the label to be processed is "art" and Table 1 contains no similar word of "art", "art" becomes the member word and representative word of a new similar word set numbered 0005. The updated similar vocabulary is shown in Table 2.

Table 2

Group number	Member word 1	Member word 2	Member word 3	…	Representative word
0001	模特 (model)	麻豆 (madou)	model	…	模特 (model)
0002	笑话 (joke)	冷笑话 (cold joke)	段子 (gag)	…	笑话 (joke)
0003	时尚 (fashion)	时髦 (fashionable)	前卫 (avant-garde)	…	时尚 (fashion)
0004	地瓜 (sweet potato)	红薯 (sweet potato)	番薯 (sweet potato)	…	红薯 (sweet potato)
0005	艺术 (art)	…	…	…	艺术 (art)

The purpose of S104 is to continuously enrich the words in the similar vocabulary, thus laying the foundation for the unification of labels.

It can be seen from the foregoing steps that the label processing method in this embodiment finds a home set for the label to be processed and replaces the label with the representative word of that set. Therefore, the same feature of the same category or content is always described using the same label, which avoids the problem of multiple synonyms describing the same feature.
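The flow of S101 to S104 can be sketched as follows. This is only an illustration under simplifying assumptions: the similar vocabulary is a plain dictionary keyed by group number, membership is tested by exact match only (the search for similar words is covered by a later embodiment), and the name `substitute_label` and the rule for numbering a new set are hypothetical.

```python
def substitute_label(label, database):
    """Return the substitute label for `label`, updating `database` if needed.

    `database` maps a group number to
    {"members": [...], "representative": ...} (assumed layout).
    """
    # S102: look for a similar word set that contains the label.
    for group in database.values():
        if label in group["members"]:
            # S103: the representative word replaces the label.
            return group["representative"]
    # S104: no set found, so the label starts a new set and represents it.
    # Assumed numbering rule: next integer after the largest existing number.
    new_id = f"{max(map(int, database), default=0) + 1:04d}"
    database[new_id] = {"members": [label], "representative": label}
    return label


database = {
    "0001": {"members": ["模特", "麻豆", "model"], "representative": "模特"},
    "0002": {"members": ["笑话", "冷笑话"], "representative": "笑话"},
    "0003": {"members": ["时尚", "时髦", "前卫"], "representative": "时尚"},
    "0004": {"members": ["地瓜", "红薯", "番薯"], "representative": "红薯"},
}
```

Calling `substitute_label("麻豆", database)` follows the S103 branch, while `substitute_label("艺术", database)` follows the S104 branch and adds set 0005, as in Table 2.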

Another label processing method disclosed in the embodiments of the present application differs from the above embodiment in that it focuses on how to determine the similar words of a label to be processed in the similar vocabulary. As shown in FIG. 2, the label processing method of this embodiment includes the following steps:

S201: Acquire a label to be processed;

S202: Search the similar vocabulary for the label to be processed; if the similar vocabulary includes the label to be processed, execute S203 and S206; otherwise, execute S204;

S203: Use the similar word set that includes the label to be processed as the similar word set to which the label to be processed belongs.

S204: Search the similar vocabulary for similar words of the label to be processed; if a similar word of the label to be processed is found in the similar vocabulary, execute S205; otherwise, execute S208;

The specific process of searching the similar vocabulary for similar words of the label to be processed is shown in FIG. 3 and includes the following steps:

S301: Find, in the corpus, the words that appear together with the label to be processed in a corpus with a frequency greater than a first threshold, and record them as first-type related words.

In this embodiment, "related words" are words whose meanings are related. A collection of texts is called a corpus, and a corpus generally refers to material used for text analysis. Typically, a corpus can be a set of texts obtained by crawling Internet websites with a crawler. The first-type related words may be determined from the corpus by using an existing association algorithm; a first-type related word is a word related in meaning to the label to be processed.

S302: Find, in the corpus, the words that appear together with a first-type related word in a corpus with a frequency greater than a second threshold, and record them as second-type related words.

In this embodiment, the frequency with which related words A and B appear together in a corpus = (number of times A and B appear together) / (total number of occurrences of A). The frequency with which related words B and A appear together in a corpus = (number of times B and A appear together) / (total number of occurrences of B). The measure therefore depends on which word is taken as the reference.
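The frequency defined above can be illustrated with a short Python sketch. It rests on assumptions not stated in the text: each corpus text is modelled as a set of words, and "the total number of occurrences of A" is read as the number of texts that contain A. The function name `cooccurrence_frequency` and the toy documents are hypothetical.

```python
def cooccurrence_frequency(word_a, word_b, documents):
    """Frequency with which word_b appears together with word_a.

    Computed as (texts containing both) / (texts containing word_a),
    which is one possible reading of the definition above.
    """
    docs_with_a = [doc for doc in documents if word_a in doc]
    if not docs_with_a:
        return 0.0  # word_a never occurs, so nothing co-occurs with it
    both = sum(1 for doc in docs_with_a if word_b in doc)
    return both / len(docs_with_a)


# Toy corpus: each text is reduced to its set of words.
documents = [
    {"model", "catwalk", "fashion"},
    {"model", "catwalk"},
    {"catwalk", "magazine"},
    {"catwalk"},
]
```

In this toy corpus the frequency for the pair ("model", "catwalk") differs from that for ("catwalk", "model"), because the denominator changes with the reference word.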

S303: Count the frequency with which each second-type related word appears together with the label to be processed in a corpus.

S304: Use the second-type related words whose counted frequency is less than a third threshold as the similar words of the label to be processed.

The steps shown in FIG. 3 involve three corpus analyses, namely:

Through the first corpus analysis, the words related in meaning to the label to be processed, that is, the first-type related words, are obtained. Because the first-type related words are related to the label to be processed, the similar words of the label to be processed are likely to be related to the first-type related words as well. For example, if the label to be processed is "model" and a first-type related word is "T stage", then "T stage" is also a related word of "Madou", which is similar to "model".

It can be seen that the related words of the first-type related words may include similar words of the label to be processed. Therefore, the second corpus analysis finds the related words of the first-type related words, that is, the second-type related words, in order to find the similar words of the label to be processed.

Similar words are unlikely to appear together in one corpus. For example, "model" and "Madou" are unlikely to appear in the same corpus, because a text is more likely to use a single word ("model" or "Madou") consistently to express the same meaning. Therefore, the third corpus analysis counts the frequency with which each second-type related word appears together with the label to be processed in a corpus, and takes the words whose frequency is less than a certain threshold as the similar words of the label to be processed.

For example, the process of searching for similar words of "Madou" using the above method is as follows. Word segmentation of the corpus yields the candidate first-type related words {"T stage", "fashion", "girl", "magazine", "model"}. Assume that the frequencies with which these words appear together with "Madou" in the same corpus are: "T stage" 0.50, "fashion" 0.43, "girl" 0.10, "magazine" 0.17, and "model" 0.09. The words whose frequency is greater than 0.2 form the related word set C = {"T stage", "fashion"}. Related-word analysis is then performed on "T stage" and on "fashion" separately, and the words obtained are put into one set, E = {"model", "Model", "girl"}. The frequency with which each word in E appears together with "Madou" in the same corpus is then calculated; assume "model" 0.11, "Model" 0.60, and "girl" 0.70. Threshold screening is applied again (the threshold is assumed to be 0.5): "Model" and "girl" are filtered out, leaving {"model"}, so "model" is the similar word of "Madou".
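The two threshold screenings in this example can be reproduced with a few lines of Python. The sketch below uses the frequencies assumed in the example; because the example gives only the resulting set E for the second analysis, not the underlying frequencies, E is taken as given. The helper names `above` and `below` are illustrative and not part of the application.

```python
def above(freqs, threshold):
    """Keep the words whose frequency is greater than the threshold."""
    return [word for word, freq in freqs.items() if freq > threshold]


def below(freqs, threshold):
    """Keep the words whose frequency is less than the threshold."""
    return [word for word, freq in freqs.items() if freq < threshold]


# First analysis (S301): co-occurrence with "Madou", first threshold 0.2.
first_pass = {"T stage": 0.50, "fashion": 0.43, "girl": 0.10,
              "magazine": 0.17, "model": 0.09}
first_type = above(first_pass, 0.2)

# Second analysis (S302): only the resulting set E is stated in the example,
# so it is copied here rather than computed.
second_type = ["model", "Model", "girl"]

# Third analysis (S303/S304): co-occurrence with "Madou", third threshold 0.5.
third_pass = {"model": 0.11, "Model": 0.60, "girl": 0.70}
similar_words = below({w: third_pass[w] for w in second_type}, 0.5)
```

Note the opposite directions of the two filters: related words are kept when they co-occur often with the label, whereas similar words are kept when they co-occur rarely with it.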

S205: Use the similar word set that includes a similar word of the label to be processed as the similar word set to which the label to be processed belongs.

S206: Use the representative word of the similar word set to which the label to be processed belongs as the substitute label of the label to be processed.

S207: Add the label to be processed to a similar word set including similar words of the label to be processed.

It should be noted that S207 may also be executed before S205 or S206.

S208: Add the to-be-processed tag into a similar vocabulary, and the to-be-processed tag is a member word and a representative word of a new similar word set in a similar vocabulary.

The label processing method described in this embodiment uses the related words as an intermediary to determine the similar words of the tags to be processed, thereby implementing the unification of the labels.

It should be noted that this embodiment first searches the similar vocabulary for the label to be processed and, only if the label is not included, then searches the similar vocabulary for similar words of the label. The purpose of this search order is to improve search efficiency: the simple search operation is performed first, and the complex search operation is performed only if the simple one finds nothing. Other search methods can also be used, which is not limited herein.
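The search order described above (S201 to S208) can be sketched as follows. This is an illustration under assumptions: the similar vocabulary is a list of dictionaries, and the corpus analysis of FIG. 3 is replaced by a caller-supplied function `find_similar_words`, which is a hypothetical stand-in, as is the name `process_label`.

```python
def process_label(label, sets, find_similar_words):
    """Return the substitute label; `sets` is updated in place.

    sets: list of {"members": [...], "representative": ...} (assumed layout).
    find_similar_words(label): returns candidate similar words
    (hypothetical stand-in for the corpus analysis of FIG. 3).
    """
    def set_containing(word):
        return next((s for s in sets if word in s["members"]), None)

    # S202/S203: the cheap exact lookup comes first.
    home = set_containing(label)
    if home is None:
        # S204: only now run the more expensive similar-word search.
        for word in find_similar_words(label):
            home = set_containing(word)
            if home is not None:
                # S205 and S207: adopt that set and add the label to it.
                home["members"].append(label)
                break
    if home is None:
        # S208: the label founds a new set and is its representative word.
        sets.append({"members": [label], "representative": label})
        return label
    # S206: the representative word becomes the substitute label.
    return home["representative"]


sets = [{"members": ["模特", "model"], "representative": "模特"}]


def toy_finder(label):
    # Toy assumption: "model" is known to be a similar word of "麻豆".
    return ["model"] if label == "麻豆" else []
```

Passing the finder in as a parameter keeps the lookup logic independent of how similar words are actually discovered.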

It should be noted that, for the similar vocabulary in the above embodiments, the representative word can be obtained by an improvement on the existing TF-IDF (term frequency-inverse document frequency) algorithm; that is, the representative word of any similar word set is determined according to the method shown in FIG. 4:

S401: Count the number of times each member word is historically regarded as a to-be-processed tag;

That is, each time the label to be processed is a given member word, the count for that member word is incremented by one.

S402: Count an inverse document frequency (IDF) value of each member word;

S403: For each member word, calculate the product of the number of times it has historically been used as a to-be-processed label and its IDF value;

S404: The member word with the largest product is used as a representative word.

The process of determining the representative word may be performed before the label processing process, or may be performed during the label processing process before the representative word is used (for example, before S103 or S205), which is not limited herein.
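Steps S401 to S404 can be sketched in Python as follows. The text does not specify how the IDF value is computed, so the smoothed form used here is an assumption, as are the function name `select_representative`, the modelling of corpus texts as word sets, and the toy counts.

```python
import math


def select_representative(members, usage_counts, documents):
    """Pick the member word with the largest (usage count x IDF) product.

    usage_counts: times each member word was a label to be processed (S401).
    documents: list of word sets used to estimate IDF (S402, assumed input).
    """
    n_docs = len(documents)
    best_word, best_score = None, float("-inf")
    for word in members:
        doc_freq = sum(1 for doc in documents if word in doc)
        # Assumed smoothed IDF; the text does not fix the exact formula.
        idf = math.log((1 + n_docs) / (1 + doc_freq)) + 1
        score = usage_counts.get(word, 0) * idf  # S403
        if score > best_score:
            best_word, best_score = word, score
    return best_word  # S404


members = ["时尚", "时髦", "前卫"]
documents = [{"时尚", "时髦"}, {"时尚"}, {"时髦"}, {"前卫"}]
usage_counts = {"时尚": 40, "时髦": 25, "前卫": 5}
```

Multiplying by the IDF value means that, between two member words used equally often as labels, the one that is rarer in the corpus scores higher.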

Corresponding to the above method embodiments, the embodiments of the present application further disclose a label processing apparatus. As shown in FIG. 5, the apparatus includes: an obtaining module 501, a processing module 502, and a substitution module 503.

The obtaining module 501 is configured to acquire a label to be processed.

The processing module 502 is configured to select, from a similar vocabulary, the similar word set to which the to-be-processed label belongs, where the similar vocabulary includes similar word sets, each similar word set includes member words and a representative word, and the representative word is selected from the member words.

The substitution module 503 is configured to use a representative word of a similar word set to which the tag to be processed belongs as a substitute tag of the to-be-processed tag.

Optionally, the apparatus in this embodiment may further include a determining module 504, configured to: count the number of times each member word has historically been used as a label to be processed; calculate the inverse document frequency (IDF) value of each member word; for each member word, calculate the product of the number of times it has historically been used as a label to be processed and its IDF value; and use the member word with the largest product as the representative word.

Optionally, the processing module 502 is further configured to: after searching the similar vocabulary for a similar word of the to-be-processed label, add the to-be-processed label to the similar word set that includes the similar word of the to-be-processed label.

Optionally, the processing module 502 is further configured to: if the similar word set to which the to-be-processed label belongs does not exist in the similar vocabulary, add the to-be-processed label to the similar vocabulary, where the to-be-processed label is both the member word and the representative word of a new similar word set in the similar vocabulary.

Specifically, the processing module selects, from the similar vocabulary, the similar word set to which the to-be-processed label belongs as follows: search the similar vocabulary for the to-be-processed label; if the similar vocabulary includes the to-be-processed label, use the similar word set that includes the to-be-processed label as the similar word set to which the to-be-processed label belongs; if the similar vocabulary does not include the to-be-processed label, search the similar vocabulary for a similar word of the to-be-processed label, and use the similar word set that includes the similar word of the to-be-processed label as the similar word set to which the to-be-processed label belongs.

Further, the processing module may search the similar vocabulary for similar words of the to-be-processed label as follows: find first-type related words in a corpus, the first-type related words being words that appear together with the to-be-processed label in a corpus with a frequency greater than a first threshold; find second-type related words in the corpus, the second-type related words being words that appear together with the first-type related words in a corpus with a frequency greater than a second threshold; and count the frequency with which each second-type related word appears together with the to-be-processed label in a corpus, and use the second-type related words whose counted frequency is less than a third threshold as the similar words of the to-be-processed label.

The label processing apparatus according to this embodiment may be disposed on a server used for data processing, such as a web server, and helps convert different labels that express the same category or content into a uniform label.

The embodiment of the present application further discloses a label processing device, as shown in FIG. 6, comprising: a processor 601, a memory 602, a communication interface 603, and a bus 604.

The processor 601, the memory 602, and the communication interface 603 communicate via the bus 604. The communication interface 603 is used to implement communication between the tag processing device and other devices.

The memory 602 can be a read-only memory (ROM), a static storage device, a dynamic storage device, or a random access memory (RAM). The memory 602 can store program code of the operating system and other applications as well as application data. The program code stored in the memory 602 is executed by the processor 601.

The processor 601 can be a general-purpose central processing unit (CPU), a microprocessor, an application-specific integrated circuit (ASIC), or one or more integrated circuits for executing related programs.

Bus 604 can include a path for communicating information between various components, such as processor 601, memory 602, and communication interface 603.

The processor 601 executes the program code stored in the memory 602 to implement the following functions: acquiring a label to be processed; selecting, from a similar vocabulary, the similar word set to which the to-be-processed label belongs, where the similar vocabulary includes similar word sets, each similar word set includes member words and a representative word, and the representative word is selected from the member words; and using the representative word of the similar word set to which the to-be-processed label belongs as the substitute label of the to-be-processed label.

For the specific implementation of the above functions of the processor 601, refer to the steps shown in FIG. 1, FIG. 2, FIG. 3, and FIG. 4.

The label processing device described in this embodiment helps convert different labels that express the same category or content into a uniform label.

The functions described in the methods of the embodiments of the present application, if implemented in the form of software functional units and sold or used as separate products, may be stored in a storage medium readable by a computing device. Based on this understanding, the part of the embodiments of the present application that contributes to the prior art, or a part of the technical solution, can be embodied in the form of a software product. The software product is stored in a storage medium and includes a plurality of instructions for causing a computing device (which may be a personal computer, a server, a mobile computing device, a network device, or the like) to perform all or part of the steps of the methods described in the various embodiments of the present application. The foregoing storage medium includes a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, and the like.

The various embodiments in the specification are described in a progressive manner, and each embodiment focuses on differences from other embodiments, and the same or similar parts of the respective embodiments may be referred to each other.

The above description of the disclosed embodiments enables those skilled in the art to make or use the application. Various modifications to these embodiments are obvious to those skilled in the art, and the general principles defined herein may be implemented in other embodiments without departing from the spirit or scope of the application. Therefore, the application is not limited to the embodiments shown herein, but is to be accorded the broadest scope of the principles and novel features disclosed herein.

Claims

A label processing method, comprising:

acquiring a tag to be processed;

selecting, from a similar vocabulary, the similar word set to which the to-be-processed tag belongs, wherein the similar vocabulary includes similar word sets, each similar word set includes member words and a representative word, and the representative word is selected from the member words; and

using the representative word of the similar word set to which the to-be-processed tag belongs as a substitute tag of the to-be-processed tag.
The method according to claim 1, wherein the specific process of selecting the representative word from the member words comprises:

counting the number of times each member word has historically been used as a to-be-processed tag;

counting the inverse document frequency (IDF) value of each member word;

for each member word, calculating the product of the number of times it has historically been used as a to-be-processed tag and its IDF value; and

using the member word with the largest product as the representative word.
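For illustration only, the selection of the representative word can be sketched in Python. The text does not specify how the IDF value is computed, so the sketch assumes the common form log(N / (1 + df)) over a corpus of documents; the function names, the usage counts, and the sample corpus are hypothetical.

```python
import math


def idf(word, documents):
    """Inverse document frequency, assuming the form log(N / (1 + df))."""
    doc_freq = sum(1 for doc in documents if word in doc)
    return math.log(len(documents) / (1 + doc_freq))


def choose_representative(members, usage_counts, documents):
    """Pick the member word with the largest (historical usage count x IDF)."""
    return max(members, key=lambda w: usage_counts.get(w, 0) * idf(w, documents))


# Hypothetical corpus: each document is modelled as a set of words.
documents = [
    {"film", "review"},
    {"movie", "ticket"},
    {"movie", "star"},
    {"movie", "night"},
    {"movie", "poster"},
    {"movie", "music"},
    {"news"},
    {"sports"},
]
# Hypothetical counts of how often each member word was submitted as a tag.
usage_counts = {"film": 10, "movie": 12, "flick": 1}

representative = choose_representative(["film", "movie", "flick"], usage_counts, documents)
```

In this sample, "movie" is used as a tag slightly more often than "film" but appears in many more documents, so its IDF value is lower and "film" has the largest product.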
The method according to any one of claims 1 to 2, wherein the selecting a similar word set to which the to-be-processed tag belongs from the similar vocabulary comprises:

searching the similar vocabulary for the to-be-processed tag;

if the similar vocabulary includes the to-be-processed tag, using the similar word set that includes the to-be-processed tag as the similar word set to which the to-be-processed tag belongs; and

if the similar vocabulary does not include the to-be-processed tag, searching the similar vocabulary for a similar word of the to-be-processed tag, and using the similar word set that includes the similar word of the to-be-processed tag as the similar word set to which the to-be-processed tag belongs.
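For illustration only, the two-stage lookup can be sketched in Python. The similar vocabulary is assumed to be a list of Python sets, and `find_similar_words` is a hypothetical callable standing in for whatever similar-word search is used; neither name comes from the text.

```python
def select_similar_word_set(tag, similar_vocabulary, find_similar_words):
    """Return the similar word set to which `tag` belongs, or None."""
    # Stage 1: the tag itself is already a member of some set.
    for word_set in similar_vocabulary:
        if tag in word_set:
            return word_set
    # Stage 2: fall back to a set containing a similar word of the tag.
    for similar in find_similar_words(tag):
        for word_set in similar_vocabulary:
            if similar in word_set:
                return word_set
    return None


# Hypothetical sample data and a stand-in similar-word lookup.
vocabulary = [{"movie", "film"}, {"phone", "handset"}]
known_similar = {"cinema": ["film"]}


def lookup(tag):
    return known_similar.get(tag, [])


direct = select_similar_word_set("film", vocabulary, lookup)
fallback = select_similar_word_set("cinema", vocabulary, lookup)
```

Here "film" is found directly, while "cinema" is not in the vocabulary and is resolved through its similar word "film" to the same set.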
The method according to claim 3, wherein said searching for similar words of said to-be-processed tag from said similar vocabulary comprises:

searching a corpus for first-type related words, the first-type related words being words that appear together with the to-be-processed tag in the corpus with a frequency greater than a first threshold;

searching the corpus for second-type related words, the second-type related words being words that appear together with the first-type related words in the corpus with a frequency greater than a second threshold; and

counting the frequency with which each second-type related word and the to-be-processed tag appear together in the corpus, and using the second-type related words whose counted frequency is less than a third threshold as the similar words of the to-be-processed tag.
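For illustration only, the three steps can be sketched in Python. The text does not define how co-occurrence frequency is measured, so the sketch assumes it is the number of documents in which two words both appear; the function names, thresholds, and sample corpus are hypothetical.

```python
from collections import Counter


def cooccurrence_counts(word, corpus):
    """Count, for every other word, the documents in which it appears with `word`."""
    counts = Counter()
    for doc in corpus:
        if word in doc:
            counts.update(w for w in doc if w != word)
    return counts


def find_similar_words(tag, corpus, first_threshold, second_threshold, third_threshold):
    """Return second-type related words that rarely co-occur with `tag` itself."""
    tag_counts = cooccurrence_counts(tag, corpus)
    # Step 1: words that frequently appear together with the tag.
    first_type = {w for w, c in tag_counts.items() if c > first_threshold}
    # Step 2: words that frequently appear together with a first-type word.
    second_type = set()
    for related in first_type:
        for w, c in cooccurrence_counts(related, corpus).items():
            if c > second_threshold and w != tag:
                second_type.add(w)
    # Step 3: keep those whose co-occurrence with the tag is below the third threshold.
    return {w for w in second_type if tag_counts[w] < third_threshold}


# Hypothetical corpus: each document is modelled as a set of words.
corpus = [
    {"film", "director", "actor"},
    {"film", "director", "premiere"},
    {"film", "actor"},
    {"movie", "director", "actor"},
    {"movie", "director"},
    {"movie", "actor", "ticket"},
]

similar = find_similar_words("film", corpus, 1, 1, 1)
```

In this sample, "film" and "movie" never appear in the same document but share the context words "director" and "actor", so "movie" is the only word that passes all three thresholds.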
The method according to claim 3, further comprising, after searching the similar vocabulary for the similar word of the to-be-processed tag:

adding the to-be-processed tag to the similar word set that includes the similar word of the to-be-processed tag.
The method according to claim 1 or 2, further comprising:

if the similar word set to which the to-be-processed tag belongs does not exist in the similar vocabulary, adding the to-be-processed tag to the similar vocabulary, the to-be-processed tag serving as both the member word and the representative word of a new similar word set in the similar vocabulary.
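For illustration only, this fallback can be sketched in Python. The dictionary layout of a similar word set and the function name `substitute_or_register` are hypothetical choices made for the sketch.

```python
def substitute_or_register(tag, similar_vocabulary):
    """Return the substitute tag, creating a new similar word set if none exists."""
    for word_set in similar_vocabulary:
        if tag in word_set["members"]:
            return word_set["representative"]
    # No set found: the tag becomes the sole member and the representative
    # word of a new similar word set.
    similar_vocabulary.append({"members": {tag}, "representative": tag})
    return tag


# Hypothetical sample vocabulary with one similar word set.
vocabulary = [{"members": {"movie", "film"}, "representative": "movie"}]

known = substitute_or_register("film", vocabulary)
unknown = substitute_or_register("car", vocabulary)
```

After the second call the vocabulary holds two sets, and a later tag that is found to be similar to "car" can be resolved against the new set.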
A processing device for a tag, comprising:

an obtaining module, configured to obtain a tag to be processed;

a processing module, configured to select, from a similar vocabulary, the similar word set to which the to-be-processed tag belongs, wherein the similar vocabulary includes similar word sets, each similar word set includes member words and a representative word, and the representative word is selected from the member words; and

an alternative module, configured to use the representative word of the similar word set to which the to-be-processed tag belongs as a substitute tag of the to-be-processed tag.
The device according to claim 7, further comprising:

a determining module, configured to: count the number of times each member word has historically been used as a to-be-processed tag; count the inverse document frequency (IDF) value of each member word; for each member word, calculate the product of the number of times it has historically been used as a to-be-processed tag and its IDF value; and use the member word with the largest product as the representative word.
The device according to claim 7 or 8, wherein the processing module being configured to select, from the similar vocabulary, the similar word set to which the to-be-processed tag belongs comprises:

the processing module is specifically configured to: search the similar vocabulary for the to-be-processed tag; if the similar vocabulary includes the to-be-processed tag, use the similar word set that includes the to-be-processed tag as the similar word set to which the to-be-processed tag belongs; and if the similar vocabulary does not include the to-be-processed tag, search the similar vocabulary for a similar word of the to-be-processed tag, and use the similar word set that includes the similar word of the to-be-processed tag as the similar word set to which the to-be-processed tag belongs.
The device according to claim 9, wherein the processing module being configured to search the similar vocabulary for the similar word of the to-be-processed tag comprises:

the processing module is specifically configured to: search a corpus for first-type related words, the first-type related words being words that appear together with the to-be-processed tag in the corpus with a frequency greater than a first threshold; search the corpus for second-type related words, the second-type related words being words that appear together with the first-type related words in the corpus with a frequency greater than a second threshold; and count the frequency with which each second-type related word and the to-be-processed tag appear together in the corpus, and use the second-type related words whose counted frequency is less than a third threshold as the similar words of the to-be-processed tag.
The device according to claim 9, wherein the processing module is further configured to:

after searching the similar vocabulary for the similar word of the to-be-processed tag, add the to-be-processed tag to the similar word set that includes the similar word of the to-be-processed tag.
The device according to claim 7 or 8, wherein the processing module is further configured to:

if the similar word set to which the to-be-processed tag belongs does not exist in the similar vocabulary, add the to-be-processed tag to the similar vocabulary, the to-be-processed tag serving as both the member word and the representative word of a new similar word set in the similar vocabulary.