CN114969294A

CN114969294A - Expansion method of sound-proximity sensitive words

Info

Publication number: CN114969294A
Application number: CN202210613700.2A
Authority: CN
Inventors: 刘堡萱; 文成明
Original assignee: Shenzhen Baichuan Shuan Technology Co ltd
Current assignee: Shenzhen Baichuan Shuan Technology Co ltd
Priority date: 2022-05-31
Filing date: 2022-05-31
Publication date: 2022-08-30

Abstract

The invention provides a method for expanding near-sound sensitive words, comprising the following steps: combining the pinyin in a legal pinyin table pairwise; obtaining the edit distance of each pairwise combination, extracting near-pinyin groups according to the edit distance, and constructing a pinyin-near-pinyin table; replacing the pinyin of any character of a sensitive word in the sensitive word database based on the pinyin-near-pinyin table, mapping the replaced pinyin to characters based on a pinyin-Chinese character table, and constructing candidate near-sound words; and pre-judging the candidate near-sound words to supplement and expand the sensitive word database. By using the existing sensitive word database and the near-sound tables, the near-sound variants of sensitive words that black-market operators (the "black industry") may use are predicted in advance. This addresses the high likelihood of omissions and the long turnaround of the whole review pipeline, and further improves the effectiveness of the extracted keywords.

Description

Expansion method of sound-proximity sensitive words

Technical Field

The invention relates to the technical field of vocabulary expansion, and in particular to a method for expanding near-sound sensitive words.

Background

In internet community scenarios, a single message is often received by a very large number of people. As a result, some parties post text or image messages that steer users toward illegal platforms. Such messages are generally regarded as spam content, and their publishers are generally regarded as black-market operators (the "black industry"). Because texts sent this way often share similar keywords, those keywords are treated as sensitive words, and illegal content can be handled promptly by checking whether a text contains them, thereby maintaining the community atmosphere; this is content anti-spam. Publishers of illegal content, however, often evade the extracted sensitive words by trial and error, which reduces the number of sensitive-word hits. How to update the sensitive word database is therefore crucial to content anti-spam. Most platforms currently supplement the sensitive word database by manual extraction.

Although supplementing the sensitive word database by manually extracting keywords achieves a certain anti-spam effect, it does not solve the underlying problem. Each sensitive word is obtained by manual review and extraction only after the black industry has already sent it out, so the whole process takes a long time, and there is no way to guarantee that every illegal text is identified, so many are missed. Even when keywords are extracted successfully, the newly added keywords are not necessarily effective, because the black industry updates its techniques quickly.

Therefore, the invention provides an expansion method of the sound-proximity sensitive words.

Disclosure of Invention

The invention provides a method for expanding near-sound sensitive words. It uses the existing sensitive word database and the near-sound tables to predict in advance the near-sound variants of sensitive words that the black industry may use, so as to address the high likelihood of omissions and the long turnaround of the whole review pipeline and further improve the effectiveness of the extracted keywords.

The invention provides a method for expanding near-sound sensitive words, comprising the following steps:

step 1: combining the pinyin in the legal pinyin table pairwise;

step 2: obtaining the edit distance of each pairwise combination, extracting near-pinyin groups according to the edit distance, and constructing a pinyin-near-pinyin table;

step 3: replacing the pinyin of any character of a sensitive word in the sensitive word database based on the pinyin-near-pinyin table, mapping the replaced pinyin to characters based on the pinyin-Chinese character table, and constructing candidate near-sound words;

step 4: pre-judging the candidate near-sound words to supplement and expand the sensitive word database.

In a possible implementation manner, obtaining an edit distance of each pairwise combination result, extracting a near-pinyin group according to the edit distance, and constructing a pinyin-near-pinyin table, including:

calculating the edit distance of the pairwise combination result;

judging whether the editing distance is greater than a preset distance;

if so, judging that the first pinyin and the second pinyin in the corresponding combination are not near pinyin;

otherwise, judging that the first pinyin and the second pinyin in the corresponding combination are near pinyin;

extracting all groups of mutually near pinyin, and constructing the pinyin-near-pinyin table.
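The threshold test above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function names, the toy pinyin list, and the choice of a preset distance of 1 for the toy data are all assumptions (the embodiments mention 2 as one possible preset distance).

```python
from itertools import combinations


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with a two-row dynamic programme."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute ca -> cb
        prev = curr
    return prev[-1]


def build_near_pinyin_table(pinyins, preset_distance):
    """Map each pinyin to the pinyins within preset_distance of it."""
    table = {p: set() for p in pinyins}
    for first, second in combinations(pinyins, 2):
        if edit_distance(first, second) > preset_distance:
            continue  # judged not to be near pinyin
        table[first].add(second)
        table[second].add(first)
    return table


# Toy subset of a toneless pinyin list (hypothetical data).
toy_pinyins = ["li", "ni", "zhi", "zi", "shuang"]
near_table = build_near_pinyin_table(toy_pinyins, preset_distance=1)
```

With the toy list, "zhi" is paired only with "zi", while "shuang" has no near pinyin because every other entry differs from it by more than one edit.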

In a possible implementation manner, pre-judging the candidate near-sound words to supplement and expand the sensitive word database includes:

acquiring a manual judgment instruction for a candidate near-sound word, and judging whether the judgment instruction is related to a preset sensitive condition;

when the judgment instruction is related to the preset sensitive condition, automatically adding the candidate sensitive word to the sensitive word database;

and when the judgment instruction is not related to the preset sensitive condition, discarding the candidate sensitive word.

In one possible implementation, the legal pinyin table and the pinyin-Chinese character table contain only the pinyin spellings, without tones.

In one possible implementation, combining the pinyin in the legal pinyin table pairwise comprises:

performing first classification according to the pinyin length of each pinyin;

performing second classification on the pinyin in each first classification result according to the pinyin initial type;

obtaining pinyin sets with different pinyin lengths based on the first classification result and the second classification result;

and combining the pinyin in the pinyin set in pairs.

In a possible implementation manner, when the determination instruction is related to a preset sensitive condition, automatically supplementing the candidate sensitive word to a sensitive word database, including:

capturing the manual judgment instruction for the candidate near-sound word, and parsing the judgment instruction to obtain a plurality of sub-judgment conditions;

extracting the current judgment difference of each sub-judgment condition;

judging, based on the preset sensitive condition, whether each current judgment difference is within a preset difference range, and if so, judging that the corresponding sub-judgment condition is qualified;

and when all the sub-judgment conditions are qualified, determining that the judgment instruction is related to the preset sensitive condition.
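A minimal sketch of this check, under stated assumptions: each sub-judgment condition is reduced to a numeric actual result, a numeric preset standard, and a tolerance, and the instruction counts as related only when every sub-judgment condition qualifies. The class, field names, and scores are hypothetical; the source does not specify how reviewers' judgments are encoded.

```python
from dataclasses import dataclass


@dataclass
class SubCondition:
    name: str
    actual: float    # actual judgment result given by the reviewer
    standard: float  # preset judgment standard
    max_diff: float  # preset difference range (hypothetical encoding)

    def qualified(self) -> bool:
        # Current judgment difference = actual result vs. preset standard.
        return abs(self.actual - self.standard) <= self.max_diff


def is_related(conditions) -> bool:
    """Related only if there is at least one condition and all qualify."""
    return bool(conditions) and all(c.qualified() for c in conditions)


def review_candidate(candidate, conditions, database) -> bool:
    """Add the candidate to the database when related; otherwise discard."""
    if is_related(conditions):
        database.add(candidate)
        return True
    return False


database = set()
accepted = review_candidate(
    "candidate-1",
    [SubCondition("readability", 0.9, 1.0, 0.2),
     SubCondition("clear meaning", 0.95, 1.0, 0.2)],
    database,
)
rejected = review_candidate(
    "candidate-2",
    [SubCondition("readability", 0.9, 1.0, 0.2),
     SubCondition("clear meaning", 0.4, 1.0, 0.2)],
    database,
)
```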

In a possible implementation manner, in the process of pre-judging the candidate near-sound words to supplement and expand the sensitive word database, the method further includes:

tracing the sensitive word database, extracting the initial historical sensitive words related to the candidate near-sound words, constructing an initial derivative chain for each initial historical sensitive word, and optimizing the initial derivative chains with the results of crawling sensitive words from the network to obtain historical derivative chains;

judging whether the initial sensitive word in each historical derivative chain contains alphabetic characters, and regarding the historical derivative chains whose initial sensitive word contains alphabetic characters as a first class; for the historical derivative chains whose initial sensitive word contains no alphabetic characters, judging again whether the remaining historical sensitive words in the chain contain alphabetic characters;

according to the result of the second judgment, regarding the historical derivative chains whose remaining sensitive words contain no alphabetic characters as a second class, and regarding the remaining historical derivative chains as a third class;

determining an initial historical pinyin corresponding to the initial historical sensitive word in the historical derivative chain and a modified historical pinyin after each derivative, and acquiring historical pinyin tracing information;

counting historical search time sets and historical search frequency sets of different historical sensitive words in the same historical derivative chain, and constructing a first curve graph of the same historical derivative chain;

acquiring historical derivative total time length of the same historical derivative chain and derivative time length of each derivative stage according to the historical pinyin tracing information, and constructing to obtain a second graph;

pre-analyzing the first graph and the second graph based on a curve analysis model, and meanwhile, obtaining first sensitivity contribution degrees corresponding to different history sensitive words in the same history derivative chain and second sensitivity contribution degrees corresponding to the same history derivative chain by combining with the classification type of the same history derivative chain;

extracting first sensitive words with a first sensitivity contribution degree greater than a first preset sensitivity contribution degree from the same historical derivative chain, and sequencing all the first sensitive words in the same historical derivative chain according to the first sensitivity contribution degree;

when the second sensitivity contribution degree is greater than a second preset sensitivity degree, establishing a first mapping relation between the corresponding first sensitive word and the corresponding same historical derivative chain according to the sorting result;

otherwise, establishing a second mapping relation of the corresponding first sensitive words according to the sorting result;

based on a sensitive change capturing model, capturing and analyzing all the first mapping relations and the second mapping relations, outputting to obtain a sensitive change rule, and constructing a sensitive vocabulary auxiliary change list based on the sensitive change rule;

determining a vocabulary adjustment type for the candidate near-sound words based on the sensitive vocabulary auxiliary change list;

and determining the corresponding adjusted near-sound words based on the vocabulary adjustment type.
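The thresholding, sorting, and mapping steps in the list above can be sketched as follows. The source does not say how the mapping relations are represented or in which direction the words are sorted, so the dictionary layout, the descending order, and all names and numbers here are assumptions made only for illustration.

```python
def select_first_sensitive_words(contributions, first_threshold):
    """Keep words whose first sensitivity contribution exceeds the threshold,
    sorted from highest to lowest contribution (ordering is an assumption)."""
    kept = [(w, b) for w, b in contributions.items() if b > first_threshold]
    kept.sort(key=lambda item: item[1], reverse=True)
    return [w for w, _ in kept]


def build_mapping(chain_id, contributions, chain_contribution,
                  first_threshold, second_threshold):
    """Return a first mapping (words tied to the chain) when the chain's
    second sensitivity contribution exceeds its threshold, otherwise a
    second mapping (words only)."""
    ranked = select_first_sensitive_words(contributions, first_threshold)
    if chain_contribution > second_threshold:
        return {"kind": "first", "chain": chain_id, "words": ranked}
    return {"kind": "second", "words": ranked}


# Hypothetical contribution degrees for one historical derivative chain.
toy_contributions = {"a": 0.2, "b": 0.9, "c": 0.5}
strong_chain = build_mapping("chain-1", toy_contributions, 0.7, 0.3, 0.6)
weak_chain = build_mapping("chain-1", toy_contributions, 0.4, 0.3, 0.6)
```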

In one possible implementation manner, obtaining a first sensitivity contribution degree corresponding to different history sensitive words in the same history derivative chain and a second sensitivity contribution degree corresponding to the same history derivative chain includes:

calculating the first sensitivity contribution degree $B_i(t, p)$ of the different history sensitive words in the corresponding same historical derivative chain according to the following formula:

[formula not reproduced in the source text]

where $\sum t_i$ denotes the total search duration of the historical search time set of the $i$-th history sensitive word in the corresponding same historical derivative chain; $t_i$ denotes the historical search time set of the $i$-th history sensitive word; $\sum p_i$ denotes the total search frequency of the historical search frequency set of the $i$-th history sensitive word in the corresponding same historical derivative chain; $p_i$ denotes the historical search frequency set of the $i$-th history sensitive word; $g_i$ denotes the counted search frequency of the $i$-th history sensitive word; $T_i$ denotes the derivative duration corresponding to the $i$-th history sensitive word in the corresponding same historical derivative chain; and two further symbols, not legible in the source text, denote respectively the historical derivative total duration of the corresponding same historical derivative chain and the counted total search frequency of all history sensitive words in that chain;

calculating the second sensitivity contribution degree $Y$ of the corresponding same historical derivative chain according to the following formula:

[formula not reproduced in the source text]

where a coefficient (symbol not legible in the source text) denotes the classification type of the corresponding same historical derivative chain, taking the value 0.9 for the first class, 0.8 for the second class, and 1 for the third class; $n$ denotes the total number of history sensitive words contained in the chain; $A_i$ denotes the historical modification distance of the $i$-th history sensitive word in the chain; $B_i(t, p)$ denotes the first sensitivity contribution degree obtained by the $i$-th history sensitive word in the chain based on the historical search time set $t$ and the historical search frequency set $p$; and three further symbols, likewise not legible in the source text, denote respectively the total sensitivity contribution degree of all history sensitive words in the chain based on $t$ and $p$, the time derivative factor of the $i$-th history sensitive word determined from the second graph, and the sensitivity contribution factor of the $i$-th history sensitive word determined from the first graph;

and extracting the first sensitive words based on all the first sensitive contribution degrees and the second sensitive contribution degrees.

In a possible implementation manner, the current judgment difference refers to a difference between an actual judgment result of the same sub-judgment condition and a preset judgment standard of the sub-judgment condition.

In a possible implementation manner, the process of replacing the pinyin of any character of a sensitive word in the sensitive word database based on the pinyin-near-pinyin table includes:

determining a first word extension subset for each second sensitive word in the sensitive word database, and constructing a first extension sublist for each first word extension subset;

merging the sensitive words that completely overlap across the first extension sublists, and obtaining a plurality of second word extension subsets with the largest coverage, wherein the second word extension subsets comprise: first extension subsets, each consisting of a single sensitive word, and second extension subsets, each consisting of a plurality of sensitive words;

determining a first number of the first extension subsets and a second number of the second extension subsets, and comparing the first number with the second number;

if the first number is larger than or equal to the second number, preferentially setting a first replacement label for the first extension subset according to the use frequency of the sensitive words in the first extension subset, and replacing any character pinyin in the sensitive words in the corresponding first extension subset in sequence according to the label replacement time attribute of the first replacement label;

otherwise, preferentially determining a central sensitive word in the second extension subset, and determining the extension meaning of each extension sensitive word extending based on the central sensitive word, and a first pinyin modification position-time set of each extension sensitive word based on the central sensitive word and a second pinyin modification position-time set based on adjacent sensitive words;

acquiring the extension degree of the corresponding extension sensitive word based on the extension meaning, the first pinyin modification position-time set and the second pinyin modification position-time set;

converting the extension degree of each extension sensitive word in the corresponding second extension subset into a corresponding extension sequence according to a standard conversion rule, and obtaining the extension sequence corresponding to the second extension subset according to the sequence generation order of the extension sensitive words;

according to the size consistency of each sequence in the extension sequences, performing region splitting on the extension sequences to obtain a plurality of sub-regions, and respectively performing setting of second replacement labels on the corresponding sub-regions according to the mode sequence of each sub-region;

and sequentially replacing any character pinyin of the extension sensitive words in different sub-areas in the same second extension subset according to the label replacement time attribute of the second replacement label.

Compared with the prior art, the beneficial effects of this application are as follows:

by using the existing sensitive word database and the near-sound tables, the near-sound variants of sensitive words that the black industry may use are predicted in advance, which addresses the high likelihood of omissions and the long turnaround of the whole review pipeline and further improves the effectiveness of the extracted keywords.

Additional features and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention. The objectives and other advantages of the invention will be realized and attained by the structure particularly pointed out in the written description and claims hereof as well as the appended drawings.

The technical solution of the present invention is further described in detail by the accompanying drawings and embodiments.

Drawings

The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the principles of the invention and not to limit the invention. In the drawings:

FIG. 1 is a flowchart of a method for expanding near-sound sensitive words according to an embodiment of the present invention;

FIG. 2 is a diagram of a specific embodiment of the method for expanding near-sound sensitive words according to the present invention.

Detailed Description

The preferred embodiments of the present invention will be described in conjunction with the accompanying drawings, and it will be understood that they are described herein for the purpose of illustration and explanation and not limitation.

Example 1:

The invention provides a method for expanding near-sound sensitive words which, as shown in FIG. 1, comprises the following steps:

step 1: combining the pinyin in the legal pinyin table pairwise;

Preferably, the legal pinyin table and the pinyin-Chinese character table contain only the pinyin spellings and do not include tones.

In this embodiment, the edit distance may be calculated as the Levenshtein distance. For example, if the pairwise combination is pinyin "li" and pinyin "ni", their edit distance is 1, since only one step is required: replacing "l" with "n".
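For reference, the standard Levenshtein recurrence and the "li"/"ni" case can be written out as below. The notation ($d$, $a$, $b$) is introduced here for illustration and does not appear in the source.

```latex
% d(i, j): edit distance between the first i letters of a and the first j letters of b
d(i, j) =
\begin{cases}
  \max(i, j), & \min(i, j) = 0, \\[4pt]
  \min\bigl\{\, d(i-1, j) + 1,\; d(i, j-1) + 1,\; d(i-1, j-1) + [a_i \neq b_j] \,\bigr\}, & \text{otherwise.}
\end{cases}

% Worked case a = \text{li},\ b = \text{ni}:
d(1,1) = d(0,0) + [\text{l} \neq \text{n}] = 1, \qquad
d(2,2) = \min\{\, d(1,2) + 1,\; d(2,1) + 1,\; d(1,1) + [\text{i} \neq \text{i}] \,\} = \min\{3, 3, 1\} = 1 .
```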

In this embodiment, the legal pinyin table may be formed from the existing Chinese pinyin. The pinyin-near-pinyin table is formed from pairwise combinations of mutually similar pinyin, such as "li" and "ni" above, and represents the mapping between a pinyin and its near pinyin. The pinyin-Chinese character table may likewise be constructed from the existing pinyin and the Chinese characters corresponding to them, and represents the mapping between a pinyin and the homophonic Chinese characters pronounced with that pinyin.

In this embodiment, the sensitive word database is pre-constructed.

In this embodiment, a candidate near-sound word is a word obtained after near-pinyin replacement.

In this embodiment, pre-judging means that the candidate near-sound words are judged manually: a candidate that can be retained is added to the corresponding sensitive word database as a supplement, and a candidate that cannot be retained is removed directly.

In this embodiment, the pinyin of any character of a sensitive word in the sensitive word database is replaced based on the pinyin-near-pinyin table. For example, for the pinyin "zhuhousouu", the "zhi" in the pinyin can be replaced with "zi"; a homophone character table is then used to map the pinyin after near-pinyin replacement into characters, that is, "zhuhouu" can be mapped to "character search". The specific flow is shown in FIG. 2.
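The replacement and mapping step might look like the sketch below. It assumes one syllable is replaced at a time and that the untouched characters of the word are kept as they are; the source does not spell out either point. The function name and the toy tables are hypothetical.

```python
def candidate_near_words(chars, syllables, near_table, pinyin_to_hanzi):
    """Generate candidate words by swapping one syllable for a near pinyin
    and mapping that near pinyin back to homophonic characters."""
    candidates = []
    for i, syllable in enumerate(syllables):
        for near in sorted(near_table.get(syllable, ())):
            for hanzi in pinyin_to_hanzi.get(near, ()):
                word = chars[:i] + hanzi + chars[i + 1:]
                if word != chars and word not in candidates:
                    candidates.append(word)
    return candidates


# Toy tables (hypothetical data, toneless pinyin).
toy_near_table = {"li": {"ni"}, "zi": {"zhi"}}
toy_pinyin_to_hanzi = {"ni": ["你", "尼"], "zhi": ["只"]}

candidates = candidate_near_words(
    "李子", ["li", "zi"], toy_near_table, toy_pinyin_to_hanzi
)
```

Every word produced this way is only a candidate; it still has to pass the manual pre-judgment before it is added to the sensitive word database.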

In this embodiment, near pinyin are defined by the Levenshtein distance and used to generate near-sound characters, which in turn expand the sensitive word database: possible variants of the sensitive words are generated in advance, unreadable variants are eliminated by manual review, and the readable variants are added to the sensitive word database, so that spam text can be recognized.

Meanwhile, compared with solutions that use complex language models such as BERT, adding sensitive words iterates quickly and has a short cycle.

The beneficial effects of the above technical scheme are: by using the existing sensitive word database and the near-sound tables, the near-sound variants of sensitive words that the black industry may use are predicted in advance, which addresses the high likelihood of omissions and the long turnaround of the whole review pipeline and further improves the effectiveness of the extracted keywords.

Example 2:

based on the embodiment 1, the editing distance of each pairwise combination result is obtained, and a near-pinyin group is extracted according to the editing distance to construct and obtain a pinyin-near-pinyin list, which includes:

calculating the edit distance of the pairwise combination result;

judging whether the editing distance is greater than a preset distance;

In this embodiment, the preset distance may be set to 2.

The beneficial effects of the above technical scheme are: by comparing distances, near pinyin and non-near pinyin can be preliminarily and effectively distinguished, which provides a basis for subsequently predicting near-sound word variants.

Example 3:

On the basis of embodiment 1, pre-judging the candidate near-sound words to supplement and expand the sensitive word database includes:

acquiring a manual judgment instruction for a candidate near-sound word, and judging whether the judgment instruction is related to a preset sensitive condition;

and when the judgment instruction is not related to the preset sensitive condition, discarding the candidate sensitive word.

In this embodiment, for example, suppose there is a sensitive word a1. If the corresponding judgment instruction is that a certain word in a1 may be used as a sensitive word, and that word also satisfies the preset sensitive condition, the judgment instruction is regarded as related. The preset sensitive condition may cover whether the expanded sensitive word is easy to read, whether its meaning is clearly directed, whether it is ambiguous, and the like.

The beneficial effects of the above technical scheme are: whether the candidate sensitive words are discarded or not is determined by analyzing the manual judgment instruction, and the rationality of the sensitive word database is ensured.

Example 4:

based on the embodiment 1, the combination of every two pinyin in the legal pinyin list comprises:

performing first classification according to the pinyin length of each pinyin;

and combining the pinyin in the pinyin set in pairs.

In this embodiment, for example, the pinyin are first classified by length (length 2, length 3, and so on); then, within each length, a second classification is made by the alphabetical order of the 26 letters of the initial letter. The resulting pinyin sets, that is, the sets built from the first letters under each length, are then combined pairwise.
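One plausible way to use this two-level classification is sketched below. The source does not say how the classes drive the pairing, so this sketch assumes the length classes are used to skip pairs whose length difference alone already exceeds the preset distance (the Levenshtein distance between two strings is never smaller than the difference of their lengths). All names are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def classify(pinyins):
    """First classification by length, second by initial letter."""
    buckets = defaultdict(lambda: defaultdict(list))
    for p in sorted(pinyins):
        buckets[len(p)][p[0]].append(p)
    return buckets


def candidate_pairs(pinyins, preset_distance):
    """Yield only pairs that can still be within preset_distance."""
    buckets = classify(pinyins)
    by_length = {n: [p for group in groups.values() for p in group]
                 for n, groups in buckets.items()}
    lengths = sorted(by_length)
    for i, n in enumerate(lengths):
        # Pairs inside the same length class.
        yield from combinations(by_length[n], 2)
        # Pairs across length classes that are close enough in length.
        for m in lengths[i + 1:]:
            if m - n > preset_distance:
                break  # longer classes are even further away
            for a in by_length[n]:
                for b in by_length[m]:
                    yield a, b


toy_pinyins = ["li", "ni", "zhi", "shuang"]  # hypothetical subset
pairs = list(candidate_pairs(toy_pinyins, preset_distance=1))
```

In the toy run, "shuang" is never paired with anything, so no edit distance has to be computed for it.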

The beneficial effects of the above technical scheme are: through the pinyin length classification and the pinyin first letter classification, the pinyin of the pinyin list can be effectively combined in pairs, the combination efficiency is improved, and the efficiency of subsequently acquiring the editing distance is improved.

Example 5:

based on embodiment 3, when the determination instruction is related to a preset sensitive condition, automatically supplementing the candidate sensitive word to a sensitive word database, including:

extracting the current judgment difference of each sub-judgment condition;

Preferably, the current judgment difference is the difference between the actual judgment result of a sub-judgment condition and the preset judgment standard of that sub-judgment condition.

In this embodiment, a sub-judgment condition may be, for example, whether a word in the sensitive word is itself a sensitive word; the corresponding preset judgment standard is that the sensitivity heat of the word reaches a certain value, in which case the standard is satisfied.

The beneficial effects of the above technical scheme are: by determining the difference between each sub-judgment condition and its preset judgment standard, the candidate sensitive words can be supplemented automatically and effectively, which improves the accuracy of subsequently predicting near-sound word variants.

Example 6:

based on embodiment 1, in the process of pre-judging the candidate phonetic near words and implementing the complementary expansion of the sensitive word database, the method further includes:

determining an initial historical pinyin corresponding to the initial historical sensitive word in the historical derivative chain and a modified historical pinyin after each derivative, and obtaining historical pinyin tracing information;

In this embodiment, the sensitive vocabulary auxiliary change list is constructed in order to further supplement the sensitive word database, ensure that the supplement is reasonable, and greatly enrich the data in the database.

In this embodiment, for example, suppose candidate near-sound words 1, 2, and 3 exist. The initial historical sensitive words related to candidates 1, 2, and 3 are retrieved from the sensitive word database, that is, the initial sensitive words are found, and the initial derivative chain of each initial historical sensitive word is then obtained from the sensitive word database itself.

In this embodiment, as network technology develops, the original sensitive word may be replaced by certain characters or by alternative names in the course of determining near-sound words; the historical derivative chain can therefore be obtained effectively by also using the results of crawling sensitive words from the network.

In this embodiment, the initial derivative chain may consist of words written in characters, but after optimization some entries may be expressed in letters; the class of a historical derivative chain can therefore be determined by checking for alphabetic characters in the initial sensitive word and in the remaining sensitive words.
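The three-way classification can be sketched as follows, assuming that "alphabetic characters" means ASCII letters and that a chain is an ordered list of words starting with the initial sensitive word. The sample chains are hypothetical.

```python
def has_letters(word: str) -> bool:
    """True if the word contains at least one ASCII letter."""
    return any(ch.isascii() and ch.isalpha() for ch in word)


def classify_chain(chain) -> int:
    """Return 1, 2, or 3 for the first, second, or third class."""
    initial, rest = chain[0], chain[1:]
    if has_letters(initial):
        return 1  # initial sensitive word already contains letters
    if not any(has_letters(word) for word in rest):
        return 2  # no letters anywhere in the chain
    return 3      # letters appear only in later derivations


first_class = classify_chain(["vx号", "微信"])
second_class = classify_chain(["微信", "威信"])
third_class = classify_chain(["微信", "vx"])
```

The ASCII check matters because Python's `str.isalpha` also returns true for Chinese characters.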

In this embodiment, judging alphabetic characters provides a basis for determining how sensitive words change. For example, if the initial derivative chain is 00-01-02-1 and the optimized historical derivative chain is 00-01-001-02-002-1, the optimization may consist of adding sensitive words to the initial derivative chain, and the like.

In this embodiment, the initial historical pinyin is the pinyin of the initial sensitive word, and a modified historical pinyin is recorded after each derivation. For example, when 00 derives 01, the modified historical pinyin is that of 01 relative to 00. The historical pinyin tracing information is thus the sequence of pinyin modifications in which 00 derives 01, 01 derives 001, and so on until 002 derives 1.

In this embodiment, the historical search time set consists of all the search time points of a sensitive word from the start of its derivation to the end time, and the historical search frequency set consists of the search counts of that sensitive word over the same period.

In this embodiment, the first graph is formed from the search time points and the number of searches at each time point, and the second graph is formed from the total duration and the per-stage durations. For the historical derivative chain 00-01-001-02-002-1, the historical derivative total duration is the time taken to derive from 00 to 1, and the duration of each segment, such as 00-01 or 01-001, is the derivative duration of that stage.
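The data behind the second graph can be sketched as below, assuming each word in a chain carries a timestamp of when it was derived. That timestamp field, the function name, and the dates are assumptions; the source does not say how derivation times are recorded.

```python
from datetime import datetime, timedelta


def derivation_durations(first_seen):
    """Return per-stage durations and the total derivation duration.

    first_seen is a list of (word, datetime) pairs in derivation order.
    """
    stages = [
        (prev_word, word, t - prev_t)
        for (prev_word, prev_t), (word, t) in zip(first_seen, first_seen[1:])
    ]
    total = sum((d for _, _, d in stages), timedelta(0))
    return stages, total


# First three links of the example chain, with hypothetical dates.
chain = [
    ("00", datetime(2022, 1, 1)),
    ("01", datetime(2022, 1, 4)),
    ("001", datetime(2022, 1, 10)),
]
stages, total = derivation_durations(chain)
```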

In this embodiment, the graph is pre-analyzed, i.e. mainly to determine the sensitive contribution degree of the sensitive words and the derivative chains.

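In this embodiment, the first preset sensitivity contribution degree and the second preset sensitivity degree are set in advance. If the second sensitivity contribution degree is greater than the second preset sensitivity degree, the mapping relation between the first sensitive words and the corresponding historical derivative chain is constructed; if not, only the mapping relation of the first sensitive words is constructed. The main purpose is to rule out sensitive words whose sensitivity contribution is negative.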

In this embodiment, the sensitive change capturing model is trained on samples consisting of various mapping relations and their corresponding rules. Analyzing the first mapping relations and the second mapping relations with the model therefore yields the sensitive change rule, from which the sensitive vocabulary auxiliary change list is built. The list contains the various forms that a sensitive word can change into (changes of pinyin, letters, characters, and the like) and is used to determine the adjustment type applicable to a candidate near-sound word, for example, adjusting the candidate to a form expressed with a certain character. Finally, the candidate near-sound words and the adjusted near-sound words are judged together to enrich the sensitive word database.

The beneficial effects of the above technical scheme are: by constructing historical derivative chains and analyzing the different graphs, the different mapping relations can be obtained effectively; model analysis then yields the adjustment type and hence the adjusted near-sound words. This improves on candidates obtained purely from pinyin changes, further enriches the sensitive word database, provides a further basis for addressing the high likelihood of omissions and the long turnaround of the whole pipeline, and indirectly improves the effectiveness of the extracted keywords.

Example 7:

based on embodiment 6, obtaining a first sensitivity contribution degree corresponding to different history sensitive words in the same history derivative chain and a second sensitivity contribution degree corresponding to the same history derivative chain includes:

where sum{t_i} represents the total search duration of the historical search time set of the i-th history sensitive word in the corresponding same historical derivative chain; t_i represents the historical search time set of the corresponding i-th history sensitive word; sum{p_i} represents the total search frequency of the historical search frequency set of the i-th history sensitive word in the corresponding same historical derivative chain; p_i represents the historical search frequency set of the corresponding i-th history sensitive word; g_i represents the counted search frequency of the i-th history sensitive word; T_i represents the derivative duration corresponding to the i-th history sensitive word in the corresponding same historical derivative chain;

wherein,

representing the total sensitivity contribution degree obtained by all history sensitive words in the corresponding same historical derivative chain based on the historical search time set t and the historical search frequency set p;

representing the sensitivity contribution factor, determined based on the first graph, corresponding to the i-th history sensitive word in the same historical derivative chain;

In this embodiment, the time derivative factor and the sensitivity contribution factor may be a relative time ratio and a relative sensitivity contribution ratio, respectively, and each takes a value in the range (0, 1).

In this embodiment, the historical modification distance should be 0, 1 or 2, mainly considering the case that the corresponding edit distance satisfies the preset distance.
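The distance constraint above can be illustrated with an ordinary Levenshtein edit distance between two toneless pinyin strings. This is only a minimal sketch: the function names and the sample pinyin pairs are hypothetical, and the preset distance of 2 is an assumption inferred from the allowed values 0, 1 and 2.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


# Assumed preset distance; the text only states that 0, 1 or 2 are allowed.
PRESET_DISTANCE = 2


def is_valid_modification(old_pinyin: str, new_pinyin: str,
                          preset: int = PRESET_DISTANCE) -> bool:
    """True when the modification distance does not exceed the preset distance."""
    return edit_distance(old_pinyin, new_pinyin) <= preset
```

Under these assumptions, a pair such as "zhang" and "zang" (distance 1) would count as a valid historical modification, whereas "zhuang" and "an" (distance 4) would not.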

The beneficial effects of the above technical scheme are: by respectively calculating the first sensitive contribution degree in the same historical derivative chain and the second sensitive contribution degree of the corresponding derivative chain according to the formula, the subsequent sensitive words can be effectively extracted, the extraction efficiency is improved, the efficiency of establishing a mapping relation can be improved, and the reliability of the subsequent prediction of the near word variants is ensured.

Example 8:

based on the embodiment 1, the process of replacing any pinyin of characters in sensitive words in the sensitive word database based on the pinyin-near pinyin table includes:

performing complete overlapping processing on the sensitive words of each first extension sub-list, and acquiring a plurality of second word extension subsets with the maximum inclusion range, wherein the second word extension subsets comprise: a first extended subset consisting of only a single sensitive word and a second extended subset consisting of a plurality of sensitive words;

determining a first number of the first extended subset and a second number of the second extended subset and comparing the first number with the second number;

converting the extension degree of each extension sensitive word in the corresponding second extension subset into a corresponding extension sequence according to a standard conversion rule, and obtaining the extension sequence corresponding to the second extension subset according to the sequence of the extension sensitive words;

In this embodiment, for example, the second sensitive words 1, 2, 3 and 4 exist in the sensitive word database, and a corresponding first word extension subset is constructed for each of them: the first word extension subset corresponding to the second sensitive word 1 is {1, 2}, that corresponding to the second sensitive word 2 is {1, 2, 3}, that corresponding to the second sensitive word 3 is {2, 3}, and that corresponding to the second sensitive word 4 is {4}. After the complete overlapping processing, the obtained second word extension subsets are {4} and {1, 2, 3}, the number of each kind being one; in this case, either of the two subsets may be selected at random for priority processing.
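The example above can be reproduced with a short sketch. It assumes that "complete overlapping processing" means discarding every subset that is wholly contained in another subset, so that only the subsets with the maximum inclusion range remain; the function name `complete_overlap` is hypothetical.

```python
def complete_overlap(subsets):
    """Keep only the subsets that are not strictly contained in another subset."""
    unique = {frozenset(s) for s in subsets}
    maximal = [s for s in unique if not any(s < other for other in unique)]
    # Sort by size, then by content, so the result is deterministic.
    return sorted((set(s) for s in maximal), key=lambda s: (len(s), sorted(s)))


# First word extension subsets built for the second sensitive words 1, 2, 3 and 4.
first_subsets = [{1, 2}, {1, 2, 3}, {2, 3}, {4}]
second_subsets = complete_overlap(first_subsets)

# Split the result into single-word and multi-word extended subsets.
first_extended = [s for s in second_subsets if len(s) == 1]
second_extended = [s for s in second_subsets if len(s) > 1]
print(second_subsets, len(first_extended), len(second_extended))
```

With these inputs the first number and the second number are both one, which corresponds to the case in which either subset may be processed first.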

In this embodiment, the usage frequency of a sensitive word refers to its search frequency, that is, the search frequency for the sensitive word obtained from the sensitive word database. The higher the usage frequency, the higher the probability that the sensitive word needs to be replaced, and therefore the earlier its label time for preferential replacement. For example, suppose there are two first extended subsets and the usage frequency of the sensitive words in the first of them is higher than that of the sensitive words in the second. The sensitive words in the first subset are then replaced with priority until all of them have been replaced, after which the sensitive words in the second subset are replaced. Setting a priority replacement order in this way on the one hand ensures that the replacement is carried out effectively, and on the other hand improves the replacement efficiency.
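A priority replacement order of this kind might be sketched as follows. The text does not say how the frequencies of several words in one subset are combined, so taking the highest search frequency in the subset is an assumption, as are the function name and the sample frequencies.

```python
def replacement_order(subsets, frequency):
    """Order subsets so that the most frequently searched one is replaced first."""
    def subset_frequency(subset):
        # Assumption: a subset is represented by its most frequently searched word.
        return max(frequency[word] for word in subset)
    return sorted(subsets, key=subset_frequency, reverse=True)


# Hypothetical search frequencies taken from a sensitive word database.
frequency = {"word_a": 120, "word_b": 45, "word_c": 300}
subsets = [{"word_a", "word_b"}, {"word_c"}]
order = replacement_order(subsets, frequency)
print(order)
```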

In this embodiment, the central sensitive word in the second extended subset may be extracted during the complete overlapping processing. For example, for the subset 1, 2, 3, the corresponding central sensitive word may be 2. That is, the sensitive word with the largest number of overlaps is regarded as the central sensitive word, and if several sensitive words share the largest number of overlaps, any one of them is selected as the central sensitive word.
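The selection of the central sensitive word can be sketched as a simple count. The sketch assumes that the "number of overlaps" of a word is the number of first word extension subsets that contain it; the function name is hypothetical, and ties are broken by taking the smallest candidate only to keep the result deterministic, since the text allows an arbitrary choice.

```python
from collections import Counter


def central_sensitive_word(first_subsets, second_subset):
    """Return the word of second_subset that occurs in the most first subsets."""
    counts = Counter(
        word
        for subset in first_subsets
        for word in subset
        if word in second_subset
    )
    best = max(counts.values())
    # Any word with the maximum count is acceptable; min() is an arbitrary pick.
    return min(word for word, count in counts.items() if count == best)


first_subsets = [{1, 2}, {1, 2, 3}, {2, 3}, {4}]
center = central_sensitive_word(first_subsets, {1, 2, 3})
print(center)
```

Here word 2 appears in three first word extension subsets while words 1 and 3 appear in two each, so word 2 is selected.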

In this embodiment, the extension meaning refers to the network meaning of the extension sensitive word, and a pinyin modification position-time set refers to the set formed by the positions of the modified letters and the corresponding modification times. The extension degree is obtained from these: the more modification time points, the more modification positions and the more sensitive the extension meaning, the deeper the corresponding extension degree. Accordingly, under the standard conversion rule, a deeper extension degree is converted into a larger value in the extension sequence, and the value generally lies in (0, 1).

In this embodiment, for example, the extension sequence of the second extension subset is: 0, 0.3, 0.2, 0.1, 0, 0, 0, 1, 1, 0.8. The sequence is divided into sub-regions of consistent magnitude, for example: the sub-region 0, 0.3, 0.2, 0.1, 0, 0, 0, whose mode is 0, and the sub-region 1, 1, 0.8, whose mode is 1. In this case, the label replacement times can be set, and the order determined, according to the magnitude of the corresponding modes.
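The mode-based ordering can be sketched as follows. The sketch reads the example sequence as the values 0, 0.3, 0.2, 0.1, 0, 0, 0, 1, 1, 0.8, takes the two sub-regions as given rather than deriving them, and assumes that a sub-region with a larger mode is replaced earlier; all names are hypothetical.

```python
from statistics import mode


def region_modes(sub_regions):
    """Return the most common extension value of each sub-region."""
    return [mode(region) for region in sub_regions]


# Sub-regions of the example extension sequence, taken as given.
sub_regions = [
    [0, 0.3, 0.2, 0.1, 0, 0, 0],
    [1, 1, 0.8],
]
modes = region_modes(sub_regions)

# Assumption: the sub-region with the larger mode gets the earlier replacement time.
replacement_order = sorted(range(len(sub_regions)),
                           key=lambda k: modes[k], reverse=True)
print(modes, replacement_order)
```

Ordering by descending mode reflects the stated aim of processing replacement words with a large degree of variation first.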

The beneficial effects of the above technical scheme are: by determining the extension sub-lists and performing complete overlapping processing, the first extended subset and the second extended subset can be effectively obtained; by setting replacement labels and determining the label replacement time attribute, pinyin replacement can be carried out in an orderly sequence. For pinyin replacement within the second extended subset, the extension degree of each extension sensitive word is obtained and the second extended subset is divided into regions according to the resulting sequence, which ensures that replacement words with a large degree of variation are processed first, ensures the efficiency of variant generation, and improves the reliability of the subsequent prediction of near-sound word variants.

It will be apparent to those skilled in the art that various changes and modifications may be made in the present invention without departing from the spirit and scope of the invention. Thus, if such modifications and variations of the present invention fall within the scope of the claims of the present invention and their equivalents, the present invention is also intended to include such modifications and variations.

Claims

1. A method for extending a proximity sensitive word, comprising:

step 1: combining every two pinyins in the legal pinyin table;

step 4: pre-judging the candidate phonetic proximity words to realize the supplementary expansion of the sensitive word database.

2. The method for expanding the phonetic close sensitive words according to claim 1, wherein the step of obtaining the edit distance of each pairwise combination result, extracting a phonetic close pinyin group according to the edit distance, and constructing a pinyin-phonetic close pinyin list comprises the steps of:

calculating the editing distance of the pairwise combination result;

judging whether the editing distance is greater than a preset distance;

3. The method for expanding the phonetic proximity sensitive words according to claim 1, wherein the candidate phonetic proximity words are pre-judged to realize the supplementary expansion of the sensitive word database, and the method comprises the following steps:

4. The method for extending a proximity-sensitive word as recited in claim 1, wherein the legal pinyin table and the pinyin-Chinese character table include only the pinyin of each pronunciation and no tones.

5. The method of claim 1, wherein combining every two pinyins in the legal pinyin table comprises:

performing first classification according to the pinyin length of each pinyin;

performing second classification on the pinyin in each first classification result according to the type of the pinyin initial letter;

and combining the pinyin in the pinyin set in pairs.

6. The method for expanding the proximity sensitive word according to claim 3, wherein when the judgment instruction is related to a preset sensitive condition, automatically supplementing the candidate sensitive word into a sensitive word database, comprising:

extracting the current judgment difference of each judgment condition;

7. The method for expanding the phonetic proximity sensitive words according to claim 1, wherein the process of pre-judging the candidate phonetic proximity words to realize the supplementary expansion of the sensitive word database further comprises:

capturing and analyzing all the first mapping relations and the second mapping relations based on a sensitive change capturing model, outputting to obtain a sensitive change rule, and constructing a sensitive vocabulary auxiliary change list based on the sensitive change rule;

8. The method for extending phonosensitive words according to claim 7, wherein obtaining a first sensitivity contribution degree corresponding to different history sensitive words in a same history derivative chain and a second sensitivity contribution degree corresponding to the same history derivative chain comprises:

where sum{t_i} represents the total search duration of the historical search time set of the i-th history sensitive word in the corresponding same historical derivative chain; t_i represents the historical search time set of the corresponding i-th history sensitive word; sum{p_i} represents the total search frequency of the historical search frequency set of the i-th history sensitive word in the corresponding same historical derivative chain; p_i represents the historical search frequency set of the corresponding i-th history sensitive word; g_i represents the counted search frequency of the i-th history sensitive word; T_i represents the derivative duration corresponding to the i-th history sensitive word in the corresponding same historical derivative chain;

representing the total historical derivative duration of the corresponding same historical derivative chain;

wherein,

representing the classification type of the corresponding same historical derivative chain, taking the value 0.9 when the classification type is a first class, the value 0.8 when the classification type is a second class, and the value 1 when the classification type is a third class; n represents the total number of history sensitive words contained in the corresponding same historical derivative chain; a_i represents the historical modification distance of the i-th history sensitive word in the corresponding same historical derivative chain; b_i(t, p) represents the first sensitivity contribution degree obtained by the i-th history sensitive word in the corresponding same historical derivative chain based on the historical search time set t and the historical search frequency set p;

representing the sensitivity contribution factor, determined based on the first graph, corresponding to the i-th history sensitive word in the same historical derivative chain;

9. The method as claimed in claim 6, wherein the current judgment difference is a difference between the actual judgment result of the human being on the same sub-judgment condition and the preset judgment criterion of the sub-judgment condition.

10. The method for extending phonetic-near sensitive words according to claim 1, wherein the process of replacing any pinyin of a sensitive word in the sensitive word database based on the pinyin-phonetic-near pinyin table comprises:

otherwise, preferentially determining a central sensitive word in the second extension subset, and determining the extension meaning of each extension sensitive word extended based on the central sensitive word, and a first pinyin modification position-time set and a second pinyin modification position-time set of each extension sensitive word based on the central sensitive word and adjacent sensitive words;