CN114969294A - Expansion method of sound-proximity sensitive words - Google Patents


Info

Publication number
CN114969294A
Authority
CN
China
Prior art keywords
sensitive
pinyin
history
historical
words
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210613700.2A
Other languages
Chinese (zh)
Inventor
刘堡萱
文成明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Baichuan Shuan Technology Co ltd
Original Assignee
Shenzhen Baichuan Shuan Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Baichuan Shuan Technology Co ltd
Priority to CN202210613700.2A
Publication of CN114969294A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/332 Query formulation
    • G06F16/3329 Natural language query formulation or dialogue systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9532 Query formulation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/955 Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Artificial Intelligence (AREA)
  • Human Computer Interaction (AREA)
  • Computational Linguistics (AREA)
  • Machine Translation (AREA)

Abstract

The invention provides a method for expanding sound-proximity sensitive words, comprising: pairing every two pinyin in a legal pinyin table; obtaining the edit distance of each pair, extracting near-sound pinyin groups according to the edit distance, and constructing a pinyin-near-pinyin table; replacing the pinyin of any character of a sensitive word in a sensitive word database based on the pinyin-near-pinyin table, mapping the replaced pinyin back to characters based on a pinyin-Chinese character table, and constructing candidate near-sound words; and pre-judging the candidate near-sound words to supplement and expand the sensitive word database. By using the existing sensitive word database and the near-sound word list, the method predicts in advance the near-sound variants of sensitive words that black-market operators may use, which addresses the high likelihood of missed content and the long turnaround of the whole review pipeline, and further improves the effectiveness of the extracted keywords.

Description

Expansion method of sound-proximity sensitive words
Technical Field
The invention relates to the technical field of vocabulary expansion, and in particular to a method for expanding sound-proximity sensitive words.
Background
In an internet community, a single post is often seen by a very large number of people. As a result, some parties frequently publish text or image messages that steer users toward illegal platforms. Such steering messages are generally regarded as spam content, and their publishers are generally referred to as black-market operators. Because the texts they send often share similar keywords, these keywords can be treated as sensitive words, and illegal content can be handled promptly by checking whether a text contains them, thereby maintaining the community atmosphere; this is content anti-spam. However, publishers of illegal content often evade the extracted sensitive words by trial and error, which reduces the number of sensitive-word hits. How to update the sensitive word database is therefore crucial to content anti-cheating. At present, most platforms supplement the sensitive word database by manual extraction.
Although supplementing the sensitive word database by manually extracting keywords achieves a certain anti-cheating effect, it does not fundamentally solve the problem. Each sensitive word is obtained only after a black-market operator has posted it and a human has reviewed and extracted it, so the whole process takes a long time, and there is no way to guarantee that every illegal text is identified, so many are missed. Even when keywords are extracted successfully, the newly added keywords are not necessarily effective, because black-market operators update their techniques quickly.
Therefore, the invention provides a method for expanding sound-proximity sensitive words.
Disclosure of Invention
The invention provides a method for expanding sound-proximity sensitive words. By using the existing sensitive word database and the near-sound word list, the method predicts in advance the near-sound variants of sensitive words that black-market operators may use, which addresses the high likelihood of missed content and the long turnaround of the whole review pipeline, and further improves the effectiveness of the extracted keywords.
The invention provides a method for expanding sound-proximity sensitive words, comprising:
step 1: pairing every two pinyin in the legal pinyin table;
step 2: obtaining the edit distance of each pair, extracting near-sound pinyin groups according to the edit distance, and constructing a pinyin-near-pinyin table;
step 3: replacing the pinyin of any character of a sensitive word in the sensitive word database based on the pinyin-near-pinyin table, mapping the replaced pinyin back to characters based on the pinyin-Chinese character table, and constructing candidate near-sound words;
step 4: pre-judging the candidate near-sound words to supplement and expand the sensitive word database.
In one possible implementation, obtaining the edit distance of each pair, extracting near-sound pinyin groups according to the edit distance, and constructing the pinyin-near-pinyin table includes:
calculating the edit distance of each pair;
judging whether the edit distance is greater than a preset distance;
if so, determining that the first pinyin and the second pinyin of the pair are not near-sound pinyin of each other;
otherwise, determining that the first pinyin and the second pinyin of the pair are near-sound pinyin of each other;
extracting all groups of mutually near-sound pinyin, and constructing the pinyin-near-pinyin table.
In one possible implementation, pre-judging the candidate near-sound words to supplement and expand the sensitive word database includes:
obtaining a human reviewer's judgment instruction for a candidate near-sound word, and judging whether the judgment instruction is related to a preset sensitive condition;
when the judgment instruction is related to the preset sensitive condition, automatically adding the candidate sensitive word to the sensitive word database;
when the judgment instruction is unrelated to the preset sensitive condition, discarding the candidate sensitive word.
In one possible implementation, the legal pinyin table and the pinyin-Chinese character table include only the pinyin of pronunciations and do not include tones.
In one possible implementation, pairing every two pinyin in the legal pinyin table includes:
performing a first classification according to the length of each pinyin;
performing a second classification of the pinyin in each first classification result according to the first-letter type of the pinyin;
obtaining pinyin sets of different pinyin lengths based on the first classification result and the second classification result;
pairing the pinyin in the pinyin sets.
In one possible implementation, automatically adding the candidate sensitive word to the sensitive word database when the judgment instruction is related to the preset sensitive condition includes:
capturing the human reviewer's judgment instruction for the candidate near-sound word, and parsing the judgment instruction to obtain a plurality of sub-judgment conditions;
extracting the current judgment difference of each sub-judgment condition;
judging, based on the preset sensitive condition, whether the corresponding current judgment difference is within a preset difference range, and if so, determining that the corresponding sub-judgment condition is qualified;
when the sub-judgment conditions are qualified, determining that the judgment instruction is related to the preset sensitive condition.
In one possible implementation, the process of pre-judging the candidate near-sound words to supplement and expand the sensitive word database further includes:
tracing the sensitive word database, extracting the initial historical sensitive words related to the candidate near-sound words, constructing an initial derivative chain for each initial historical sensitive word, and optimizing the initial derivative chains using the crawled network sensitive words to obtain historical derivative chains;
judging whether the initial sensitive word in each historical derivative chain contains alphabetic characters, classifying the historical derivative chains whose initial sensitive word contains alphabetic characters as the first class, and, for the chains whose initial sensitive word contains no alphabetic characters, judging again whether the remaining historical sensitive words in the chain contain alphabetic characters;
according to the result of the second judgment, classifying the historical derivative chains whose remaining sensitive words contain no alphabetic characters as the second class, and the remaining historical derivative chains as the third class;
determining the initial historical pinyin corresponding to the initial historical sensitive word in each historical derivative chain and the modified historical pinyin after each derivation, to obtain historical pinyin tracing information;
collecting the historical search time sets and historical search frequency sets of the different historical sensitive words in the same historical derivative chain, and constructing a first graph of that chain;
obtaining, from the historical pinyin tracing information, the total historical derivative duration of the same historical derivative chain and the derivative duration of each derivation stage, and constructing a second graph;
pre-analyzing the first graph and the second graph based on a curve analysis model and, in combination with the classification type of the chain, obtaining the first sensitivity contribution degree of each historical sensitive word in the same historical derivative chain and the second sensitivity contribution degree of the chain itself;
extracting, from the same historical derivative chain, the first sensitive words whose first sensitivity contribution degree is greater than a first preset sensitivity contribution degree, and sorting all first sensitive words in the chain by first sensitivity contribution degree;
when the second sensitivity contribution degree is greater than a second preset sensitivity degree, establishing, according to the sorting result, a first mapping relation between the corresponding first sensitive words and the corresponding historical derivative chain;
otherwise, establishing, according to the sorting result, a second mapping relation for the corresponding first sensitive words;
capturing and analyzing all first mapping relations and second mapping relations based on a sensitive-change capturing model, outputting a sensitive-change rule, and constructing an auxiliary sensitive-vocabulary change list based on the sensitive-change rule;
determining a vocabulary adjustment type for the candidate near-sound words based on the auxiliary sensitive-vocabulary change list;
determining the corresponding adjusted near-sound words based on the vocabulary adjustment type.
In one possible implementation, obtaining the first sensitivity contribution degree of each historical sensitive word in the same historical derivative chain and the second sensitivity contribution degree of the chain includes:
calculating the first sensitivity contribution degree $B_i(t,p)$ of each historical sensitive word in the same historical derivative chain according to the following formula:
[Formula image: Figure BDA0003672731900000041]
where $\mathrm{sum}\{t_i\}$ represents the total search duration of the historical search time set of the $i$-th historical sensitive word in the same historical derivative chain; $t_i$ represents the historical search time set of the $i$-th historical sensitive word; $\mathrm{sum}\{p_i\}$ represents the total search frequency of the historical search frequency set of the $i$-th historical sensitive word in the same historical derivative chain; $p_i$ represents the historical search frequency set of the $i$-th historical sensitive word; $g_i$ represents the counted search frequency of the $i$-th historical sensitive word; $T_i$ represents the derivative duration corresponding to the $i$-th historical sensitive word in the same historical derivative chain;
[Symbol image: Figure BDA0003672731900000051]
represents the total historical derivative duration of the same historical derivative chain;
[Symbol image: Figure BDA0003672731900000052]
represents the counted total search frequency of all historical sensitive words in the historical derivative chain;
calculating the second sensitivity contribution degree $Y$ of the same historical derivative chain according to the following formula:
[Formula image: Figure BDA0003672731900000053]
where
[Symbol image: Figure BDA0003672731900000054]
represents the classification type of the same historical derivative chain, taking the value 0.9 for the first class, 0.8 for the second class, and 1 for the third class; $n$ represents the total number of historical sensitive words contained in the same historical derivative chain; $A_i$ represents the historical modification distance of the $i$-th historical sensitive word in the same historical derivative chain; $B_i(t,p)$ represents the first sensitivity contribution degree of the $i$-th historical sensitive word in the same historical derivative chain, obtained from the historical search time set $t$ and the historical search frequency set $p$;
[Symbol image: Figure BDA0003672731900000055]
represents the total sensitivity contribution degree of all historical sensitive words in the same historical derivative chain, obtained from the historical search time set $t$ and the historical search frequency set $p$;
[Symbol image: Figure BDA0003672731900000056]
represents the time derivative factor of the $i$-th historical sensitive word in the same historical derivative chain, determined from the second graph;
[Symbol image: Figure BDA0003672731900000057]
represents the sensitivity contribution factor of the $i$-th historical sensitive word in the same historical derivative chain, determined from the first graph;
extracting the first sensitive words based on all first sensitivity contribution degrees and second sensitivity contribution degrees.
In one possible implementation, the current judgment difference is the difference between the actual judgment result of a sub-judgment condition and the preset judgment standard of that same sub-judgment condition.
In one possible implementation, replacing the pinyin of any character of a sensitive word in the sensitive word database based on the pinyin-near-pinyin table includes:
determining a first word extension subset for each second sensitive word in the sensitive word database, and constructing a first extension sublist for each first word extension subset;
performing complete-overlap processing of sensitive words on the first extension sublists to obtain a plurality of second word extension subsets with the largest coverage, the second word extension subsets comprising: first extension subsets, each consisting of a single sensitive word, and second extension subsets, each consisting of a plurality of sensitive words;
determining a first number of the first extension subsets and a second number of the second extension subsets, and comparing the first number with the second number;
if the first number is greater than or equal to the second number, first setting a first replacement label for each first extension subset according to the usage frequency of the sensitive word in that subset, and replacing the pinyin of any character of the sensitive words in the corresponding first extension subsets in turn according to the label replacement time attribute of the first replacement label;
otherwise, first determining the central sensitive word in each second extension subset, and determining the extension meaning of each extension sensitive word derived from the central sensitive word, a first pinyin modification position-time set of each extension sensitive word relative to the central sensitive word, and a second pinyin modification position-time set relative to its adjacent sensitive words;
obtaining the extension degree of the corresponding extension sensitive word based on the extension meaning, the first pinyin modification position-time set, and the second pinyin modification position-time set;
converting the extension degree of each extension sensitive word in the corresponding second extension subset into a corresponding extension sequence according to a standard conversion rule, and obtaining the extension sequence of the second extension subset according to the order in which the extension sensitive words were generated;
splitting the extension sequence into a plurality of sub-regions according to the size consistency of the sequences within it, and setting a second replacement label for each sub-region according to the mode sequence of that sub-region;
replacing, in turn, the pinyin of any character of the extension sensitive words in the different sub-regions of the same second extension subset according to the label replacement time attribute of the second replacement label.
Compared with the prior art, the beneficial effects of this application are as follows:
By using the existing sensitive word database and the near-sound word list, the method predicts in advance the near-sound variants of sensitive words that black-market operators may use, which addresses the high likelihood of missed content and the long turnaround of the whole review pipeline, and further improves the effectiveness of the extracted keywords.
Additional features and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention. The objectives and other advantages of the invention will be realized and attained by the structure particularly pointed out in the written description and claims hereof as well as the appended drawings.
The technical solution of the present invention is further described in detail by the accompanying drawings and embodiments.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the principles of the invention and not to limit the invention. In the drawings:
FIG. 1 is a flowchart of a method for expanding sound-proximity sensitive words according to an embodiment of the present invention;
FIG. 2 is a diagram illustrating a specific implementation of the method for expanding sound-proximity sensitive words according to an embodiment of the present invention.
Detailed Description
The preferred embodiments of the present invention will be described in conjunction with the accompanying drawings, and it will be understood that they are described herein for the purpose of illustration and explanation and not limitation.
Example 1:
The invention provides a method for expanding sound-proximity sensitive words which, as shown in FIG. 1, comprises:
step 1: pairing every two pinyin in the legal pinyin table;
step 2: obtaining the edit distance of each pair, extracting near-sound pinyin groups according to the edit distance, and constructing a pinyin-near-pinyin table;
step 3: replacing the pinyin of any character of a sensitive word in the sensitive word database based on the pinyin-near-pinyin table, mapping the replaced pinyin back to characters based on the pinyin-Chinese character table, and constructing candidate near-sound words;
step 4: pre-judging the candidate near-sound words to supplement and expand the sensitive word database.
Preferably, the legal pinyin table and the pinyin-Chinese character table include only the pinyin of pronunciations and do not include tones.
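As an illustration of keeping only toneless pinyin in the tables, the following is a minimal Python sketch. The function name `strip_tone` and the accepted input forms (tone diacritics or trailing tone digits) are assumptions made for this example and are not specified in the source.

```python
import unicodedata

# Combining marks used for the four pinyin tones (macron, acute, caron, grave).
TONE_MARKS = {"\u0304", "\u0301", "\u030c", "\u0300"}


def strip_tone(syllable: str) -> str:
    """Return the syllable without tone marks or trailing tone digits.

    Hypothetical helper: the source only states that the tables carry no tones.
    """
    # Decompose accented letters into base letter + combining marks.
    decomposed = unicodedata.normalize("NFD", syllable)
    # Drop only the tone marks; the diaeresis of "ü" is kept.
    kept = "".join(ch for ch in decomposed if ch not in TONE_MARKS)
    # Recompose (e.g. u + diaeresis -> "ü") and drop numeric tone suffixes.
    return unicodedata.normalize("NFC", kept).rstrip("012345")


print(strip_tone("nǐ"), strip_tone("zhi1"))
```

Normalizing every entry this way before building the tables means that syllables differing only in tone collapse to one key.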
In this embodiment, the edit distance may be calculated using the Levenshtein method. For example, if a pair consists of the pinyin "li" and the pinyin "ni", their edit distance is 1, because only one step is needed: replacing "l" with "n".
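The edit distance described above can be sketched with the standard dynamic-programming form of the Levenshtein distance. The function name `levenshtein` is chosen for this example; the source does not prescribe a particular implementation.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    or substitutions needed to turn string a into string b."""
    # previous[j] holds the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to the empty string
        for j, char_b in enumerate(b, start=1):
            substitution_cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,                      # delete char_a
                current[j - 1] + 1,                   # insert char_b
                previous[j - 1] + substitution_cost,  # substitute or keep
            ))
        previous = current
    return previous[-1]


print(levenshtein("li", "ni"))  # one substitution: "l" -> "n"
```

Only two rows of the table are kept at a time, which is sufficient because pinyin syllables are short and only the final distance is needed.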
In this embodiment, the legal pinyin table may consist of the existing Chinese pinyin. The pinyin-near-pinyin table is built from near-sound pairs such as "li" and "ni" above, and represents the mapping between a pinyin and its near-sound pinyin. The pinyin-Chinese character table may likewise be constructed from the existing pinyin and their corresponding Chinese characters, and represents the mapping between a pinyin and the homophonic Chinese characters pronounced with that pinyin.
In this embodiment, the sensitive word database is pre-constructed.
In this embodiment, a candidate near-sound word is a word obtained after near-sound pinyin replacement.
In this embodiment, pre-judging means that the candidate near-sound words are judged manually: a candidate that can be retained is added to the corresponding sensitive word database, and a candidate that cannot be retained is removed directly. In this embodiment, the pinyin of any character of a sensitive word in the sensitive word database is replaced based on the pinyin-near-pinyin table. For example, in the pinyin "zhuhousouu", "zhi" can be replaced with "zi", and a homophonic character table is then used to map the replaced pinyin back to characters, that is, "zhuhouu" can be mapped to "character search". The specific flow is shown in FIG. 2.
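The replacement and mapping step can be sketched as follows. This is a minimal illustration under stated assumptions: each sensitive word is supplied together with one toneless pinyin syllable per character, one syllable is replaced at a time, and the tables, the sample word, and the name `generate_candidates` are hypothetical toy data rather than anything given in the source.

```python
def generate_candidates(word, word_pinyin, near_table, pinyin_to_hanzi):
    """Return candidate near-sound variants of `word`.

    word            -- the sensitive word, one character per syllable (assumption)
    word_pinyin     -- list of toneless syllables, aligned with `word`
    near_table      -- pinyin -> set of near-sound pinyin
    pinyin_to_hanzi -- pinyin -> list of characters with that pronunciation
    """
    candidates = set()
    for position, syllable in enumerate(word_pinyin):
        # Look up the near-sound pinyin of the character at this position.
        for near in near_table.get(syllable, ()):
            # Map the replaced pinyin back to characters.
            for char in pinyin_to_hanzi.get(near, ()):
                candidate = word[:position] + char + word[position + 1:]
                if candidate != word:
                    candidates.add(candidate)
    return candidates


# Toy tables for illustration only.
near_table = {"li": {"ni"}, "cai": {"zai"}}
pinyin_to_hanzi = {"ni": ["你", "泥"], "zai": ["在"]}

print(sorted(generate_candidates("理财", ["li", "cai"], near_table, pinyin_to_hanzi)))
```

The candidates produced this way are only proposals; in the method described here they still pass through manual pre-judging before entering the sensitive word database.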
In this embodiment, near-sound pinyin is defined by the Levenshtein distance and used to generate near-sound characters, and the sensitive word database is then expanded: possible variants of the sensitive words are generated in advance, unreadable variants are eliminated by manual review, and readable variants are added to the sensitive word database, so that spam texts can be recognized.
Meanwhile, compared with solutions that use complex language recognition models such as BERT, the solution of adding sensitive words iterates quickly and has a short cycle.
The beneficial effects of the above technical solution are: by using the existing sensitive word database and the near-sound word list, the method predicts in advance the near-sound variants of sensitive words that black-market operators may use, which addresses the high likelihood of missed content and the long turnaround of the whole review pipeline, and further improves the effectiveness of the extracted keywords.
Example 2:
Based on Example 1, obtaining the edit distance of each pair, extracting near-sound pinyin groups according to the edit distance, and constructing the pinyin-near-pinyin table includes:
calculating the edit distance of each pair;
judging whether the edit distance is greater than a preset distance;
if so, determining that the first pinyin and the second pinyin of the pair are not near-sound pinyin of each other;
otherwise, determining that the first pinyin and the second pinyin of the pair are near-sound pinyin of each other;
extracting all groups of mutually near-sound pinyin, and constructing the pinyin-near-pinyin table.
In this embodiment, the preset distance may be set to 2.
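The thresholding described in this example can be sketched as below. The sketch pairs all distinct pinyin and keeps the pairs whose edit distance does not exceed the preset distance; the function names and the sample syllables are illustrative assumptions.

```python
from collections import defaultdict
from itertools import combinations


def levenshtein(a: str, b: str) -> int:
    """Levenshtein edit distance between two strings."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,
                               current[j - 1] + 1,
                               previous[j - 1] + cost))
        previous = current
    return previous[-1]


def build_near_pinyin_table(pinyins, preset_distance=2):
    """Map each pinyin to the set of pinyin judged to be near-sound."""
    table = defaultdict(set)
    for first, second in combinations(sorted(set(pinyins)), 2):
        if levenshtein(first, second) > preset_distance:
            continue  # distance above the preset value: not near-sound
        # Near-sound is mutual, so record the pair in both directions.
        table[first].add(second)
        table[second].add(first)
    return dict(table)


sample = ["li", "ni", "zhi", "zi", "zhuang"]  # toy subset of a pinyin table
print(build_near_pinyin_table(sample, preset_distance=2))
```

With short syllables, a preset distance of 2 admits many pairs (for example "li" and "zhi"), so the manual pre-judging step remains important for filtering the resulting candidates.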
The beneficial effects of the above technical solution are: by comparing distances, near-sound pinyin and non-near-sound pinyin can be distinguished effectively at a preliminary stage, which provides a basis for subsequently predicting near-sound word variants.
Example 3:
Based on Example 1, pre-judging the candidate near-sound words to supplement and expand the sensitive word database includes:
obtaining a human reviewer's judgment instruction for a candidate near-sound word, and judging whether the judgment instruction is related to a preset sensitive condition;
when the judgment instruction is related to the preset sensitive condition, automatically adding the candidate sensitive word to the sensitive word database;
when the judgment instruction is unrelated to the preset sensitive condition, discarding the candidate sensitive word.
In this embodiment, for example, suppose there is a sensitive word a1. If the corresponding judgment instruction indicates that a certain word in a1 can be used as a sensitive word, and that word also satisfies the preset sensitive condition, the judgment instruction is regarded as related. The preset sensitive condition may cover, for example, whether the expanded sensitive word is easy to read, whether its meaning is clear, and whether it is ambiguous.
The beneficial effects of the above technical solution are: whether a candidate sensitive word is discarded is determined by analyzing the manual judgment instruction, which keeps the sensitive word database reasonable.
Example 4:
Based on Example 1, pairing every two pinyin in the legal pinyin table includes:
performing a first classification according to the length of each pinyin;
performing a second classification of the pinyin in each first classification result according to the first-letter type of the pinyin;
obtaining pinyin sets of different pinyin lengths based on the first classification result and the second classification result;
pairing the pinyin in the pinyin sets.
In this embodiment, for example, the pinyin are first classified by length (length 2, length 3, and so on), and the pinyin of each length, for example length 2, are then classified a second time in the order of the 26 letters, yielding the pinyin sets to be paired; a pinyin set is a set built from the pinyin of a given length that share a given first letter.
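The two-level classification can be sketched as follows. The source does not state exactly how the resulting sets are paired, so this sketch makes an assumption: pinyin are bucketed by (length, first letter), pairs are drawn within and across buckets, and buckets whose lengths differ by more than the preset distance are skipped, since the edit distance of two strings is never smaller than their length difference. All names are illustrative.

```python
from collections import defaultdict
from itertools import combinations, product


def classify(pinyins):
    """First classification by length, second by first letter."""
    buckets = defaultdict(list)
    for pinyin in sorted(set(pinyins)):
        buckets[(len(pinyin), pinyin[0])].append(pinyin)
    return dict(buckets)


def candidate_pairs(pinyins, preset_distance=2):
    """Yield the pairs worth passing to the edit-distance step."""
    buckets = classify(pinyins)
    keys = sorted(buckets)
    for index, key_a in enumerate(keys):
        # Pairs inside one (length, first letter) set.
        yield from combinations(buckets[key_a], 2)
        for key_b in keys[index + 1:]:
            # Length difference is a lower bound on edit distance, so sets
            # that differ by more than the preset distance cannot match.
            if abs(key_a[0] - key_b[0]) > preset_distance:
                continue
            yield from product(buckets[key_a], buckets[key_b])


sample = ["li", "ni", "zhi", "zhuang", "la"]  # toy data
print(list(candidate_pairs(sample)))
```

On the toy input, 6 of the 10 possible pairs are produced, and no pair involving "zhuang" reaches the edit-distance computation.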
The beneficial effects of the above technical solution are: classifying by pinyin length and by first letter allows the pinyin in the pinyin table to be paired effectively, which improves pairing efficiency and, in turn, the efficiency of the subsequent edit-distance calculation.
Example 5:
Based on Example 3, automatically adding the candidate sensitive word to the sensitive word database when the judgment instruction is related to the preset sensitive condition includes:
capturing the human reviewer's judgment instruction for the candidate near-sound word, and parsing the judgment instruction to obtain a plurality of sub-judgment conditions;
extracting the current judgment difference of each sub-judgment condition;
judging, based on the preset sensitive condition, whether the corresponding current judgment difference is within a preset difference range, and if so, determining that the corresponding sub-judgment condition is qualified;
when the sub-judgment conditions are qualified, determining that the judgment instruction is related to the preset sensitive condition.
Preferably, the current judgment difference is the difference between the actual judgment result of a sub-judgment condition and the preset judgment standard of that same sub-judgment condition.
In this embodiment, a sub-judgment condition may be, for example, whether a word in the sensitive word is itself a sensitive word; the corresponding preset judgment standard may be that the sensitivity heat of the sensitive word reaches a certain value, in which case the preset judgment standard is considered satisfied.
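The sub-judgment check can be sketched as below. The source does not define how judgment results are represented, so this sketch assumes that each sub-judgment condition carries a numeric actual result and a numeric preset standard, that the current judgment difference is their absolute difference, and that the instruction counts as related only when every sub-judgment condition is qualified. The class, field, and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class SubCondition:
    name: str        # e.g. "readable", "unambiguous" (illustrative labels)
    actual: float    # reviewer's actual judgment result (assumed numeric)
    standard: float  # preset judgment standard for this sub-condition


def is_qualified(condition: SubCondition, max_difference: float) -> bool:
    """A sub-condition is qualified when its current judgment difference
    lies within the preset difference range."""
    return abs(condition.actual - condition.standard) <= max_difference


def review(candidate, conditions, database, max_difference=0.2):
    """Add the candidate to the database if the instruction is related
    to the preset sensitive condition; otherwise discard it."""
    related = bool(conditions) and all(
        is_qualified(c, max_difference) for c in conditions
    )
    if related:
        database.add(candidate)
    return related


database = set()
accepted = review("candidate-1", [SubCondition("readable", 0.9, 1.0)], database)
print(accepted, database)
```

Keeping the threshold as a parameter makes it easy to tighten or relax the preset difference range without changing the review flow.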
The beneficial effects of the above technical solution are: by determining the difference between each sub-judgment condition and its preset judgment standard, candidate sensitive words can be supplemented and expanded automatically and effectively, which improves the accuracy of subsequently predicting near-sound word variants.
Example 6:
Based on embodiment 1, the process of pre-judging the candidate phonetic near words to realize the supplementary expansion of the sensitive word database further includes:
tracing the sensitive word database, extracting initial historical sensitive words related to the candidate phonetic near words, constructing initial derivative chains of different initial historical sensitive words, and optimizing the initial derivative chains by combining the crawling results of the network sensitive words to obtain historical derivative chains;
judging whether the initial sensitive word in each history derivative chain contains alphabetic characters, regarding the history derivative chains whose initial sensitive word contains alphabetic characters as a first class, and, for the history derivative chains whose initial sensitive word contains no alphabetic characters, judging a second time whether the remaining history sensitive words contain alphabetic characters;
according to the result of the second judgment, extracting the history derivative chains whose remaining sensitive words contain no alphabetic characters and regarding them as a second class, and regarding the remaining history derivative chains as a third class;
determining an initial historical pinyin corresponding to the initial historical sensitive word in the historical derivative chain and a modified historical pinyin after each derivative, and obtaining historical pinyin tracing information;
counting historical search time sets and historical search frequency sets of different historical sensitive words in the same historical derivative chain, and constructing a first curve graph of the same historical derivative chain;
acquiring historical derivative total time length of the same historical derivative chain and derivative time length of each derivative stage according to the historical pinyin tracing information, and constructing to obtain a second graph;
pre-analyzing the first graph and the second graph based on a curve analysis model, and meanwhile, obtaining first sensitivity contribution degrees corresponding to different history sensitive words in the same history derivative chain and second sensitivity contribution degrees corresponding to the same history derivative chain by combining with the classification type of the same history derivative chain;
extracting first sensitive words with a first sensitivity contribution degree greater than a first preset sensitivity contribution degree from the same historical derivative chain, and sequencing all the first sensitive words in the same historical derivative chain according to the first sensitivity contribution degree;
when the second sensitivity contribution degree is greater than a second preset sensitivity degree, establishing a first mapping relation between the corresponding first sensitive word and the corresponding same historical derivative chain according to the sorting result;
otherwise, establishing a second mapping relation of the corresponding first sensitive words according to the sorting result;
based on a sensitive change capturing model, capturing and analyzing all the first mapping relations and the second mapping relations, outputting to obtain a sensitive change rule, and constructing a sensitive vocabulary auxiliary change list based on the sensitive change rule;
determining a vocabulary adjustment type for the candidate phonetic near words based on the sensitive vocabulary auxiliary change list;
and determining the corresponding adjusted phonetic near words based on the vocabulary adjustment type.
In this embodiment, the sensitive vocabulary auxiliary change list is constructed in order to further supplement the sensitive word database, ensure that the supplementation is reasonable, and greatly enrich the data in the database.
In this embodiment, suppose, for example, that candidate phonetic near words 1, 2, and 3 exist. The initial history sensitive words related to candidate phonetic near words 1, 2, and 3 are retrieved from the sensitive word database, that is, the original sensitive words are located, and the initial derivative chain of each initial history sensitive word is then obtained from the sensitive word database itself.
In this embodiment, due to the development of network technology, in the process of determining the near word, there may be a case where some characters are replaced or other alternative names are used to replace the original sensitive word, so that the history derivative chain can be effectively obtained by combining with the crawling result of the network sensitive word.
In this embodiment, the initial derivative chain may be expressed entirely in characters, but after optimization, expressions in letters and the like may appear; the class to which the corresponding history derivative chain belongs can therefore be determined by checking the initial sensitive word and the remaining sensitive words for alphabetic characters.
In this embodiment, checking for alphabetic characters provides a basis for determining the change rule of the sensitive words. For example, if the initial derivative chain is 00-01-02-1 and the optimized history derivative chain is 00-01-001-02-002-1, the optimization consisted of adding sensitive words to the initial derivative chain.
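The three-way classification of history derivative chains can be sketched as follows. The sketch assumes that "alphabetic characters" means ASCII Latin letters and that a chain is an ordered list of words with the initial sensitive word first; the function names and the sample chains are hypothetical.

```python
def has_letters(word: str) -> bool:
    """True if the word contains an ASCII Latin letter.

    str.isalpha() alone is also True for Chinese characters, so the
    isascii() check is needed to single out letters.
    """
    return any(ch.isascii() and ch.isalpha() for ch in word)


def classify_chain(chain) -> int:
    """Return 1, 2 or 3 for the first, second or third class."""
    initial, rest = chain[0], chain[1:]
    if has_letters(initial):
        return 1                      # initial word already contains letters
    if not any(has_letters(w) for w in rest):
        return 2                      # no letters anywhere in the chain
    return 3                          # letters appear only in later words


# Hypothetical chains built from harmless placeholder words.
chains = {
    "a": ["wx", "微信"],
    "b": ["苹果", "平果", "凭果"],
    "c": ["苹果", "平果", "pg"],
}
for name, chain in chains.items():
    print(name, classify_chain(chain))
```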
In this embodiment, the initial historical pinyin is the pinyin of the initial sensitive word, and the modified historical pinyin is the pinyin obtained after each derivation. For example, when 00 derives 01, the pinyin of 01 is the modified historical pinyin relative to 00. The historical pinyin tracing information is the record of pinyin modifications along the chain, that is, 00 derives 01, 01 derives 001, and so on until 002 derives 1.
In this embodiment, the historical search time set consists of all search time points of the corresponding sensitive word from the start of its derivation to the end of that period, and the historical search frequency set consists of all search counts of the word over the same period.
In this embodiment, the first graph is formed from the search time points and the number of searches at each search time point, and the second graph is formed from the total duration and the duration of each stage. For the history derivative chain 00-01-001-02-002-1, the historical derivative total duration is the time taken to derive from 00 to 1, and the duration of each segment (00-01, 01-001, and so on) is a stage derivative duration.
In this embodiment, the graph is pre-analyzed, i.e. mainly to determine the sensitive contribution degree of the sensitive words and the derivative chains.
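In this embodiment, the first preset sensitivity contribution degree and the second preset sensitivity degree are set in advance. If the second sensitivity contribution degree of a chain is greater than the second preset sensitivity degree, the mapping relation between the first sensitive words and the chain is constructed; if not, only the mapping relation of the first sensitive words is constructed, mainly in order to exclude the influence of chains whose sensitivity contribution does not reach the threshold.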
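The threshold comparison and mapping step can be sketched as below, assuming the first and second sensitivity contribution degrees have already been computed. The function name `build_mapping`, the dictionary layout of the result, the descending sort order, and the sample values are assumptions for illustration.

```python
def build_mapping(chain_id, first_contrib, second_contrib,
                  first_threshold, second_threshold):
    """Select and sort first sensitive words, then build the mapping relation.

    first_contrib: dict of word -> first sensitivity contribution degree.
    second_contrib: second sensitivity contribution degree of the whole chain.
    """
    # Keep words whose contribution exceeds the first preset value,
    # sorted from highest to lowest contribution (assumed order).
    selected = sorted(
        (w for w, b in first_contrib.items() if b > first_threshold),
        key=lambda w: first_contrib[w],
        reverse=True,
    )
    if second_contrib > second_threshold:
        # First mapping relation: words are tied to their derivative chain.
        return {"kind": "first", "words": selected, "chain": chain_id}
    # Second mapping relation: words only, the chain is left out.
    return {"kind": "second", "words": selected}


# Hypothetical contribution values.
contrib = {"w1": 0.7, "w2": 0.2, "w3": 0.9}
print(build_mapping("chain-1", contrib, 0.6, 0.5, 0.5))
print(build_mapping("chain-1", contrib, 0.4, 0.5, 0.5))
```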
In this embodiment, the sensitive change capturing model is trained on samples consisting of various mapping relations and their corresponding rules. Analyzing the first mapping relations and the second mapping relations with this model therefore yields a sensitive change rule, from which the sensitive vocabulary auxiliary change list is built. The list contains various sensitive change forms (changes of pinyin, letters, characters, and the like) and is used to determine the auxiliary type by which a candidate word can be adjusted, for example adjusting the candidate phonetic near word to a form represented by a certain character. Finally, the candidate phonetic near word and the adjusted phonetic near word are judged together so as to enrich the sensitive word database.
The beneficial effects of the above technical scheme are: by constructing the history derivative chains and analyzing the different graphs, the different mapping relations can be obtained effectively, and the adjustment type, and hence the adjusted phonetic near word, can then be obtained through model analysis. This improves on candidate phonetic near words obtained from pinyin changes alone and further enriches the sensitive word database. It also provides a basis for addressing the high likelihood of missed expansions and the long time consumption of the whole pipeline, and indirectly improves the effectiveness of the extracted keywords.
Example 7:
Based on embodiment 6, obtaining a first sensitivity contribution degree corresponding to different history sensitive words in the same history derivative chain and a second sensitivity contribution degree corresponding to the same history derivative chain includes:
calculating the first sensitivity contribution degree B_i(t, p) of the different history sensitive words in the corresponding same history derivative chain according to the following formula:
Figure BDA0003672731900000141
Where sum{t_i} represents the total search duration of the historical search time set of the i-th history sensitive word in the corresponding same history derivative chain; t_i represents the historical search time set of the corresponding i-th history sensitive word; sum{p_i} represents the total search frequency of the historical search frequency set of the i-th history sensitive word in the corresponding same history derivative chain; p_i represents the historical search frequency set of the corresponding i-th history sensitive word; g_i represents the counted search frequency of the i-th history sensitive word; T_i represents the derivative duration corresponding to the i-th history sensitive word in the corresponding same history derivative chain;
Figure BDA0003672731900000142
representing the historical derivative total time length of the corresponding same historical derivative chain;
Figure BDA0003672731900000143
representing the counted total search frequency of all history sensitive words in the corresponding history derivative chain;
calculating a second sensitivity contribution degree Y of the corresponding same historical derivative chain according to the following formula;
Figure BDA0003672731900000151
wherein,
Figure BDA0003672731900000152
representing the classification type of the corresponding same history derivative chain, taking the value 0.9 when the classification type is the first class, 0.8 when it is the second class, and 1 when it is the third class; n represents the total number of history sensitive words contained in the corresponding same history derivative chain; A_i represents the historical modification distance of the i-th history sensitive word in the corresponding same history derivative chain; B_i(t, p) represents the first sensitivity contribution degree obtained by the i-th history sensitive word in the corresponding same history derivative chain based on the historical search time set t and the historical search frequency set p;
Figure BDA0003672731900000153
representing the total sensitivity contribution degree obtained by all history sensitive words in the corresponding same history derivative chain based on the historical search time set t and the historical search frequency set p;
Figure BDA0003672731900000154
representing a time derivative factor corresponding to the ith history sensitive word in the same history derivative chain determined based on the second graph;
Figure BDA0003672731900000155
representing a sensitivity contribution factor corresponding to the ith history sensitivity word in the same history derivative chain determined based on the first graph;
and extracting the first sensitive words based on all the first sensitive contribution degrees and the second sensitive contribution degrees.
In this embodiment, the time derivative factor and the sensitivity contribution factor may be a relative time ratio and a relative sensitivity contribution ratio respectively, each with a value range of (0, 1).
In this embodiment, the historical modification distance should be 0, 1 or 2, mainly considering the case that the corresponding edit distance satisfies the preset distance.
The beneficial effects of the above technical scheme are: by respectively calculating the first sensitive contribution degree in the same historical derivative chain and the second sensitive contribution degree of the corresponding derivative chain according to the formula, the subsequent sensitive words can be effectively extracted, the extraction efficiency is improved, the efficiency of establishing a mapping relation can be improved, and the reliability of the subsequent prediction of the near word variants is ensured.
Example 8:
Based on embodiment 1, the process of replacing the pinyin of any character in the sensitive words in the sensitive word database based on the pinyin-near pinyin table includes:
determining a first word extension subset of each second sensitive word in the sensitive word database, and constructing a first extension sublist of each first word extension subset;
performing complete overlapping processing of sensitive words on the first extension sublists, and acquiring a plurality of second word extension subsets having the maximum inclusion range, wherein the second word extension subsets comprise: a first extended subset consisting of only one single sensitive word and a second extended subset consisting of a plurality of sensitive words;
determining a first number of the first extended subset and a second number of the second extended subset and comparing the first number with the second number;
if the first number is larger than or equal to the second number, preferentially setting a first replacement label for the first extension subset according to the use frequency of the sensitive words in the first extension subset, and replacing any character pinyin in the sensitive words in the corresponding first extension subset in sequence according to the label replacement time attribute of the first replacement label;
otherwise, preferentially determining a central sensitive word in the second extension subset, and determining the extension meaning of each extension sensitive word extending based on the central sensitive word, and a first pinyin modification position-time set of each extension sensitive word based on the central sensitive word and a second pinyin modification position-time set based on adjacent sensitive words;
acquiring the extension degree of the corresponding extension sensitive word based on the extension meaning, the first pinyin modification position-time set and the second pinyin modification position-time set;
converting the extension degree of each extension sensitive word in the corresponding second extension subset into a corresponding extension sequence according to a standard conversion rule, and obtaining the extension sequence corresponding to the second extension subset according to the sequence of the extension sensitive words;
according to the size consistency of each sequence in the extension sequences, performing region splitting on the extension sequences to obtain a plurality of sub-regions, and respectively performing setting of second replacement labels on the corresponding sub-regions according to the mode sequence of each sub-region;
and sequentially replacing any character pinyin of the extension sensitive words in different sub-areas in the same second extension subset according to the label replacement time attribute of the second replacement label.
In this embodiment, suppose the second sensitive words 1, 2, 3 and 4 exist in the sensitive word database. A first word extension subset is constructed for each of them: for example, the first word extension subset of second sensitive word 1 is {1, 2}, that of second sensitive word 2 is {1, 2, 3}, that of second sensitive word 3 is {2, 3}, and that of second sensitive word 4 is {4}. After the complete overlapping processing, the resulting second word extension subsets are the first extended subset {4} and the second extended subset {1, 2, 3}. There is one subset of each kind, and in this case either of the two subsets may be chosen at random for priority processing.
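One way to read the complete overlapping processing is as the removal of every subset that is fully contained in another, leaving only the subsets with the maximum inclusion range. The following sketch follows that reading; the function names are hypothetical and the integer labels stand in for sensitive words as in the example above.

```python
def maximal_subsets(subsets):
    """Drop duplicates and every subset strictly contained in another one."""
    unique = {frozenset(s) for s in subsets}
    return [s for s in unique if not any(s < other for other in unique)]


def split_by_size(maximal):
    """First extended subsets hold one word, second extended subsets hold several."""
    first = [s for s in maximal if len(s) == 1]
    second = [s for s in maximal if len(s) > 1]
    return first, second


# First word extension subsets of second sensitive words 1, 2, 3 and 4.
extension_subsets = [{1, 2}, {1, 2, 3}, {2, 3}, {4}]
first, second = split_by_size(maximal_subsets(extension_subsets))
print([sorted(s) for s in first])
print([sorted(s) for s in second])
```

Comparing `len(first)` with `len(second)` then gives the first number and the second number used to decide which branch of the replacement procedure runs.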
In this embodiment, the usage frequency of a sensitive word is its search frequency as obtained from the sensitive word database. The higher the usage frequency, the more likely the sensitive word needs to be replaced, and so the earlier its label replacement time. For example, suppose there are two first extension subsets and the sensitive word in the first of them is used more frequently than the sensitive word in the second. The sensitive word in the first subset is then replaced first, and only after all of its replacements are complete is the sensitive word in the second subset replaced. Setting a priority replacement order in this way ensures, first, that replacement is carried out effectively and, second, that replacement efficiency is improved.
In this embodiment, the central sensitive word of a second extended subset may, for example, be extracted during the complete overlapping processing: if the first word extension subsets merge into 1, 2, 3, the corresponding central sensitive word may be 2. That is, the sensitive word with the largest overlap count is taken as the central sensitive word, and if several sensitive words tie for the largest overlap count, any one of them is selected as the central sensitive word.
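Selecting the central sensitive word by overlap count can be sketched as follows. Counting in how many first word extension subsets each word appears is an assumed reading of "overlapping number"; the function name and the tie-break (taking the smallest label, since any tied word may be chosen) are likewise hypothetical.

```python
from collections import Counter


def central_word(extension_subsets):
    """Return the word that appears in the largest number of subsets."""
    counts = Counter(w for s in extension_subsets for w in set(s))
    top = max(counts.values())
    # Any tied word would do; the smallest is taken only to stay deterministic.
    return min(w for w, c in counts.items() if c == top)


# Subsets that merge into the second extended subset {1, 2, 3}.
print(central_word([{1, 2}, {1, 2, 3}, {2, 3}]))
```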
In this embodiment, the extension meaning refers to the network meaning of the extension sensitive word, and a pinyin modification position-time set is the set of times and letter positions at which the pinyin was modified; together these give the extension degree. The more modification time points, the more modification positions, and the more sensitive the extension meaning, the deeper the corresponding extension degree, that is, the larger the value obtained when the extension degree is converted into the extension sequence under the standard conversion rule; the value generally lies in (0, 1).
In this embodiment, for example, the extension sequence of a second extension subset is: 0, 0.3, 0.2, 0.1, 0, 0, 0, 1, 1, 0.8. The sequence is divided by size consistency into sub-region 0, 0.3, 0.2, 0.1, 0, 0, 0, whose mode is 0, and sub-region 1, 1, 0.8, whose mode is 1. The replacement order can then be determined by setting the label replacement times according to the size of the corresponding modes.
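The region splitting can be sketched as below, reading the example sequence as 0, 0.3, 0.2, 0.1, 0, 0, 0, 1, 1, 0.8 (an assumed reconstruction). The rule for "size consistency" is not spelled out, so the sketch assumes that consecutive values on the same side of a hypothetical boundary of 0.5 belong to one sub-region, and that sub-regions with a larger mode receive the earlier replacement label.

```python
from statistics import mode


def split_regions(sequence, boundary=0.5):
    """Group consecutive values lying on the same side of the boundary."""
    regions = []
    for value in sequence:
        band = value >= boundary
        if regions and regions[-1][0] == band:
            regions[-1][1].append(value)
        else:
            regions.append((band, [value]))
    return [values for _, values in regions]


def order_regions(sequence, boundary=0.5):
    """Sort sub-regions so that the one with the largest mode is replaced first."""
    return sorted(split_regions(sequence, boundary), key=mode, reverse=True)


extension_sequence = [0, 0.3, 0.2, 0.1, 0, 0, 0, 1, 1, 0.8]
for region in split_regions(extension_sequence):
    print(region, "mode:", mode(region))
print(order_regions(extension_sequence))
```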
The beneficial effects of the above technical scheme are: the first extended subset and the second extended subset can be effectively obtained by determining the extended sub-list and complete overlapping processing, and pinyin sequential replacement can be effectively carried out by setting the replacement labels and determining the label replacement time attribute, wherein in the pinyin replacement process in the second extended subset, the extension degree of the extended sensitive words needs to be obtained, so that the second extended subset is subjected to region division by sequence determination, thereby ensuring the priority processing of the replacement words with large variation degree, ensuring the variant efficiency and improving the reliability of the subsequent predication of the near word variants.
It will be apparent to those skilled in the art that various changes and modifications may be made in the present invention without departing from the spirit and scope of the invention. Thus, if such modifications and variations of the present invention fall within the scope of the claims of the present invention and their equivalents, the present invention is also intended to include such modifications and variations.

Claims (10)

1. A method for expanding sound-proximity sensitive words, comprising:
step 1: combining the pinyins in the legal pinyin table in pairs;
step 2: acquiring the edit distance of each pairwise combination result, extracting near-pinyin groups according to the edit distance, and constructing a pinyin-near pinyin table;
step 3: replacing the pinyin of any character in the sensitive words in the sensitive word database based on the pinyin-near pinyin table, mapping the replaced pinyin into characters based on the pinyin-Chinese character table, and constructing candidate phonetic near words;
step 4: pre-judging the candidate phonetic near words to realize the supplementary expansion of the sensitive word database.
2. The method for expanding the phonetic close sensitive words according to claim 1, wherein the step of obtaining the edit distance of each pairwise combination result, extracting a phonetic close pinyin group according to the edit distance, and constructing a pinyin-phonetic close pinyin list comprises the steps of:
calculating the editing distance of the pairwise combination result;
judging whether the editing distance is greater than a preset distance;
if yes, the first pinyin and the second pinyin in the corresponding combination result are judged to be not near pinyin;
otherwise, judging that the first pinyin and the second pinyin in the corresponding combination result are phonetic near pinyin;
extracting all pinyin groups whose members are near pinyins of one another, and constructing the pinyin-near pinyin table.
3. The method for expanding the phonetic proximity sensitive words according to claim 1, wherein the candidate phonetic proximity words are pre-judged to realize the supplementary expansion of the sensitive word database, and the method comprises the following steps:
acquiring a judgment instruction issued manually for the candidate phonetic near word, and judging whether the judgment instruction is related to a preset sensitive condition;
when the judgment instruction is related to a preset sensitive condition, automatically supplementing the candidate sensitive words into a sensitive word database;
and when the judgment instruction is not related to a preset sensitive condition, discarding the candidate sensitive word.
4. The method for expanding sound-proximity sensitive words according to claim 1, wherein the legal pinyin table and the pinyin-Chinese character table contain only the pinyin spelling of each pronunciation and no tones.
5. The method of claim 1, wherein combining every two pinyins in the legal pinyin table comprises:
performing first classification according to the pinyin length of each pinyin;
performing second classification on the pinyin in each first classification result according to the type of the pinyin initial letter;
obtaining pinyin sets with different pinyin lengths based on the first classification result and the second classification result;
and combining the pinyin in the pinyin set in pairs.
6. The method for expanding the proximity sensitive word according to claim 3, wherein when the judgment instruction is related to a preset sensitive condition, automatically supplementing the candidate sensitive word into a sensitive word database, comprising:
capturing the judgment instruction issued manually for the candidate phonetic near word, and parsing the judgment instruction to obtain a plurality of sub-judgment conditions;
extracting the current judgment difference of each sub-judgment condition;
judging, based on the preset sensitive condition, whether the corresponding current judgment difference is within a preset difference range, and if so, judging that the corresponding sub-judgment condition is qualified;
and when the sub-judgment conditions are all qualified, determining that the judgment instruction is related to the preset sensitive condition.
7. The method for expanding the phonetic proximity sensitive words according to claim 1, wherein the process of pre-judging the candidate phonetic proximity words to realize the supplementary expansion of the sensitive word database further comprises:
tracing the sensitive word database, extracting initial historical sensitive words related to the candidate phonetic near words, constructing initial derivative chains of different initial historical sensitive words, and optimizing the initial derivative chains by combining the crawling results of the network sensitive words to obtain historical derivative chains;
judging whether the initial sensitive word in each history derivative chain contains alphabetic characters, regarding the history derivative chains whose initial sensitive word contains alphabetic characters as a first class, and, for the history derivative chains whose initial sensitive word contains no alphabetic characters, judging a second time whether the remaining history sensitive words contain alphabetic characters;
according to the result of the second judgment, extracting the history derivative chains whose remaining sensitive words contain no alphabetic characters and regarding them as a second class, and regarding the remaining history derivative chains as a third class;
determining an initial historical pinyin corresponding to the initial historical sensitive word in the historical derivative chain and a modified historical pinyin after each derivative, and obtaining historical pinyin tracing information;
counting historical search time sets and historical search frequency sets of different historical sensitive words in the same historical derivative chain, and constructing a first curve graph of the same historical derivative chain;
acquiring historical derivative total time length of the same historical derivative chain and derivative time length of each derivative stage according to the historical pinyin tracing information, and constructing to obtain a second graph;
pre-analyzing the first graph and the second graph based on a curve analysis model, and meanwhile, obtaining first sensitivity contribution degrees corresponding to different history sensitive words in the same history derivative chain and second sensitivity contribution degrees corresponding to the same history derivative chain by combining with the classification type of the same history derivative chain;
extracting first sensitive words with a first sensitivity contribution degree greater than a first preset sensitivity contribution degree from the same historical derivative chain, and sequencing all the first sensitive words in the same historical derivative chain according to the first sensitivity contribution degree;
when the second sensitivity contribution degree is greater than a second preset sensitivity degree, establishing a first mapping relation between the corresponding first sensitive word and the corresponding same historical derivative chain according to the sorting result;
otherwise, establishing a second mapping relation of the corresponding first sensitive words according to the sorting result;
capturing and analyzing all the first mapping relations and the second mapping relations based on a sensitive change capturing model, outputting to obtain a sensitive change rule, and constructing a sensitive vocabulary auxiliary change list based on the sensitive change rule;
determining a vocabulary adjustment type for the candidate phonetic near words based on the sensitive vocabulary auxiliary change list;
and determining the corresponding adjusted phonetic near words based on the vocabulary adjustment type.
8. The method for extending phonosensitive words according to claim 7, wherein obtaining a first sensitivity contribution degree corresponding to different history sensitive words in a same history derivative chain and a second sensitivity contribution degree corresponding to the same history derivative chain comprises:
calculating the first sensitivity contribution degree B_i(t, p) of the different history sensitive words in the corresponding same history derivative chain according to the following formula:
Figure FDA0003672731890000041
Where sum{t_i} represents the total search duration of the historical search time set of the i-th history sensitive word in the corresponding same history derivative chain; t_i represents the historical search time set of the corresponding i-th history sensitive word; sum{p_i} represents the total search frequency of the historical search frequency set of the i-th history sensitive word in the corresponding same history derivative chain; p_i represents the historical search frequency set of the corresponding i-th history sensitive word; g_i represents the counted search frequency of the i-th history sensitive word; T_i represents the derivative duration corresponding to the i-th history sensitive word in the corresponding same history derivative chain;
Figure FDA0003672731890000042
representing the historical derivative total duration of the corresponding same history derivative chain;
Figure FDA0003672731890000043
representing the counted total search frequency of all history sensitive words in the corresponding history derivative chain;
calculating a second sensitivity contribution degree Y of the corresponding same historical derivative chain according to the following formula;
Figure FDA0003672731890000044
wherein,
Figure FDA0003672731890000048
representing the classification type of the corresponding same history derivative chain, taking the value 0.9 when the classification type is the first class, 0.8 when it is the second class, and 1 when it is the third class; n represents the total number of history sensitive words contained in the corresponding same history derivative chain; A_i represents the historical modification distance of the i-th history sensitive word in the corresponding same history derivative chain; B_i(t, p) represents the first sensitivity contribution degree obtained by the i-th history sensitive word in the corresponding same history derivative chain based on the historical search time set t and the historical search frequency set p;
Figure FDA0003672731890000045
representing the total sensitivity contribution degree of all history sensitive words in the corresponding same history derivative chain based on the history search time set t and the history search frequency set p;
Figure FDA0003672731890000046
representing a time derivative factor corresponding to the ith history sensitive word in the same history derivative chain determined based on the second graph;
Figure FDA0003672731890000047
representing a sensitivity contribution factor corresponding to the i-th history sensitive word in the same history derivative chain determined based on the first graph;
and extracting the first sensitive words based on all the first sensitive contribution degrees and the second sensitive contribution degrees.
9. The method as claimed in claim 6, wherein the current judgment difference is a difference between the actual judgment result of the human being on the same sub-judgment condition and the preset judgment criterion of the sub-judgment condition.
10. The method for extending phonetic-near sensitive words according to claim 1, wherein the process of replacing any pinyin of a sensitive word in the sensitive word database based on the pinyin-phonetic-near pinyin table comprises:
determining a first word extension subset of each second sensitive word in the sensitive word database, and constructing a first extension sublist of each first word extension subset;
performing complete-overlap processing on the sensitive words of each first extension sublist, and acquiring a plurality of second word extension subsets with the maximum inclusion range, wherein the second word extension subsets comprise: a first extended subset consisting of only one single sensitive word and a second extended subset consisting of a plurality of sensitive words;
determining a first number of the first extended subset and a second number of the second extended subset and comparing the first number with the second number;
if the first number is larger than or equal to the second number, preferentially setting a first replacement label for the first extension subset according to the use frequency of the sensitive words in the first extension subset, and replacing any character pinyin in the sensitive words in the corresponding first extension subset in sequence according to the label replacement time attribute of the first replacement label;
otherwise, preferentially determining a central sensitive word in the second extension subset, and determining the extension meaning of each extension sensitive word extended based on the central sensitive word, and a first pinyin modification position-time set and a second pinyin modification position-time set of each extension sensitive word based on the central sensitive word and adjacent sensitive words;
acquiring the extension degree of the corresponding extension sensitive word based on the extension meaning, the first pinyin modification position-time set and the second pinyin modification position-time set;
converting the extension degree of each extension sensitive word in the corresponding second extension subset into a corresponding extension sequence according to a standard conversion rule, and obtaining the extension sequence corresponding to the second extension subset according to the sequence of the extension sensitive words;
according to the size consistency of each sequence in the extension sequences, performing region splitting on the extension sequences to obtain a plurality of sub-regions, and respectively performing setting of second replacement labels on the corresponding sub-regions according to the mode sequence of each sub-region;
and sequentially replacing any character pinyin of the extension sensitive words in different sub-areas in the same second extension subset according to the label replacement time attribute of the second replacement label.
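The replacement step and the first branch decision of claim 10 can be illustrated with a minimal Python sketch. This is not the claimed implementation: the table `NEAR_SOUND`, the functions `expand_pinyin` and `choose_branch`, and the branch labels are hypothetical names introduced only for illustration, and the label ordering, extension degrees, and sub-region splitting described in the claim are not modeled.

```python
from itertools import product

# Hypothetical near-sound table (an assumption, not taken from the patent):
# each pinyin syllable maps to syllables treated as sounding similar.
NEAR_SOUND = {
    "zhi": ["zi"],
    "shi": ["si"],
    "lan": ["nan"],
}


def expand_pinyin(syllables, table=NEAR_SOUND):
    """Return all variants made by swapping any syllables for near-sound ones.

    The unchanged original sequence is excluded from the result.
    """
    original = tuple(syllables)
    # For each position, the candidates are the syllable itself plus its neighbours.
    options = [[s] + table.get(s, []) for s in syllables]
    return [combo for combo in product(*options) if combo != original]


def choose_branch(subsets):
    """Compare the number of single-word subsets with the number of multi-word subsets.

    Mirrors only the comparison in the claim: if the first number is greater
    than or equal to the second, the single-word subsets are handled first.
    """
    first_number = sum(1 for s in subsets if len(s) == 1)
    second_number = sum(1 for s in subsets if len(s) > 1)
    if first_number >= second_number:
        return "single-word-first"
    return "center-word-first"
```

Under these assumptions, `expand_pinyin(["zhi", "shi"])` yields the three variants in which at least one syllable has been swapped, and `choose_branch` returns a label that a caller could use to select the replacement order.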
CN202210613700.2A 2022-05-31 2022-05-31 Expansion method of sound-proximity sensitive words Pending CN114969294A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210613700.2A CN114969294A (en) 2022-05-31 2022-05-31 Expansion method of sound-proximity sensitive words

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210613700.2A CN114969294A (en) 2022-05-31 2022-05-31 Expansion method of sound-proximity sensitive words

Publications (1)

Publication Number Publication Date
CN114969294A (en) 2022-08-30

Family

ID=82957411

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210613700.2A Pending CN114969294A (en) 2022-05-31 2022-05-31 Expansion method of sound-proximity sensitive words

Country Status (1)

Country Link
CN (1) CN114969294A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115659078A (en) * 2022-10-12 2023-01-31 湖北盈隆腾辉科技有限公司 Network information security monitoring method and system based on artificial intelligence
CN117892724A (en) * 2024-03-15 2024-04-16 成都赛力斯科技有限公司 Text detection method, device, equipment and storage medium

Similar Documents

Publication Publication Date Title
CN109977416B (en) Multi-level natural language anti-spam text method and system
WO2021114745A1 (en) Named entity recognition method employing affix perception for use in social media
CN109670041A (en) Noisy illegal short text recognition method based on a dual-channel text convolutional neural network
US8239349B2 (en) Extracting data
CN107688630B (en) Semantic-based weakly supervised microblog multi-emotion dictionary expansion method
CN114036933B (en) Information extraction method based on legal documents
CN113591483A (en) Document-level event argument extraction method based on sequence labeling
CN114969294A (en) Expansion method of sound-proximity sensitive words
CN112818694A (en) Named entity recognition method based on rules and improved pre-training model
CN101520802A (en) Question-answer pair quality evaluation method and system
CN109635288A (en) A kind of resume abstracting method based on deep neural network
CN108509409A (en) A method of automatically generating semantic similarity sentence sample
CN111460092A (en) Multi-document-based automatic complex problem solving method
CN112926345B (en) Multi-feature fusion neural machine translation error detection method based on data enhancement training
CN115599902B (en) Oil-gas encyclopedia question-answering method and system based on knowledge graph
CN111625621A (en) Document retrieval method and device, electronic equipment and storage medium
CN110910175A (en) Tourist ticket product portrait generation method
Bedrick et al. Robust kaomoji detection in Twitter
CN111626041A (en) Music comment generation method based on deep learning
CN114756675A (en) Text classification method, related equipment and readable storage medium
CN113094502A (en) Multi-granularity takeaway user comment sentiment analysis method
CN110413972B (en) Intelligent table name field name complementing method based on NLP technology
CN107797986A (en) A kind of mixing language material segmenting method based on LSTM CNN
CN113220964B (en) Viewpoint mining method based on short text in network message field
CN107943783A (en) A kind of segmenting method based on LSTM CNN

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination