CN109190014B - Regular expression generation method and device and electronic equipment - Google Patents


Info

Publication number
CN109190014B
Authority
CN
China
Prior art keywords
search
search word
similar
bad
word
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810695221.3A
Other languages
Chinese (zh)
Other versions
CN109190014A (en)
Inventor
黄腾玉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing QIYI Century Science and Technology Co Ltd
Original Assignee
Beijing QIYI Century Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing QIYI Century Science and Technology Co Ltd
Priority to CN201810695221.3A
Publication of CN109190014A
Application granted
Publication of CN109190014B
Legal status: Active
Anticipated expiration

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

An embodiment of the invention provides a regular expression generation method, a regular expression generation device, and an electronic device. The scheme comprises: obtaining known bad search terms; obtaining, based on a search click bipartite graph, search terms that are connected to the same click files as the known bad search terms, and taking them as similar search terms; extracting regular segments from the similar search terms to obtain regular expressions serving as candidate regular expressions; matching each similar search term against the candidate regular expressions; and selecting a preset number of candidate regular expressions as the regular expressions for filtering search terms, based on how the candidate regular expressions hit the similar search terms that participated in matching. With this scheme, the existing regular expressions can be continuously updated using the candidate regular expressions generated from similar search terms.

Description

Regular expression generation method and device and electronic equipment
Technical Field
The invention relates to the technical field of network information retrieval, in particular to a regular expression generation method and device and electronic equipment.
Background
As network resources grow ever richer, search engines actively recommend search terms to users during retrieval in order to offer them more information. These may be query suggestions, default search terms, and the like. For example, when a user types "quotation marks" into a search engine, the engine may by default show related recommendations such as "function of quotation marks" and "usage of quotation marks"; such terms can be called recommended search terms. Recommended search terms make input easier and surface content of interest to the user. However, as networks become more popular and the user base widens, some of the search terms currently recommended by search engines are not suitable for recommendation. For example, search terms related to pornography or violence can adversely affect minors; search terms that may adversely affect users can be called bad search terms. A search engine can filter its recommended search terms in order to block bad information.
However, because content on the network changes quickly, variants of a bad search term tend to appear after the term has been filtered. These variants can also retrieve bad information, so the search engine must continuously filter bad search terms and their variants.
One common approach at present is the regular expression method: a set of regular expressions is maintained manually, search terms matched by those regular expressions are identified as bad search terms, and the bad search terms are filtered out, thereby blocking bad information.
In the prior art, maintaining the regular expressions depends mainly on discovering bad cases: regular segments are extracted from the discovered bad cases to obtain regular expressions, which are then used to update the existing ones. However, bad cases are discovered at irregular times and in irregular numbers, so the existing regular expressions cannot be updated continuously.
Disclosure of Invention
The embodiment of the invention aims to provide a regular expression generation method, a regular expression generation device and electronic equipment, which are used for continuously updating an existing regular expression according to similar search terms. The specific technical scheme is as follows:
the embodiment of the invention provides a regular expression generation method, which comprises the following steps:
obtaining known bad search terms;
acquiring, based on a search click bipartite graph, search terms connected to the same click files as the known bad search terms, and taking them as similar search terms, wherein the search click bipartite graph represents the connection relation between search terms and the click files that users clicked in the corresponding search results;
extracting regular segments from the similar search terms to obtain a regular expression as a candidate regular expression;
matching each similar search term respectively by using the candidate regular expressions;
for each similar search term participating in matching, determining whether the similar search term is hit, based on the matching result and on a determination, made by other means, of whether the similar search term is a bad search term;
and selecting a preset number of candidate regular expressions as regular expressions for filtering the search terms based on the hit condition of the candidate regular expressions on each similar search term participating in matching.
Further, acquiring, based on the search click bipartite graph, search terms connected to the same click files as the known bad search terms as similar search terms includes:
for each known bad search term, acquiring from the search click bipartite graph each search term that is connected to the same click file as the known bad search term and has not been determined to be a bad search term, as a search term to be screened;
calculating the similarity between the search word to be screened and the known bad search word aiming at each search word to be screened;
and selecting the search words to be screened with the similarity greater than a first preset threshold value as similar search words according to the similarity.
Further, acquiring, based on the search click bipartite graph, search terms connected to the same click files as the known bad search terms as similar search terms includes:
in the search click bipartite graph, examining the weight of each known bad search term, and selecting the known bad search terms whose weight is greater than a second preset threshold as specific bad search terms;
based on the search click bipartite graph, for each specific bad search term, acquiring each search term that is connected to the same click file and has not been determined to be a bad search term, as a search term to be screened;
and, for each search term to be screened, acquiring each click file connected to both the specific bad search term and the search term to be screened, examining the weight of the edge connecting the search term to be screened to that click file, and selecting the search terms to be screened whose edge weight is greater than a third preset threshold as similar search terms.
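The weight-based screening just described can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name, the dictionary layouts, and the sample numbers are all hypothetical, and the sketch assumes that a single shared click file with a sufficiently heavy edge is enough to qualify a term (the text does not say whether one or all shared files must pass the threshold).

```python
def select_similar_by_weight(term_weight, edge_weight, known_bad,
                             min_term_weight, min_edge_weight):
    """Return non-bad terms sharing a click file with a heavily searched bad term.

    term_weight: dict mapping term -> number of searches (hypothetical layout)
    edge_weight: dict mapping (term, click_file) -> number of clicks
    """
    bad = set(known_bad)
    # Derive the click files of every term from the edge list.
    docs_of = {}
    for term, doc in edge_weight:
        docs_of.setdefault(term, set()).add(doc)

    # "Specific" bad terms: weight above the second preset threshold.
    specific_bad = {t for t in bad if term_weight.get(t, 0) > min_term_weight}

    similar = set()
    for bad_term in specific_bad:
        bad_docs = docs_of.get(bad_term, set())
        for term, docs in docs_of.items():
            if term in bad:
                continue  # already determined to be a bad search term
            shared = docs & bad_docs
            # Assumption: one shared file above the third threshold suffices.
            if any(edge_weight[(term, d)] > min_edge_weight for d in shared):
                similar.add(term)
    return similar


# Hypothetical sample data.
term_weight = {"bad1": 500, "bad2": 10, "q1": 300, "q2": 40, "q3": 90}
edge_weight = {
    ("bad1", "docA"): 120,
    ("bad2", "docB"): 5,
    ("q1", "docA"): 80,
    ("q2", "docA"): 3,
    ("q3", "docB"): 60,
}
result = select_similar_by_weight(term_weight, edge_weight,
                                  ["bad1", "bad2"], 100, 50)
```

Here only "bad1" passes the term-weight threshold, so "q3" is never considered even though its edge to "docB" is heavy.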
Further, before the using the candidate regular expression and respectively matching each of the similar search terms, the method further includes:
calculating the bad degree of the candidate regular expression based on the search click bipartite graph, wherein the bad degree represents the relevance between the candidate regular expression and the known bad search word;
the matching of each similar search term by using the candidate regular expression respectively comprises:
and respectively matching each similar search term by using the candidate regular expressions with the bad degrees meeting the preset conditions.
Further, calculating the degree of badness of the candidate regular expression based on the search click bipartite graph, including:
for each candidate regular expression, calculating the relevance between the candidate regular expression and the known poor search terms according to the similarity between each similar search term corresponding to the candidate regular expression and the known poor search terms;
for the i-th candidate regular expression, its degree of badness Z_i is expressed as:
Figure BDA0001713432000000031
where n represents the number of similar search terms from which the candidate regular expression can be generated, m represents the number of known bad search terms connected to the same click file as those similar search terms, S_ij represents the degree of similarity between the i-th similar search term and the j-th known bad search term connected to the same click file, and C_j represents the weight of the j-th known bad search term;
wherein the degree of similarity S_ij between the i-th similar search term and the j-th known bad search term connected to the same click file is expressed as:
Figure BDA0001713432000000032
where p represents the number of click files connected to both the similar search term and the known bad search term, W_ik represents the weight of the edge connecting the similar search term to the k-th click file, W_i represents the sum of the weights of the edges connecting the similar search term to all of its click files, W_jk represents the weight of the edge connecting the known bad search term to the k-th click file, and W_j represents the weight of the edges connecting the known bad search term to all of its click files.
Further, for each similar search term participating in matching, determining whether the similar search term hits based on the matching result and the determination result of whether the similar search term is determined to be a bad search term in other manners, includes:
if the matching result of the similar search word is matched and the similar search word is judged to be a bad search word by other modes, determining that the similar search word is hit, otherwise, determining that the similar search word is not hit;
or, if the matching result of the similar search word is not matched and the similar search word is judged to be a non-bad search word by other methods, determining that the similar search word is hit, otherwise, determining that the similar search word is not hit.
Further, for each similar search term participating in matching, determining whether the similar search term hits based on the matching result and the determination result of whether the similar search term is determined to be a bad search term in other manners, includes:
and if the matching result of the similar search word is matched and the similar search word is judged to be a bad search word by other methods, determining that the similar search word is hit, and if the matching result of the similar search word is matched and the similar search word is judged to be a non-bad search word by other methods, determining that the similar search word is not hit.
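The two hit rules above can be sketched as small predicates. This is one reading of the text, with hypothetical function names: the first rule combines the two alternatives (matched and judged bad, or unmatched and judged non-bad) into a single agreement test, and the second rule only scores terms that the candidate actually matched, returning `None` for unmatched terms because the text leaves that case unspecified.

```python
from typing import Optional


def is_hit_agreement(matched: bool, judged_bad: bool) -> bool:
    """Hit when the regex result agrees with the independent judgement."""
    return matched == judged_bad


def is_hit_matched_only(matched: bool, judged_bad: bool) -> Optional[bool]:
    """Score only matched terms: bad -> hit, non-bad -> miss.

    Unmatched terms are not covered by this rule, so None is returned
    (an assumption made for this sketch).
    """
    if not matched:
        return None
    return judged_bad
```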
The embodiment of the invention provides a regular expression generating device, which comprises:
the bad search word acquisition module is used for acquiring known bad search words;
the similar search term acquisition module is used for acquiring, based on a search click bipartite graph, search terms connected to the same click files as the known bad search terms, and taking them as similar search terms, wherein the search click bipartite graph represents the connection relation between search terms and the click files that users clicked in the corresponding search results;
the regular expression generation module is used for extracting regular segments from the similar search words to obtain a regular expression which is used as a candidate regular expression;
the matching module is used for respectively matching each similar search term by using the candidate regular expression;
the hit condition determining module is used for determining whether the similar search words are hit or not according to the matching result and the judgment result of whether the similar search words are judged to be bad search words or not in other modes aiming at each similar search word participating in matching;
and the regular expression selection module is used for selecting a preset number of candidate regular expressions as regular expressions for filtering the search terms based on the hit conditions of the candidate regular expressions on the similar search terms participating in matching.
Further, the similar search term obtaining module includes:
the to-be-screened search term acquisition submodule is used for acquiring, for each known bad search term, each search term in the search click bipartite graph that is connected to the same click file as the known bad search term and has not been determined to be a bad search term, as a search term to be screened;
the similarity calculation submodule is used for calculating, for each search term to be screened, the similarity between that search term and the known bad search term;
and the similar search word selecting submodule is used for selecting the search word to be screened with the similarity larger than a first preset threshold value as the similar search word according to the size of the similarity.
Further, the similar search term obtaining module includes:
the specific bad search word acquisition sub-module is used for judging the weight of each known bad search word in the search click bipartite graph, and selecting the known bad search word with the weight larger than a second preset threshold value as the specific bad search word;
the search word to be screened acquiring submodule is used for acquiring each search word which is connected with the same click file and is not determined as a bad search word aiming at each specific bad search word based on the search click bipartite graph, and the search word to be screened is used as the search word to be screened;
and the similar search term selection submodule is used for acquiring, for each search term to be screened, each click file connected to both the specific bad search term and the search term to be screened, examining the weight of the edge connecting the search term to be screened to that click file, and selecting the search terms to be screened whose edge weight is greater than a third preset threshold as similar search terms.
Further, the above apparatus further includes:
the bad degree calculation module is used for calculating the bad degree of the candidate regular expression based on the search click bipartite graph, wherein the bad degree represents the relevance between the candidate regular expression and the known bad search word;
the matching module is specifically configured to match each similar search term respectively by using the candidate regular expressions whose degrees of badness satisfy preset conditions.
Further, the bad degree calculating module is specifically configured to calculate, for each candidate regular expression, a correlation between the candidate regular expression and the known bad search term according to a degree of similarity between each similar search term corresponding to the generated candidate regular expression and the known bad search term;
for the i-th candidate regular expression, its degree of badness Z_i is expressed as:
Figure BDA0001713432000000061
where n represents the number of similar search terms from which the candidate regular expression can be generated, m represents the number of known bad search terms connected to the same click file as those similar search terms, S_ij represents the degree of similarity between the i-th similar search term and the j-th known bad search term connected to the same click file, and C_j represents the weight of the j-th known bad search term;
wherein the degree of similarity S_ij between the i-th similar search term and the j-th known bad search term connected to the same click file is expressed as:
Figure BDA0001713432000000062
where p represents the number of click files connected to both the similar search term and the known bad search term, W_ik represents the weight of the edge connecting the similar search term to the k-th click file, W_i represents the sum of the weights of the edges connecting the similar search term to all of its click files, W_jk represents the weight of the edge connecting the known bad search term to the k-th click file, and W_j represents the weight of the edges connecting the known bad search term to all of its click files.
Further, the hit determination module includes:
a first hit condition determining sub-module, configured to determine that the similar search word is hit if a matching result of the similar search word is a match and the similar search word is determined to be a bad search word by other methods, and otherwise, determine that the similar search word is not hit;
and the second hit condition determining submodule is used for determining that the similar search word is hit if the matching result of the similar search word is not matched and the similar search word is judged to be a non-poor search word by other modes, or else, determining that the similar search word is not hit.
Further, the hit condition determining module is specifically configured to determine that the similar search word is hit if the matching result of the similar search word is a match and the similar search word is determined to be a bad search word by using another method, and determine that the similar search word is not hit if the matching result of the similar search word is a match and the similar search word is determined to be a non-bad search word by using another method.
An embodiment of the invention provides an electronic device comprising a processor, a communication interface, a memory, and a communication bus, wherein the processor, the communication interface, and the memory communicate with one another through the communication bus;
a memory for storing a computer program;
and the processor is used for realizing the steps of any regular expression generation method when executing the program stored in the memory.
The embodiment of the invention also provides a computer readable storage medium, wherein a computer program is stored in the computer readable storage medium, and when the computer program is executed by a processor, the steps of any regular expression generation method are realized.
Embodiments of the present invention further provide a computer program product containing instructions, which when run on a computer, causes the computer to execute any of the regular expression generation methods described above.
The regular expression generation method, device, and electronic device provided by the embodiments of the present invention obtain known bad search terms; obtain, based on a search click bipartite graph, search terms connected to the same click files as the known bad search terms, as similar search terms; extract regular segments from the similar search terms to obtain regular expressions as candidate regular expressions; match each similar search term against the candidate regular expressions; and select a preset number of the candidate regular expressions as the regular expressions for filtering search terms, based on how the candidate regular expressions hit the similar search terms that participated in matching. In this way, the existing regular expressions can be continuously updated according to the similar search terms.
Of course, not all of the advantages described above need to be achieved at the same time in the practice of any one product or method of the invention.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below.
Fig. 1 is a flowchart of a regular expression generation method according to an embodiment of the present invention;
FIG. 2 is a schematic structural diagram of a search click bipartite graph according to an embodiment of the present invention;
fig. 3 is a schematic structural diagram of a regular expression generation apparatus according to an embodiment of the present invention;
fig. 4 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be described below with reference to the drawings in the embodiments of the present invention.
In the scheme provided by the embodiment of the invention, known bad search words are obtained, search words of a same click file searched by the known bad search words are obtained based on a search click bipartite graph and used as similar search words, regular segments are extracted from the similar search words to obtain regular expressions which are used as candidate regular expressions, the candidate regular expressions are used for respectively matching the similar search words, and a preset number of candidate regular expressions are selected as the regular expressions for filtering the search words based on the hit condition of the candidate regular expressions on the similar search words participating in matching. Therefore, the existing regular expressions are continuously updated according to the candidate regular expressions generated by the similar search words.
An embodiment of the present invention provides a regular expression generation method, as shown in fig. 1, which may include the following steps:
step S101, known bad search terms are obtained.
In this step, an existing regular expression set in the search engine library and a search word set searched by the user may be obtained, and for each search word, the search word is matched with each regular expression in the existing regular expression set, and a search word that can be matched with the existing regular expression is obtained as a known bad search word.
For example, suppose an existing search term is "how to expire the windows license" and an existing regular expression is "expire". The regular segments extracted from the search term may include "windows", "license_expired", and "windows_expired". When the regular expression is matched against the search term, the search term contains the pattern "expire" defined by the regular expression, so the search term and the regular expression are considered to match.
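Step S101 can be sketched with Python's standard `re` module. This is a minimal illustration under stated assumptions: the function name `find_known_bad_terms`, the sample search terms, and the pattern `expire` are hypothetical, and substring search (`re.search`) is assumed as the matching mode.

```python
import re


def find_known_bad_terms(search_terms, existing_patterns):
    """Return the search terms matched by at least one existing regular expression."""
    compiled = [re.compile(p) for p in existing_patterns]
    return [t for t in search_terms if any(rx.search(t) for rx in compiled)]


# Hypothetical user search log and existing regular expression set.
search_terms = ["how to expire the windows license", "weather tomorrow"]
known_bad = find_known_bad_terms(search_terms, ["expire"])
```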
And step S102, acquiring search terms of the same click file searched by the known bad search terms based on the search click bipartite graph, and taking the search terms as similar search terms.
In this step, for each known bad search term, according to the connection relationship between the search term and the clicked file in the search click bipartite graph, each search term which is connected with the same clicked file as the known bad search term and is not determined as the bad search term can be selected as a similar search term.
Specifically, the search click bipartite graph represents the connection relation between the search terms that users enter in a search engine and the click files that the users select from the corresponding search results. In the search click bipartite graph, each search term may be connected to one or more click files and, likewise, each click file may be connected to one or more search terms. For each known bad search term, one or more search terms connected to the same click file can therefore be found in the search click bipartite graph; those that are themselves bad search terms are removed, and the remaining search terms are taken as similar search terms.
Further, the description will be given by taking fig. 2 as an example, in which query represents a search word and doc represents a click file. Taking query5 as an example, if the bad search term only has query5, as can be seen from the connection relationship in the figure, the click file connected with query5 only has doc3, and the search terms connected with doc3 have query1, query4 and query5, then query1 and query4 may be similar search terms corresponding to query 5.
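The lookup in step S102 can be sketched with two adjacency maps, one per side of the bipartite graph. This is an illustrative sketch only: the function names are hypothetical, and the text specifies only the edges touching doc3 (query1, query4, query5), so every other edge in the sample click log below is an assumption.

```python
from collections import defaultdict


def build_graph(click_log):
    """Build both adjacency maps of the search click bipartite graph."""
    term_to_docs = defaultdict(set)
    doc_to_terms = defaultdict(set)
    for term, doc in click_log:
        term_to_docs[term].add(doc)
        doc_to_terms[doc].add(term)
    return term_to_docs, doc_to_terms


def similar_terms(known_bad, term_to_docs, doc_to_terms):
    """Terms sharing a click file with a known bad term, minus the bad terms."""
    bad = set(known_bad)
    found = set()
    for term in bad:
        for doc in term_to_docs.get(term, ()):
            found |= doc_to_terms[doc]
    return found - bad


# Edges to doc3 follow the example in the text; the rest are hypothetical.
click_log = [
    ("query1", "doc1"), ("query1", "doc3"),
    ("query2", "doc1"),
    ("query3", "doc2"),
    ("query4", "doc3"),
    ("query5", "doc3"),
]
term_to_docs, doc_to_terms = build_graph(click_log)
similar = similar_terms(["query5"], term_to_docs, doc_to_terms)
```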
Step S103, extracting regular segments from the similar search words to obtain regular expressions as candidate regular expressions.
In this step, for each similar search term, one or more regular segments may be extracted for the similar search term according to the candidate pattern; and obtaining a regular expression correspondingly aiming at each regular segment, and obtaining a set of candidate regular expressions according to the set of similar search terms.
For example, suppose the similar search term is "influence of biodiversity on environmental bearing capacity". To extract as many regular segments as possible, regular patterns using priority quantifiers may be specified as candidate patterns. Take the candidate patterns "*C1C2*" and "*C1_C2C3" as examples, where C1, C2, and C3 represent different words, and "*" and "_" may stand for any characters, which may be a single character, a word, or nothing at all. Regular segments are extracted from the similar search term according to the candidate patterns: when the candidate pattern is "*C1C2*", the corresponding regular segments may be "creatures", "diversity_impact", and "environment_impact"; when the candidate pattern is "*C1_C2C3", the corresponding regular segments may be "diversity", "bearing capacity", and "impact on". Each regular segment can in turn generate one candidate regular expression.
Since the structure of the similar search term is generally simple, the candidate pattern may be a regular pattern having a simple structure extracted from a regular pattern library of a search engine according to the structure of the similar search term, or may be a regular pattern having a simple structure set according to the structure of the similar search term.
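One way to picture step S103 is sketched below. The patent's exact candidate patterns are not fully recoverable from the text, so this sketch substitutes two simple stand-in patterns: a single word, and an ordered pair of words separated by an arbitrary gap. The function name, the tokenization into words, and both stand-in patterns are assumptions made for illustration.

```python
import re
from itertools import combinations


def candidate_regexes(tokens):
    """Generate candidate regular expressions from the words of one similar term.

    Stand-in patterns (assumptions): every single word, and every ordered
    pair of words joined by ".*" so that anything may appear between them.
    """
    candidates = set()
    for token in tokens:
        candidates.add(re.escape(token))
    # combinations() keeps the words in their original left-to-right order.
    for first, second in combinations(tokens, 2):
        candidates.add(re.escape(first) + ".*" + re.escape(second))
    return candidates


# Hypothetical pre-tokenized similar search term.
candidates = candidate_regexes(["windows", "license", "expired"])
```

Escaping each word with `re.escape` keeps punctuation inside a search term from being read as regex syntax.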
And step S104, matching each similar search word by using the candidate regular expressions.
In this step, for each candidate regular expression, the candidate regular expression is used to match each similar search term, for each similar search term, one or more regular segments are extracted from the similar search term, and if the characters in the regular segments match the patterns defined by the candidate regular expression, the characters and the patterns match each other.
Step S105, for each similar search word participating in matching, determining whether the similar search word is hit based on the matching result and the determination result of whether the similar search word is determined to be a bad search word in another manner.
In this step, for each similar search word participating in matching, a matching result between the similar search word and the candidate expression is obtained, and a determination result whether the similar search word is determined to be a bad search word in another manner is obtained.
Specifically, the above-mentioned determining as the bad search term in another way may be determining the similar search terms manually, and determining the similar search terms capable of searching the bad information as the bad search terms.
Step S106, selecting a preset number of candidate regular expressions as regular expressions for filtering search terms, based on the hit condition of the candidate regular expressions on each similar search term participating in matching.
In this step, according to the hit condition of the candidate regular expressions on each similar search term participating in matching, a preset number of candidate regular expressions can be selected and updated to the existing regular expression set for filtering the search terms, other candidate regular expressions can be added to the blacklist, and the candidate regular expressions in the blacklist cannot be used as the candidate regular expressions again within a certain time.
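The time-limited blacklist mentioned above could be kept as a map from rejected pattern to expiry time. This is a hypothetical sketch: the class name, the `ttl_seconds` parameter, and the injectable `clock` are assumptions, since the text says only that blacklisted candidates cannot be reused "within a certain time".

```python
import time


class CandidateBlacklist:
    """Rejected candidate regular expressions, each blocked for a fixed period."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock          # injectable so the expiry can be tested
        self._expires_at = {}

    def add(self, pattern):
        """Block a pattern from now until the period has elapsed."""
        self._expires_at[pattern] = self._clock() + self._ttl

    def is_blocked(self, pattern):
        """True while the pattern is still inside its blocking period."""
        expires = self._expires_at.get(pattern)
        if expires is None:
            return False
        if self._clock() >= expires:
            del self._expires_at[pattern]  # period over, forget the entry
            return False
        return True
```

A candidate generator would call `is_blocked` before proposing a pattern again.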
Specifically, whether the candidate regular expression is selected as the regular expression for filtering the search terms is determined based on the hit condition of the candidate regular expression on each similar search term participating in matching, the number of hits of the candidate regular expression on the similar search terms can be counted, and a preset number of candidate regular expressions are selected as the regular expressions for filtering the search terms according to the number of hits.
In the embodiment of the present invention, the hit rate of each candidate regular expression over the similar search terms may also be calculated, and a preset number of candidate regular expressions selected as the regular expressions for filtering search terms according to the hit rate, where the hit rate of a candidate regular expression is the ratio of the number of similar search terms it hits to the total number of similar search terms. For example, suppose there are two candidate regular expressions, one of which hits 50 similar search terms while the other hits only 2. The candidate regular expression that hits 50 similar search terms can be selected as the regular expression for filtering search terms; alternatively, the hit rates of the two candidate regular expressions can be calculated, and any candidate regular expression whose hit rate exceeds 70% selected as a regular expression for filtering search terms.
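Both selection criteria, hit count and hit rate, can be sketched as follows. The function names and sample data are hypothetical, and the sketch assumes the hit rule in which a similar search term counts as hit only when the candidate matches it and it has independently been judged bad.

```python
import re


def hit_statistics(candidates, judged_bad):
    """Map each candidate pattern to (hit count, hit rate).

    judged_bad: dict mapping similar search term -> True if judged bad
    by other means (hypothetical layout).
    """
    total = len(judged_bad)
    stats = {}
    for pattern in candidates:
        rx = re.compile(pattern)
        hits = sum(1 for term, bad in judged_bad.items()
                   if bad and rx.search(term))
        stats[pattern] = (hits, hits / total if total else 0.0)
    return stats


def top_by_hit_count(stats, preset_number):
    """Keep the preset number of candidates with the most hits."""
    ranked = sorted(stats, key=lambda p: stats[p][0], reverse=True)
    return ranked[:preset_number]


def above_hit_rate(stats, min_rate):
    """Keep candidates whose hit rate exceeds the given ratio."""
    return [p for p, (_, rate) in stats.items() if rate > min_rate]


# Hypothetical similar search terms with their independent judgements.
judged_bad = {"alpha one": True, "alpha two": True,
              "alpha three": True, "beta four": False}
stats = hit_statistics(["alpha", "four"], judged_bad)
```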
In this way, regular segments are extracted from the similar search terms in the preceding steps to obtain candidate regular expressions, and the existing regular expressions can be selectively updated with those candidates.
In an embodiment of the regular expression generation method, step S102, in which search terms connected to the same click file as a known bad search term are obtained from the search click bipartite graph and used as similar search terms, may be carried out as follows:
aiming at each known bad search word, acquiring each search word which is connected with the same click file with the known bad search word and is not determined as a bad search word from a search click bipartite graph as a search word to be screened;
calculating the similarity between the search word to be screened and the known bad search word aiming at each search word to be screened;
and selecting the search words to be screened with the similarity greater than a first preset threshold value as similar search words according to the similarity.
Specifically, in the search click bipartite graph, each search word has a word weight, and each edge connecting a search word and a click file has an edge weight. The weight of a word may be the number of times users searched for the word, and the weight of an edge may be the number of times users clicked the click file when searching for the corresponding search word. For example, if the click files of a search word are only a first document and a second document, the word was searched 200 times, the first document was clicked 150 times, and the second document was clicked 80 times, then the word weight of the search word may be 200, the weight of the edge connecting the word to the first document may be 150, and the weight of the edge connecting the word to the second document may be 80.
Based on the connection relationships between search words and click files in the search click bipartite graph, for each known bad search word, every search word that is connected to the same click file as the known bad search word and has not been determined to be a bad search word is acquired as a search word to be screened.
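A minimal sketch of such a weighted bipartite graph, and of collecting the search words to be screened for one known bad search word, is given below. The class and method names and the sample terms other than the 200/150/80 figures are assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, Set


class SearchClickGraph:
    """Bipartite graph between search words and the click files users clicked."""

    def __init__(self) -> None:
        self.word_weight: Dict[str, int] = {}                      # word -> number of searches
        self.edges: Dict[str, Dict[str, int]] = defaultdict(dict)  # word -> {file: clicks}
        self.file_words: Dict[str, Set[str]] = defaultdict(set)    # file -> connected words

    def add_word(self, word: str, searches: int) -> None:
        self.word_weight[word] = searches

    def add_click(self, word: str, click_file: str, clicks: int) -> None:
        self.edges[word][click_file] = clicks
        self.file_words[click_file].add(word)

    def words_to_screen(self, bad_word: str, known_bad: Set[str]) -> Set[str]:
        """Words sharing a click file with `bad_word` that are not known bad words."""
        sharing: Set[str] = set()
        for click_file in self.edges.get(bad_word, {}):
            sharing |= self.file_words[click_file]
        return sharing - known_bad - {bad_word}


graph = SearchClickGraph()
# The numeric example from the text: 200 searches, 150 and 80 clicks.
graph.add_word("term_x", 200)
graph.add_click("term_x", "first_doc", 150)
graph.add_click("term_x", "second_doc", 80)
# Hypothetical known bad search words that share the first document.
graph.add_word("bad_term", 40)
graph.add_click("bad_term", "first_doc", 30)
graph.add_word("other_bad", 5)
graph.add_click("other_bad", "first_doc", 2)

to_screen = graph.words_to_screen("bad_term", known_bad={"bad_term", "other_bad"})
```

Here only `term_x` remains to be screened, because `other_bad` is already a known bad search word.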
For each search word to be screened, the degree of similarity between the search word to be screened and the known bad search words is calculated as the similarity, according to the word weights and edge weights in the search click bipartite graph. In the search click bipartite graph, each known bad search word may yield one or more search words to be screened, and each search word to be screened may be obtained from one or more known bad search words. Therefore, for the i-th search word to be screened, the similarity S_i between it and the known bad search words can be expressed as:
Figure BDA0001713432000000121
where m represents the number of known bad search words connected to the same click file as the search word to be screened; k represents the k-th of the m known bad search words; p represents the number of click files connected to both the search word to be screened and the k-th known bad search word; W_ik represents the weight of the edge connecting the search word to be screened and the k-th click file; W_i represents the sum of the weights of the edges connecting the search word to be screened to all of its click files; W_jk represents the weight of the edge connecting the known bad search word and the k-th click file; and W_j represents the total weight of the edges connecting the known bad search word to all of its click files.
For each search term to be screened, the search terms to be screened whose similarity is greater than a preset value, for example greater than 70%, can be used as similar search terms. Alternatively, the search terms to be screened may be sorted by similarity in descending order, and a predetermined number of them selected as similar search terms, for example the 20 search terms to be screened with the highest similarity.
Therefore, with this method, the similar search terms are obtained by screening according to the similarity between the known bad search terms and the search terms to be screened. This provides one way of determining the similar search terms while effectively controlling their number, which makes the later determination of the hit condition easier.
In an embodiment of the regular expression generation method, step S102, in which search terms connected to the same click file as a known bad search term are obtained from the search click bipartite graph and used as similar search terms, may alternatively be carried out as follows:
in the search click bipartite graph, for each known bad search word, judging the weight of the known bad search word, and selecting the known bad search word with the weight larger than a second preset threshold value as a specific bad search word;
based on the search click bipartite graph, aiming at each specific bad search word, obtaining each search word which is connected with the same click file and is not determined as a bad search word, and using the search word as a search word to be screened;
and for each search word to be screened, acquiring each click file that is connected to both a specific bad search word and the search word to be screened, judging the weight of the edge connecting each such click file and the search word to be screened, and selecting the search words to be screened whose edge weight is greater than a third preset threshold value as similar search words.
Specifically, according to the word weights in the search click bipartite graph, the known bad search words whose word weight is greater than a predetermined value, for example greater than 2000, can be preferentially selected as the specific bad search words.
For each specific bad search word, one or more search words that are connected to the same click file as the specific bad search word and have not been determined to be bad search words are acquired from the search click bipartite graph as search words to be screened.
For each search word to be screened, the click files connected to both a specific bad search word and the search word to be screened are acquired from the search click bipartite graph, and the search words to be screened whose edge to such a click file has a weight greater than a threshold value are preferentially selected as similar search words.
Further, taking fig. 2 as an example, suppose the only known bad search words are query3 and query4. The numbers on the lines connecting query and doc nodes in the figure represent the weights of the edges connecting the corresponding search words and click files. If the word weight of query3 is 1 and the word weight of query4 is 10, query4 can be taken as the specific bad search word according to the word weights. As can be seen from the figure, the click files connected to query4 are doc1, doc2 and doc3, and the search words that share a click file with query4 are query1, query2, query3 and query5. Because query3 is a known bad search word, the search words to be screened are only query1, query2 and query5. If a search word to be screened is treated as a similar search word when the weight of its edge to the shared click file is greater than 2, then, as can be seen from the figure, the similar search words of query4 are query1, query2 and query5.
Therefore, because the similar search words are determined according to both the edge weights and the weights of the bad search words, this method gives tighter control over both their number and their degree of similarity to the bad search words, which in turn facilitates determining the hit condition.
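The weight-based variant above can be sketched as two filters, one on word weights and one on edge weights. The sample data below is hypothetical and does not reproduce fig. 2. The text does not say whether every shared edge or only one shared edge must exceed the third threshold, so the sketch assumes that one shared edge is enough.

```python
from typing import Dict, Set


def select_specific_bad_words(word_weight: Dict[str, int],
                              known_bad: Set[str],
                              min_word_weight: int) -> Set[str]:
    """Known bad search words whose word weight exceeds the second threshold."""
    return {word for word in known_bad if word_weight.get(word, 0) > min_word_weight}


def similar_words_by_edge_weight(edges: Dict[str, Dict[str, int]],
                                 known_bad: Set[str],
                                 specific_bad: Set[str],
                                 min_edge_weight: int) -> Set[str]:
    """Words sharing a click file with a specific bad word over a heavy enough edge.

    Assumption: a word qualifies if at least one edge to a shared click file
    has a weight greater than `min_edge_weight` (the third threshold).
    """
    similar: Set[str] = set()
    for bad_word in specific_bad:
        bad_files = set(edges.get(bad_word, {}))
        for word, files in edges.items():
            if word in known_bad:
                continue  # already determined to be a bad search word
            shared = bad_files & set(files)
            if any(files[f] > min_edge_weight for f in shared):
                similar.add(word)
    return similar


# Hypothetical graph data: word -> {click file: click count}.
edges = {
    "bad_heavy": {"doc1": 9, "doc2": 4},
    "bad_light": {"doc3": 1},
    "q1": {"doc1": 5},
    "q2": {"doc2": 1},
    "q3": {"doc3": 8},
}
word_weight = {"bad_heavy": 10, "bad_light": 1, "q1": 7, "q2": 3, "q3": 6}
known_bad = {"bad_heavy", "bad_light"}

specific_bad = select_specific_bad_words(word_weight, known_bad, min_word_weight=5)
similar = similar_words_by_edge_weight(edges, known_bad, specific_bad, min_edge_weight=2)
```

In this data `q2` is dropped because its shared edge is too light, and `q3` is dropped because it only shares a click file with a bad search word whose word weight is below the second threshold.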
In an embodiment of the regular expression generation method, before the candidate regular expressions are used to match the similar search terms in step S104, the method may further include:
calculating the bad degree of the candidate regular expression based on the search click bipartite graph, wherein the bad degree represents the relevance between the candidate regular expression and the known bad search word;
respectively matching the similar search terms by using the candidate regular expressions, wherein the matching comprises the following steps:
and respectively matching the similar search terms by using the candidate regular expressions with the bad degrees meeting the preset conditions.
Specifically, in the search click bipartite graph, for each known bad search term, one or more similar search terms may be obtained, and each similar search term may generate one or more regular segments, each regular segment corresponding to one candidate regular expression. Therefore, for each candidate regular expression, there is a certain relevance between the candidate regular expression and the known bad search word, and the relevance can be represented by the bad degree of the candidate regular expression.
According to the bad degree of each candidate regular expression, the candidate regular expressions whose bad degree meets a preset condition can be selected and used to match the similar search terms. The preset condition may be a preset bad degree, for example candidate regular expressions with a bad degree greater than 10, or a preset number of candidate regular expressions, for example the 10 candidate regular expressions with the largest bad degree.
Therefore, this method mainly consists of screening the candidate regular expressions. Its main purpose is to retain the candidate regular expressions that are highly related to the known bad search terms, so that the hit condition of the candidate regular expressions can be counted conveniently.
In an embodiment of the regular expression generation method, the bad degree of the candidate regular expression may be calculated based on the search click bipartite graph as follows:
for each candidate regular expression, calculating the relevance between the candidate regular expression and the known poor search terms according to the similarity between each similar search term corresponding to the candidate regular expression and the known poor search terms;
for the i-th candidate regular expression, its bad degree Z_i is expressed as:
Figure BDA0001713432000000151
where n represents the number of similar search terms from which the candidate regular expression can be generated; m represents the number of known bad search terms connected to the same click file as those similar search terms; S_ij represents the degree of similarity between the i-th similar search term and the j-th known bad search term connected to the same click file; and C_j represents the weight of the j-th known bad search term;
wherein the degree of similarity S_ij between the i-th similar search term and the j-th known bad search term connected to the same click file is expressed as:
Figure BDA0001713432000000152
where p represents the number of click files connected to both the similar search term and the known bad search term; W_ik represents the weight of the edge connecting the similar search term and the k-th click file; W_i represents the sum of the weights of the edges connecting the similar search term to all of its click files; W_jk represents the weight of the edge connecting the known bad search term and the k-th click file; and W_j represents the total weight of the edges connecting the known bad search term to all of its click files.
Further, taking fig. 2 as an example, assume that the known bad search word is query4 and the similar search words are query1, query2, query3 and query5, and that there is a candidate regular expression obtained by extracting regular segments from query2 and query5, where the word weights of query2 and query5 are both 3. The bad degree Z of the candidate regular expression is then calculated as follows:
Figure BDA0001713432000000161
As can be seen from the above, candidate regular expressions are selected mainly according to their relevance to the known bad search terms: the greater the relevance, the better the candidate regular expression is likely to filter bad information.
In an embodiment of the regular expression generation method, step S105, in which, for each similar search term participating in matching, whether the similar search term is hit is determined based on the matching result and on whether the similar search term is determined to be a bad search term in other ways, may include:
and if the matching result of the similar search word is matched and the similar search word is judged to be a bad search word by other modes, determining that the similar search word is hit, and otherwise, determining that the similar search word is not hit.
Or, if the matching result of the similar search word is not matched and the similar search word is judged to be a non-bad search word by other methods, determining that the similar search word is hit, otherwise, determining that the similar search word is not hit.
Specifically, when the candidate regular expressions are used to match the similar search terms, four cases may occur for a similar search term: it is matched and is judged to be a bad search term in other ways; it is not matched and is judged to be a non-bad search term in other ways; it is matched and is judged to be a non-bad search term in other ways; or it is not matched and is judged to be a bad search term in other ways. For each candidate regular expression, if either of the first two cases occurs, the candidate regular expression is considered to hit the corresponding similar search term; if either of the last two cases occurs, the candidate regular expression is considered to miss the corresponding similar search term.
With this method of judging whether a candidate regular expression hits each similar search term participating in matching, the four cases cover all possible outcomes, so the hit condition is determined more accurately.
In an embodiment of the regular expression generation method, step S105, in which, for each similar search term participating in matching, whether the similar search term is hit is determined based on the matching result and on whether the similar search term is determined to be a bad search term in other ways, may alternatively include:
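The four cases reduce to a single comparison: a hit occurs when the matching result agrees with the judgment made in other ways. The sketch below uses Python's `re.search`; whether matching means a partial or a full match is not stated in the text, so partial matching is an assumption here, as are the function name, the sample pattern and the sample terms.

```python
import re


def is_hit(pattern: str, term: str, judged_bad: bool) -> bool:
    """Hit when the match result agrees with the independent bad / non-bad judgment."""
    matched = re.search(pattern, term) is not None  # assumption: partial match counts
    return matched == judged_bad


PATTERN = r"free\d+"  # hypothetical candidate regular expression

case_matched_bad = is_hit(PATTERN, "free123 movie", judged_bad=True)        # case 1
case_unmatched_good = is_hit(PATTERN, "weather today", judged_bad=False)    # case 2
case_matched_good = is_hit(PATTERN, "free123 movie", judged_bad=False)      # case 3
case_unmatched_bad = is_hit(PATTERN, "weather today", judged_bad=True)      # case 4
```

The first two cases evaluate to hits and the last two to misses, mirroring the enumeration above.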
and if the matching result of the similar search word is matched and the similar search word is judged to be a bad search word by other methods, determining that the similar search word is hit, and if the matching result of the similar search word is matched and the similar search word is judged to be a non-bad search word by other methods, determining that the similar search word is not hit.
In this second embodiment, when the candidate regular expressions are used to match the similar search terms, whether a similar search term is hit is determined only when the candidate regular expression matches it. That is, when a similar search term matches the candidate regular expression and is judged to be a bad search term in other ways, the similar search term is determined to be hit; when a similar search term matches the candidate regular expression but is judged to be a non-bad search term in other ways, the similar search term is determined to be not hit.
With this method of judging whether a candidate regular expression hits each similar search term participating in matching, only the similar search terms that the candidate regular expression matches are considered when determining the hit condition, so the determination is simpler and more convenient in practice.
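Under this second rule, unmatched terms are simply left out of the tally. A possible sketch is shown below; the function names, the use of `None` for "not considered", the partial-match semantics of `re.search`, and the sample data are all assumptions for illustration.

```python
import re
from typing import Dict, Optional, Tuple


def hit_if_matched(pattern: str, term: str, judged_bad: bool) -> Optional[bool]:
    """Return None for unmatched terms; otherwise hit only if judged bad in other ways."""
    if re.search(pattern, term) is None:
        return None  # unmatched terms are not considered under this rule
    return judged_bad


def tally(pattern: str, judgments: Dict[str, bool]) -> Tuple[int, int]:
    """Count (hits, misses) over the similar search terms the pattern matches."""
    hits = misses = 0
    for term, judged_bad in judgments.items():
        outcome = hit_if_matched(pattern, term, judged_bad)
        if outcome is True:
            hits += 1
        elif outcome is False:
            misses += 1
    return hits, misses


# Hypothetical similar search terms with their judgment from other methods.
judgments = {
    "free123 movie": True,    # matched and bad -> hit
    "free999 guide": False,   # matched but not bad -> miss
    "weather today": False,   # unmatched -> ignored
    "late show": True,        # unmatched -> ignored
}
hits, misses = tally(r"free\d+", judgments)
```

Compared with the four-case rule, the unmatched terms contribute nothing, so only two of the four sample terms are counted.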
In summary, according to the regular expression generation method provided by the embodiment of the present invention, the regular expression can be generated according to the similar search term, and the existing regular expression can be continuously updated.
Based on the same inventive concept, according to the regular expression generation method provided in the embodiments of the present invention, an embodiment of the present invention further provides a regular expression generation apparatus, as shown in fig. 3, including the following modules:
a bad search word obtaining module 201, configured to obtain a known bad search word;
the similar search term obtaining module 202 is configured to obtain, based on a search click bipartite graph, search terms of a same click file as a search term retrieved from a known bad search term, as similar search terms, where the search click bipartite graph represents a connection relationship between the search terms and click files clicked by users in search results corresponding to the search terms;
the regular expression generating module 203 is configured to extract regular segments from the similar search terms to obtain a regular expression, which is used as a candidate regular expression;
the matching module 204 is configured to match the similar search terms respectively by using the candidate regular expressions;
a hit condition determining module 205, configured to determine, for each similar search term participating in matching, whether the similar search term hits based on a matching result and a determination result of whether the similar search term is determined to be a bad search term in another manner;
the regular expression selection module 206 is configured to determine a preset number of candidate regular expressions as regular expressions for filtering the search terms based on hit conditions of the candidate regular expressions on the similar search terms participating in matching.
Further, the similar search term obtaining module 202 may include:
the search word to be screened acquiring submodule is used for acquiring, for each known bad search word, each search word in the search click bipartite graph that is connected to the same click file as the known bad search word and is not determined as a bad search word, as a search word to be screened;
the similarity calculation submodule is used for calculating, for each search word to be screened, the similarity between the search word to be screened and the known bad search word;
and the similar search word selecting submodule is used for selecting the search word to be screened with the similarity larger than a first preset threshold value as the similar search word according to the size of the similarity.
Further, the similar search term obtaining module 202 may also include:
the specific bad search word acquisition sub-module is used for judging the weight of each known bad search word in the search click bipartite graph, and selecting the known bad search word with the weight larger than a second preset threshold value as the specific bad search word;
the search word to be screened acquiring submodule is used for acquiring each search word which is connected with the same click file and is not determined as a bad search word aiming at each specific bad search word based on the search click bipartite graph, and the search word is used as the search word to be screened;
and the similar search word selection sub-module is used for acquiring, for each search word to be screened, each click file that is connected to both the specific bad search word and the search word to be screened, judging the weight of the edge connecting each such click file and the search word to be screened, and selecting the search words to be screened whose edge weight is greater than a third preset threshold value as similar search words.
Further, the above apparatus may further include:
the bad degree calculation module is used for calculating the bad degree of the candidate regular expression based on the search click bipartite graph, and the bad degree represents the relevance between the candidate regular expression and the known bad search word;
the matching module 204 is specifically configured to match each similar search term respectively by using the candidate regular expressions whose degrees of badness meet the preset conditions.
Further, the bad degree calculating module is specifically configured to calculate, for each candidate regular expression, a correlation between the candidate regular expression and a known bad search word according to a degree of similarity between each similar search word corresponding to the candidate regular expression and the known bad search word;
for the i-th candidate regular expression, its bad degree Z_i is expressed as:
Figure BDA0001713432000000191
where n represents the number of similar search terms from which the candidate regular expression can be generated; m represents the number of known bad search terms connected to the same click file as those similar search terms; S_ij represents the degree of similarity between the i-th similar search term and the j-th known bad search term connected to the same click file; and C_j represents the weight of the j-th known bad search term;
wherein the degree of similarity S_ij between the i-th similar search term and the j-th known bad search term connected to the same click file is expressed as:
Figure BDA0001713432000000192
where p represents the number of click files connected to both the similar search term and the known bad search term; W_ik represents the weight of the edge connecting the similar search term and the k-th click file; W_i represents the sum of the weights of the edges connecting the similar search term to all of its click files; W_jk represents the weight of the edge connecting the known bad search term and the k-th click file; and W_j represents the total weight of the edges connecting the known bad search term to all of its click files.
Further, the hit determination module 205 may include:
a first hit condition determining sub-module, configured to determine that the similar search word is hit if a matching result of the similar search word is a match and the similar search word is determined to be a bad search word by other methods, and otherwise, determine that the similar search word is not hit;
and the second hit condition determining submodule is used for determining that the similar search word is hit if the matching result of the similar search word is not matched and the similar search word is judged to be a non-poor search word by other modes, or else, determining that the similar search word is not hit.
Further, the hit condition determining module 205 is specifically configured to determine that the similar search word is hit if the matching result of the similar search word is a match and the similar search word is determined as a bad search word by other methods, and determine that the similar search word is not hit if the matching result of the similar search word is a match and the similar search word is determined as a non-bad search word by other methods.
An embodiment of the present invention further provides an electronic device, as shown in fig. 4, including a processor 401, a communication interface 402, a memory 403, and a communication bus 404, where the processor 401, the communication interface 402, and the memory 403 complete mutual communication through the communication bus 404,
a memory 403 for storing a computer program;
the processor 401, when executing the program stored in the memory 403, implements the following steps:
obtaining known bad search terms;
based on a search click bipartite graph, obtaining search words of a same click file searched by known bad search words as similar search words, wherein the search click bipartite graph represents the connection relation between the search words and click files clicked by users in corresponding search results;
extracting regular segments from the similar search terms to obtain a regular expression as a candidate regular expression;
respectively matching the similar search terms by using a candidate regular expression;
for each similar search word participating in matching, determining whether the similar search word is hit or not based on a matching result and a determination result of whether the similar search word is determined to be a bad search word in other ways;
and selecting a preset number of candidate regular expressions as regular expressions for filtering the search words based on the hit condition of the candidate regular expressions on each similar search word participating in matching.
The communication bus mentioned in the electronic device may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The communication bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, only one thick line is shown, but this does not mean that there is only one bus or one type of bus.
The communication interface is used for communication between the electronic equipment and other equipment.
The Memory may include a Random Access Memory (RAM) or a Non-Volatile Memory (NVM), such as at least one disk Memory. Optionally, the memory may also be at least one memory device located remotely from the processor.
The Processor may be a general-purpose Processor, including a Central Processing Unit (CPU), a Network Processor (NP), and the like; but also Digital Signal Processors (DSPs), Application Specific Integrated Circuits (ASICs), Field Programmable Gate Arrays (FPGAs) or other Programmable logic devices, discrete Gate or transistor logic devices, discrete hardware components.
In yet another embodiment of the present invention, a computer-readable storage medium is further provided, in which a computer program is stored, and the computer program, when executed by a processor, implements the steps of any of the regular expression generation methods described above.
In yet another embodiment, a computer program product containing instructions is also provided, which when run on a computer causes the computer to perform any of the regular expression generation methods in the above embodiments.
In the above embodiments, the implementation may be realized wholly or partially by software, hardware, firmware, or any combination thereof. When implemented in software, it may be realized wholly or partially in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the processes or functions described in the embodiments of the invention are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable device. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another, for example, from one website, computer, server, or data center to another website, computer, server, or data center by wired (e.g., coaxial cable, optical fiber, Digital Subscriber Line (DSL)) or wireless (e.g., infrared, wireless, microwave) means. The computer-readable storage medium can be any available medium that a computer can access, or a data storage device, such as a server or data center, that integrates one or more available media. The available medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., Solid State Disk (SSD)), among others.
It is noted that, herein, relational terms such as first and second are used solely to distinguish one entity or action from another entity or action, without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element introduced by the phrase "comprising a ..." does not exclude the presence of additional identical elements in the process, method, article, or apparatus that comprises the element.
All the embodiments in the present specification are described in a related manner, and the same and similar parts among the embodiments may be referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, for embodiments of devices, electronic devices, and the like, since they are substantially similar to the method embodiments, the description is relatively simple, and reference may be made to some descriptions of the method embodiments for relevant points.
The above description is only for the preferred embodiment of the present invention, and is not intended to limit the scope of the present invention. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention shall fall within the protection scope of the present invention.

Claims (15)

1. A regular expression generation method, comprising:
obtaining known bad search terms;
acquiring search terms of a same click file searched by the known bad search terms based on a search click bipartite graph as similar search terms, wherein the search click bipartite graph represents a connection relation between the search terms and click files clicked by users in corresponding search results;
extracting regular segments from the similar search terms to obtain a regular expression as a candidate regular expression;
matching each similar search term respectively by using the candidate regular expressions;
for each similar search word participating in matching, determining whether the similar search word is hit or not based on a matching result and a determination result of whether the similar search word is determined to be a bad search word in other ways;
selecting a preset number of candidate regular expressions as regular expressions for filtering search terms based on the hit condition of the candidate regular expressions on each similar search term participating in matching;
wherein the step of extracting regular segments from the similar search terms to obtain a regular expression as a candidate regular expression includes:
for each similar search term, extracting one or more regular segments from the similar search term according to a candidate pattern; and generating a regular expression corresponding to each regular segment as a candidate regular expression; wherein the candidate pattern is a regular pattern of simple structure determined according to the structure of the similar search term.
2. The method of claim 1, wherein the acquiring, based on the search click bipartite graph, of search terms connected to the same click file as the known bad search terms as similar search terms comprises:
for each known bad search word, acquiring, from the search click bipartite graph, each search word that is connected to the same click file as the known bad search word and has not been determined to be a bad search word, as a search word to be screened;
for each search word to be screened, calculating the similarity between the search word to be screened and the known bad search word;
and selecting, as similar search words, the search words to be screened whose similarity is greater than a first preset threshold.
3. The method of claim 1, wherein the acquiring, based on the search click bipartite graph, of search terms connected to the same click file as the known bad search terms as similar search terms comprises:
in the search click bipartite graph, for each known bad search word, judging the weight of the known bad search word, and selecting the known bad search words whose weight is greater than a second preset threshold as specific bad search words;
based on the search click bipartite graph, for each specific bad search word, acquiring each search word that is connected to the same click file as the specific bad search word and has not been determined to be a bad search word, as a search word to be screened;
and for each search word to be screened, acquiring each click file to which both the specific bad search word and the search word to be screened are connected, judging the weight of the edge connecting the search word to be screened to that click file, and selecting the search words to be screened whose edge weight is greater than a third preset threshold as similar search words.
4. The method of claim 1, wherein, before the matching of each similar search term using the candidate regular expressions, the method further comprises:
calculating the bad degree of the candidate regular expression based on the search click bipartite graph, wherein the bad degree represents the relevance between the candidate regular expression and the known bad search word;
the matching of each similar search term by using the candidate regular expression respectively comprises:
and respectively matching each similar search term by using the candidate regular expressions with the bad degrees meeting the preset conditions.
5. The method of claim 4, wherein computing the degree of badness of the candidate regular expression based on the search click bipartite graph comprises:
for each candidate regular expression, calculating the relevance between the candidate regular expression and the known bad search terms according to the degree of similarity between each similar search term corresponding to the candidate regular expression and the known bad search terms;
for the ith candidate regular expression, its degree of badness Z_i is expressed as:
[Formula image FDA0003184306170000021: expression for Z_i, not reproduced in this text]
where n represents the number of similar search terms from which the candidate regular expression can be generated, m represents the number of known bad search terms connected to the same click file as the similar search terms from which the candidate regular expression is generated, S_ij represents the degree of similarity between the ith similar search term and the jth known bad search term connected to the same click file, and C_j represents the weight of the jth known bad search term;
wherein the degree of similarity S_ij between the ith similar search term and the jth known bad search term connected to the same click file is expressed as:
[Formula image FDA0003184306170000031: expression for S_ij, not reproduced in this text]
where p represents the number of click files to which both the similar search term and the known bad search term are connected, W_ik represents the weight of the edge connecting the similar search term to the kth click file, W_i represents the sum of the weights of the edges to all click files connected to the similar search term, W_jk represents the weight of the edge connecting the known bad search term to the kth click file, and W_j represents the weight of the edges to all click files connected to the known bad search term.
6. The method according to claim 1, wherein the determining, for each similar search word participating in matching, of whether the similar search word is hit based on the matching result and on the judgment, obtained by other means, of whether the similar search word is a bad search word comprises:
if the matching result for the similar search word is a match and the similar search word is judged by other means to be a bad search word, determining that the similar search word is hit; otherwise, determining that the similar search word is not hit;
or, if the matching result for the similar search word is no match and the similar search word is judged by other means to be a non-bad search word, determining that the similar search word is hit; otherwise, determining that the similar search word is not hit.
7. The method according to claim 1, wherein the determining, for each similar search word participating in matching, of whether the similar search word is hit based on the matching result and on the judgment, obtained by other means, of whether the similar search word is a bad search word comprises:
if the matching result for the similar search word is a match and the similar search word is judged by other means to be a bad search word, determining that the similar search word is hit; and if the matching result for the similar search word is a match and the similar search word is judged by other means to be a non-bad search word, determining that the similar search word is not hit.
8. A regular expression generation apparatus, comprising:
a bad search word acquisition module, configured to acquire known bad search words;
a similar search word acquisition module, configured to acquire, based on a search click bipartite graph, search words that are connected to the same click file as the known bad search words, as similar search words, wherein the search click bipartite graph represents the connection relation between search words and the click files clicked by users in the corresponding search results;
a regular expression generation module, configured to extract regular segments from the similar search words to obtain a regular expression as a candidate regular expression;
a matching module, configured to match each similar search word using the candidate regular expressions;
a hit condition determining module, configured to determine, for each similar search word participating in matching, whether the similar search word is hit based on the matching result and on a judgment, obtained by other means, of whether the similar search word is a bad search word;
a regular expression selection module, configured to select a preset number of candidate regular expressions as regular expressions for filtering search words, based on the hit condition of the candidate regular expressions on each similar search word participating in matching;
wherein the regular expression generation module is specifically configured to, for each similar search word, extract one or more regular segments from the similar search word according to a candidate pattern, and generate a regular expression corresponding to each regular segment as a candidate regular expression; the candidate pattern being a regular pattern of simple structure determined according to the structure of the similar search word.
9. The apparatus of claim 8, wherein the similar search term obtaining module comprises:
a to-be-screened search word acquisition submodule, configured to acquire, from the search click bipartite graph, each search word that is connected with the known bad search word and has not been determined to be a bad search word, as a search word to be screened;
a similarity calculation submodule, configured to calculate, for each search word to be screened, the similarity between the search word to be screened and the known bad search word;
and a similar search word selection submodule, configured to select, as similar search words, the search words to be screened whose similarity is greater than a first preset threshold.
10. The apparatus of claim 8, wherein the similar search term obtaining module comprises:
a specific bad search word acquisition submodule, configured to judge the weight of each known bad search word in the search click bipartite graph, and to select the known bad search words whose weight is greater than a second preset threshold as specific bad search words;
a to-be-screened search word acquisition submodule, configured to acquire, based on the search click bipartite graph and for each specific bad search word, each search word that is connected to the same click file as the specific bad search word and has not been determined to be a bad search word, as a search word to be screened;
and a similar search word selection submodule, configured to acquire, for each search word to be screened, each click file to which both the specific bad search word and the search word to be screened are connected, to judge the weight of the edge connecting the search word to be screened to that click file, and to select the search words to be screened whose edge weight is greater than a third preset threshold as similar search words.
11. The apparatus of claim 8, further comprising:
a bad degree calculation module, configured to calculate the bad degree of the candidate regular expression based on the search click bipartite graph, wherein the bad degree represents the relevance between the candidate regular expression and the known bad search word;
wherein the matching module is specifically configured to match each similar search term using the candidate regular expressions whose bad degrees satisfy preset conditions.
12. The apparatus according to claim 11, wherein the bad degree calculation module is specifically configured to, for each candidate regular expression, calculate the relevance between the candidate regular expression and the known bad search terms according to the degree of similarity between each similar search term corresponding to the candidate regular expression and the known bad search terms;
for the ith candidate regular expression, its degree of badness Z_i is expressed as:
[Formula image FDA0003184306170000051: expression for Z_i, not reproduced in this text]
where n represents the number of similar search terms from which the candidate regular expression can be generated, m represents the number of known bad search terms connected to the same click file as the similar search terms from which the candidate regular expression is generated, S_ij represents the degree of similarity between the ith similar search term and the jth known bad search term connected to the same click file, and C_j represents the weight of the jth known bad search term;
wherein the degree of similarity S_ij between the ith similar search term and the jth known bad search term connected to the same click file is expressed as:
[Formula image FDA0003184306170000061: expression for S_ij, not reproduced in this text]
where p represents the number of click files to which both the similar search term and the known bad search term are connected, W_ik represents the weight of the edge connecting the similar search term to the kth click file, W_i represents the sum of the weights of the edges to all click files connected to the similar search term, W_jk represents the weight of the edge connecting the known bad search term to the kth click file, and W_j represents the weight of the edges to all click files connected to the known bad search term.
13. The apparatus of claim 8, wherein the hit determination module comprises:
a first hit condition determining submodule, configured to determine that the similar search word is hit if the matching result for the similar search word is a match and the similar search word is judged by other means to be a bad search word, and otherwise to determine that the similar search word is not hit;
and a second hit condition determining submodule, configured to determine that the similar search word is hit if the matching result for the similar search word is no match and the similar search word is judged by other means to be a non-bad search word, and otherwise to determine that the similar search word is not hit.
14. The apparatus according to claim 8, wherein the hit determination module is specifically configured to determine that the similar search word is hit if the matching result of the similar search word is a match and the similar search word is determined to be a bad search word by other means, and determine that the similar search word is not hit if the matching result of the similar search word is a match and the similar search word is determined to be a non-bad search word by other means.
15. An electronic device, comprising a processor, a communication interface, a memory and a communication bus, wherein the processor, the communication interface and the memory communicate with one another through the communication bus;
a memory for storing a computer program;
a processor for implementing the method steps of any of claims 1-6 when executing a program stored in the memory.
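For illustration only, the flow recited in claims 1, 2 and 7 might be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the patented implementation: the graph data, the JUDGED_BAD verdicts, the threshold value, and all function names are hypothetical; the similarity measure is a stand-in (a weighted overlap of click files) because the claimed formulas are given only as images; the "candidate pattern" is assumed to be one pattern per whitespace-separated token with digit runs generalized; and ranking by hits minus false hits is an assumed selection rule.

```python
import re
from collections import Counter

# Hypothetical search click bipartite graph: search term -> {click file: edge weight}.
GRAPH = {
    "game crack download": {"f1": 5, "f2": 1},
    "game crack free": {"f1": 4},
    "crack tool 2018": {"f2": 2},
    "game review": {"f1": 1, "f3": 6},
    "cooking show": {"f3": 3},
}
KNOWN_BAD = {"game crack download"}
# Verdicts obtained "by other means" (e.g. manual review); hypothetical data.
JUDGED_BAD = {"game crack free": True, "crack tool 2018": True, "game review": False}


def similarity(graph, a, b):
    """Stand-in similarity (an assumption): weighted overlap of shared click files."""
    shared = graph[a].keys() & graph[b].keys()
    overlap = sum(min(graph[a][f], graph[b][f]) for f in shared)
    total = sum(graph[a].values()) + sum(graph[b].values()) - overlap
    return overlap / total if total else 0.0


def similar_terms(graph, known_bad, threshold):
    """Terms sharing a click file with a known bad term whose similarity exceeds the threshold."""
    result = set()
    for bad in known_bad:
        for term in graph:
            if term in known_bad or not (graph[term].keys() & graph[bad].keys()):
                continue
            if similarity(graph, term, bad) > threshold:
                result.add(term)
    return result


def candidate_regexes(term):
    """Assumed simple-structure patterns: one per token, digit runs replaced by \\d+."""
    return {re.sub(r"\d+", lambda _: r"\d+", re.escape(tok)) for tok in term.split()}


def is_hit(matched, judged_bad):
    """Claim 7 style: a match on a term judged bad is a hit; a match on a non-bad term is not."""
    return matched and judged_bad


def select_regexes(similar, judged_bad, top_n):
    """Score each candidate on the similar terms and keep the top_n (assumed ranking rule)."""
    candidates = set().union(*(candidate_regexes(t) for t in similar))
    scores = Counter()
    for pattern in candidates:
        for term in similar:
            if re.search(pattern, term):
                scores[pattern] += 1 if is_hit(True, judged_bad[term]) else -1
    # Sort by descending score, then alphabetically so the result is deterministic.
    return sorted(candidates, key=lambda p: (-scores[p], p))[:top_n]


similar = similar_terms(GRAPH, KNOWN_BAD, threshold=0.05)
print(select_regexes(similar, JUDGED_BAD, top_n=1))  # prints ['crack']
```

In this toy data the pattern "crack" matches two similar search terms judged bad and none judged non-bad, so it ranks first, while "game" also matches a non-bad term and is penalized. A production system would presumably replace the stand-in similarity and ranking rule with the measures defined in the specification.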
CN201810695221.3A 2018-06-29 2018-06-29 Regular expression generation method and device and electronic equipment Active CN109190014B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810695221.3A CN109190014B (en) 2018-06-29 2018-06-29 Regular expression generation method and device and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810695221.3A CN109190014B (en) 2018-06-29 2018-06-29 Regular expression generation method and device and electronic equipment

Publications (2)

Publication Number Publication Date
CN109190014A CN109190014A (en) 2019-01-11
CN109190014B true CN109190014B (en) 2021-11-26

Family

ID=64948682

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810695221.3A Active CN109190014B (en) 2018-06-29 2018-06-29 Regular expression generation method and device and electronic equipment

Country Status (1)

Country Link
CN (1) CN109190014B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110083758A (en) * 2019-04-30 2019-08-02 闻康集团股份有限公司 A kind of medical treatment search engine data platform system
CN111292205B (en) * 2019-12-17 2021-05-25 东方微银科技股份有限公司 Judicial data analysis method, device, equipment and storage medium
CN113343715B (en) * 2021-06-29 2023-04-07 深圳前海微众银行股份有限公司 Method, device and equipment for automatically generating regular expression and storage medium
CN113656538A (en) * 2021-07-09 2021-11-16 深圳价值在线信息科技股份有限公司 Method and device for generating regular expression, computing equipment and storage medium
CN113656659A (en) * 2021-08-31 2021-11-16 上海观安信息技术股份有限公司 Data extraction method, device and system and computer readable storage medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101847242A (en) * 2010-05-27 2010-09-29 武汉大学 Method and system for automatically acquiring aliases of contraband on line
CN104809108A (en) * 2015-05-20 2015-07-29 成都布林特信息技术有限公司 Information monitoring and analyzing system
CN106919603A (en) * 2015-12-25 2017-07-04 北京奇虎科技有限公司 The method and apparatus for calculating participle weight in query word pattern

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140136517A1 (en) * 2012-11-10 2014-05-15 Chian Chiu Li Apparatus And Methods for Providing Search Results

Also Published As

Publication number Publication date
CN109190014A (en) 2019-01-11

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant