Detailed Description
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The embodiments described in the following exemplary embodiments do not represent all embodiments consistent with the present specification. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the specification, as detailed in the appended claims.
The terminology used in the description herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the description. As used in this specification and the appended claims, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any and all possible combinations of one or more of the associated listed items.
It should be understood that although the terms first, second, third, etc. may be used herein to describe various information, such information should not be limited by these terms. These terms are only used to distinguish one type of information from another. For example, first information may also be referred to as second information, and similarly, second information may also be referred to as first information, without departing from the scope of the present specification. Depending on the context, the word "if" as used herein may be interpreted as "upon", "when", or "in response to a determination".
As mentioned above, existing keyword rules are mainly added either by automatically mining keywords or by manually adding keywords.
Automatic mining of keywords: keywords are extracted from black samples by a keyword extraction algorithm such as TextRank, TF-IDF (Term Frequency-Inverse Document Frequency), or LDA (Latent Dirichlet Allocation). However, keywords extracted in this way may also appear in large numbers in white samples.
That is, if a keyword rule is generated from keywords extracted in this manner, normal information is easily treated as bad information in practical application, which harms the user experience. For example, for cash-out information, suppose that a keyword extracted from the black samples is "credit card". Because "credit card" also appears frequently in normal information, a keyword rule generated from the keyword "credit card" would also identify normal information containing "credit card" as bad information.
Manual addition of keywords: technical or operations personnel construct keywords based on their professional knowledge and accumulated experience, and then combine the keywords into keyword rules. The manual approach is not only inefficient; because of the limitations of individuals, it also cannot comprehensively perceive and control bad content across the whole internet. As a result, manually built keyword rules can cover only part of the bad information.
The keyword rule mentioned in the present specification consists of at least one keyword. For example, for fund withdrawal (cash-out) information, the following keyword rules may exist: Huabei ^ seconds back, Huabei ^ Weixin, Baitiao ^ bank card. Here, the rule Huabei ^ seconds back means that if the two keywords "Huabei" and "seconds back" both appear in a piece of text information, that text information is high risk.
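As a minimal sketch only, the way a multi-keyword rule is applied to text can be illustrated as follows. The function name `rule_matches`, the tuple representation of a rule, plain substring matching, and the sample messages are all assumptions made for illustration; the specification does not prescribe a matching implementation.

```python
def rule_matches(rule, text):
    """Return True when every keyword of the rule occurs in the text.

    A rule is modelled here as a tuple of keywords joined by "^" in the
    specification's notation; substring matching is an assumption.
    """
    return all(keyword in text for keyword in rule)


# Hypothetical rule and messages, for illustration only.
rule = ("credit card", "cash out")
hit = rule_matches(rule, "cash out your credit card balance today")
miss = rule_matches(rule, "your credit card statement is ready")
```

Under these assumptions, a message is flagged only when all keywords of the rule are present, which is why a rule with more keywords is harder to trigger on normal information than a single keyword.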
It should be noted that, in this specification, keyword rules are mainly applied to text information; multimedia information such as picture information, video information, and audio information requires preprocessing first. Specifically, the text in picture information and video information may be recognized by an image recognition technique such as OCR (Optical Character Recognition), and the recognized text information may then be processed by applying a keyword rule. Audio information can be converted into text information by a speech recognition technique, and the recognized text information is then processed by applying a keyword rule.
An embodiment of a method for generating a keyword rule according to the present specification is described below with reference to the example shown in fig. 1. The method may include the following steps:
Step 110: determine a basic keyword.
In one embodiment, the basic keyword may be an existing keyword or a keyword that is manually input.
In one embodiment, the basic keyword may also be automatically determined, for example, the basic keyword may be automatically crawled from a network or obtained from a database storing the basic keyword. For another example, the keywords obtained by automatically mining the keywords may be used as the basic keywords.
Step 120: according to the basic keyword, determine, from black samples and white samples, a black sample keyword set and a white sample keyword set that are similar to the basic keyword.
In one embodiment, the black sample may refer to bad information that has been identified. The white sample may be normal information (non-objectionable information) that has been identified.
In one embodiment, keywords may first be extracted from the black samples and the white samples by a keyword extraction technique, and a similarity algorithm may then be used to calculate the similarity between the basic keyword and each extracted keyword. Specifically, for text information, text similarity algorithms such as the SimHash algorithm, the Jaccard similarity algorithm, and the cosine similarity algorithm may be employed. The keyword extraction techniques may include, for example, syntactic analysis algorithms. Since keyword extraction algorithms and similarity algorithms are commonly used in the field, they are not described further in this specification.
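For illustration only, one of the named similarity measures, Jaccard similarity, can be sketched as below. Comparing keywords by their character bigrams is an assumption of this sketch (the specification does not say which features are compared), and the helper names `char_ngrams` and `jaccard_similarity` are hypothetical.

```python
def char_ngrams(word, n=2):
    """Split a word into its set of character n-grams.

    Words no longer than n are kept whole so that short words still
    produce a non-empty set.
    """
    if len(word) <= n:
        return {word}
    return {word[i:i + n] for i in range(len(word) - n + 1)}


def jaccard_similarity(a, b, n=2):
    """Jaccard similarity |A & B| / |A | B| over character n-gram sets."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    union = grams_a | grams_b
    if not union:
        return 0.0
    return len(grams_a & grams_b) / len(union)
```

For example, `jaccard_similarity("loan", "loans")` compares the bigram sets {lo, oa, an} and {lo, oa, an, ns}, which share three of four distinct bigrams. In practice an embedding-based or SimHash-based measure could be substituted without changing the later steps.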
In an embodiment, the step 120 determines a black sample keyword set and a white sample keyword set similar to the basic keyword from the black sample and the white sample according to the basic keyword, and specifically includes:
calculating the similarity between the keywords extracted from the black sample and the basic keywords;
calculating the similarity between the keywords extracted from the white sample and the basic keywords;
determining a preset number of keywords with the highest similarity value in the black sample as a black sample keyword set;
and determining the preset number of keywords with the highest similarity value in the white samples as a white sample keyword set.
In this embodiment, the preset number may be an empirical value set in advance. As described above, the similarity value between each keyword extracted from the black samples and the basic keyword can be calculated by the similarity algorithm; similarly, the similarity value between each keyword extracted from the white samples and the basic keyword can be calculated by the similarity algorithm. Generally, the higher the similarity value, the more similar the keyword is to the basic keyword. Therefore, the preset number of keywords with the highest similarity values among the keywords extracted from the black samples can be determined as the black sample keyword set, and the preset number of keywords with the highest similarity values among the keywords extracted from the white samples can be determined as the white sample keyword set.
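The preset-number selection described above can be sketched as follows. This is an illustrative sketch, not the implementation of the embodiment: `top_n_similar` and `shared_char_ratio` are hypothetical names, the toy similarity function stands in for whichever similarity algorithm is actually used, and the candidate keywords are invented.

```python
import heapq


def top_n_similar(base, candidates, similarity, n):
    """Return, as a set, the n candidates most similar to the base keyword."""
    return set(heapq.nlargest(n, candidates, key=lambda kw: similarity(base, kw)))


def shared_char_ratio(a, b):
    """Toy similarity (an assumption): Jaccard ratio over the characters of two words."""
    chars_a, chars_b = set(a), set(b)
    union = chars_a | chars_b
    if not union:
        return 0.0
    return len(chars_a & chars_b) / len(union)


# Hypothetical keywords extracted from black samples.
black_candidates = ["loan", "loans", "fee", "lender"]
black_set = top_n_similar("loan", black_candidates, shared_char_ratio, n=2)
```

The same call, applied to the keywords extracted from the white samples, would yield the white sample keyword set.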
In an embodiment, the step 120 determines a black sample keyword set and a white sample keyword set similar to the basic keyword from the black sample and the white sample according to the basic keyword, and specifically includes:
calculating the similarity between the keywords extracted from the black sample and the basic keywords;
calculating the similarity between the keywords extracted from the white sample and the basic keywords;
determining keywords with similarity values larger than a threshold value in the black sample as a black sample keyword set;
and determining the keywords with similarity values larger than a threshold value in the white sample as a white sample keyword set.
In this embodiment, as described above, the similarity value between each keyword extracted from the black samples and the basic keyword can be calculated by the similarity algorithm; similarly, the similarity value between each keyword extracted from the white samples and the basic keyword can be calculated by the similarity algorithm. Generally, the higher the similarity value, the more similar the keyword is to the basic keyword. Therefore, the keywords extracted from the black samples whose similarity values are larger than the threshold can be determined as the black sample keyword set, and the keywords extracted from the white samples whose similarity values are larger than the threshold can be determined as the white sample keyword set.
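The threshold-based selection can be sketched in the same spirit. The function name `keywords_above_threshold`, the dictionary of precomputed similarity scores, and the score values are assumptions for illustration only.

```python
def keywords_above_threshold(scores, threshold):
    """Keep keywords whose similarity to the basic keyword is strictly greater than the threshold."""
    return {kw for kw, score in scores.items() if score > threshold}


# Hypothetical similarity scores between the basic keyword and
# keywords extracted from black samples.
black_scores = {"loan": 0.91, "borrow": 0.74, "amount": 0.60, "weather": 0.12}
black_set = keywords_above_threshold(black_scores, threshold=0.60)
```

Note that, following the wording "larger than a threshold", a keyword whose score equals the threshold ("amount" here) is excluded in this sketch.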
In one embodiment, the threshold may be manually preset.
With the continuous development of computer technology, especially the progress of artificial intelligence, the threshold may also be calculated through machine learning. For example, an optimal threshold may be calculated by a machine learning algorithm.
Still further, the threshold may be calculated based on big data techniques. For example, if analysis of massive data finds that the threshold value N is used most often and gives the best effect, the threshold may be set to N in this embodiment.
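The specification does not name a particular learning algorithm. Purely as a stand-in, one simple data-driven way to pick a threshold is to sweep candidate values over labeled validation pairs and keep the value that classifies the most pairs correctly. The function `best_threshold`, the labeled data, and the selection criterion are all assumptions of this sketch.

```python
def best_threshold(labeled_scores):
    """Pick the candidate threshold that classifies the most labeled pairs correctly.

    labeled_scores: list of (similarity, is_similar) pairs (hypothetical
    validation data). A pair is predicted similar when its similarity is
    strictly greater than the threshold.
    """
    candidates = sorted({score for score, _ in labeled_scores})

    def correct(threshold):
        # Count pairs whose prediction agrees with the label.
        return sum((score > threshold) == label for score, label in labeled_scores)

    return max(candidates, key=correct)


# Hypothetical labeled similarity scores.
validation = [(0.9, True), (0.8, True), (0.6, False), (0.3, False)]
threshold = best_threshold(validation)
```

With this toy data, a threshold of 0.6 separates the two similar pairs from the two dissimilar ones, so it is the value selected.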
Step 130: calculate the intersection of the black sample keyword set and the white sample keyword set.
By calculating the intersection of the black sample keyword set and the white sample keyword set, the elements that appear in both the black sample keyword set and the white sample keyword set can be extracted.
For example, assume that the basic keyword is determined to be "early stage". The black sample keyword set similar to "early stage" determined from the black samples is { none, expense, need, loan, borrow, find, down, add, money, credit card, credit, above, amount };
the white sample keyword set similar to "early stage" determined from the white samples is { late stage, none, expense, do, down, high, add, cheat, money, people, want, present, suggestion, credit, investment, above, find, early stage };
then the intersection of the black sample keyword set and the white sample keyword set is calculated as { none, expense, find, down, add, money, credit, above }.
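The intersection in this example can be reproduced with ordinary set operations. The sketch below uses one consistent English rendering of each example keyword, which is an assumption about the translation (for instance, "expense" and "cost", or "credit" and "letter", are treated as the same keyword).

```python
# Black and white sample keyword sets from the "early stage" example.
black_set = {
    "none", "expense", "need", "loan", "borrow", "find", "down",
    "add", "money", "credit card", "credit", "above", "amount",
}
white_set = {
    "late stage", "none", "expense", "do", "down", "high", "add",
    "cheat", "money", "people", "want", "present", "suggestion",
    "credit", "investment", "above", "find", "early stage",
}

# Step 130: keywords that appear in both sets.
intersection = black_set & white_set
```

The result contains the eight keywords shared by both sets; these are the keywords that cannot distinguish bad information from normal information.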
Step 140: calculate the difference set of the intersection and the black sample keyword set, that is, the black sample keyword set minus the intersection.
By calculating this difference set, the elements that appear in both the black sample keyword set and the white sample keyword set are deleted from the black sample keyword set, so that the elements remaining in the difference set exist only in the black sample keyword set.
Continuing with the intersection { none, expense, find, down, add, money, credit, above } obtained in the previous step,
the difference set of the intersection and the black sample keyword set { none, expense, need, loan, borrow, find, down, add, money, credit card, credit, above, amount } is calculated as { need, loan, borrow, credit card, amount }.
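The difference set in this example can likewise be checked with set subtraction. As before, the keyword spellings are one consistent English rendering chosen for this sketch, which is an assumption about the translation.

```python
# Black sample keyword set and the intersection from the previous step.
black_set = {
    "none", "expense", "need", "loan", "borrow", "find", "down",
    "add", "money", "credit card", "credit", "above", "amount",
}
intersection = {"none", "expense", "find", "down", "add", "money", "credit", "above"}

# Step 140: remove the shared keywords from the black sample keyword set.
difference = black_set - intersection
```

Because removing the intersection from a set is the same as removing the other set, `black_set - white_set` would give the same result; the two-step form simply mirrors steps 130 and 140.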
Step 150: generate a keyword rule according to the difference set and the basic keyword.
In an embodiment, the step 150 generates a keyword rule according to the difference set and the basic keyword, and specifically includes:
determining a subset corresponding to each combination mode of the elements in the difference set;
and combining the basic keywords with each subset to obtain a keyword rule.
The elements in the difference set are enumerated in every possible combination to obtain the subset corresponding to each combination mode; each of these subsets can then be combined with the basic keyword into a keyword rule.
Continuing with the difference set { need, loan, borrow, credit card, amount } obtained in the previous step, first determine the subset corresponding to each combination mode of the elements in the difference set:
when the number of elements in the subset is 1, there are $C_5^1 = 5$ combination modes, and the subsets are: { need }, { loan }, { borrow }, { credit card }, and { amount };
when the number of elements in the subset is 2, there are $C_5^2 = 10$ combination modes, and the subsets are: { need ^ loan }, { need ^ borrow }, { need ^ credit card }, { need ^ amount }, { loan ^ borrow }, { loan ^ credit card }, { loan ^ amount }, { borrow ^ credit card }, { borrow ^ amount }, and { credit card ^ amount };
when the number of elements in the subset is 3, there are $C_5^3 = 10$ combination modes, and the subsets are: { need ^ loan ^ borrow }, { need ^ loan ^ credit card }, { need ^ loan ^ amount }, { need ^ borrow ^ credit card }, { need ^ borrow ^ amount }, { need ^ credit card ^ amount }, { loan ^ borrow ^ credit card }, { loan ^ borrow ^ amount }, { loan ^ credit card ^ amount }, and { borrow ^ credit card ^ amount };
when the number of elements in the subset is 4, there are $C_5^4 = 5$ combination modes, and the subsets are: { need ^ loan ^ borrow ^ credit card }, { need ^ loan ^ borrow ^ amount }, { need ^ loan ^ credit card ^ amount }, { need ^ borrow ^ credit card ^ amount }, and { loan ^ borrow ^ credit card ^ amount };
when the number of elements in the subset is 5, there is $C_5^5 = 1$ combination mode, and the subset is: { need ^ loan ^ borrow ^ credit card ^ amount }.
Finally, the basic keyword "early stage" is combined with each of the subsets to obtain the keyword rules:
{ early stage ^ need }, { early stage ^ loan }, { early stage ^ borrow }, { early stage ^ credit card }, { early stage ^ amount }, { early stage ^ need ^ loan }, { early stage ^ need ^ borrow }, { early stage ^ need ^ credit card }, { early stage ^ need ^ amount }, { early stage ^ loan ^ borrow }, { early stage ^ loan ^ credit card }, { early stage ^ loan ^ amount }, { early stage ^ borrow ^ credit card }, { early stage ^ borrow ^ amount }, { early stage ^ credit card ^ amount }, { early stage ^ need ^ loan ^ borrow }, { early stage ^ need ^ loan ^ credit card }, { early stage ^ need ^ loan ^ amount }, { early stage ^ need ^ borrow ^ credit card }, { early stage ^ need ^ borrow ^ amount }, { early stage ^ need ^ credit card ^ amount }, { early stage ^ loan ^ borrow ^ credit card }, { early stage ^ loan ^ borrow ^ amount }, { early stage ^ loan ^ credit card ^ amount }, { early stage ^ borrow ^ credit card ^ amount }, { early stage ^ need ^ loan ^ borrow ^ credit card }, { early stage ^ need ^ loan ^ borrow ^ amount }, { early stage ^ need ^ loan ^ credit card ^ amount }, { early stage ^ need ^ borrow ^ credit card ^ amount }, { early stage ^ loan ^ borrow ^ credit card ^ amount }, { early stage ^ need ^ loan ^ borrow ^ credit card ^ amount }.
Taking the keyword rule { early stage ^ need } as an example, when information to be detected contains both "early stage" and "need", the information can be determined to be bad information.
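The subset enumeration and rule combination of step 150 can be sketched with the standard library. The function name `generate_rules` and the tuple representation of a rule are assumptions for illustration; the keyword spellings follow one consistent English rendering of the example.

```python
from itertools import combinations


def generate_rules(base_keyword, difference_set):
    """Combine the basic keyword with every non-empty subset of the difference set.

    Each rule is returned as a tuple whose first item is the basic keyword.
    """
    elements = sorted(difference_set)  # fixed order for reproducible output
    rules = []
    for size in range(1, len(elements) + 1):
        for subset in combinations(elements, size):
            rules.append((base_keyword, *subset))
    return rules


rules = generate_rules(
    "early stage", {"need", "loan", "borrow", "credit card", "amount"}
)
```

For a difference set of five elements this yields 5 + 10 + 10 + 5 + 1 = 31 rules. The count grows as 2^k - 1 for k elements, so an implementation may wish to cap the subset size when the difference set is large; that cap is a design choice outside the specification.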
According to the embodiment of the specification, similar keywords are obtained from the black samples and the white samples based on the determined basic keyword, that is, a black sample keyword set and a white sample keyword set are determined. By calculating the intersection of the black sample keyword set and the white sample keyword set, the elements that appear in both keyword sets can be extracted. Then, by calculating the difference set of the intersection and the black sample keyword set, those shared elements are deleted from the black sample keyword set, so that the elements in the difference set exist only in the black sample keyword set. The generated keyword rules are therefore specific to the black samples, which improves the accuracy of the keyword rules, and the process is more efficient than manual construction.
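Taken together, steps 120 to 150 can be sketched end to end as below. This is a compact illustration under stated assumptions, not the implementation of the embodiment: it uses the threshold variant of step 120, takes precomputed similarity scores as input, and the function name `generate_keyword_rules` and all sample values are hypothetical.

```python
from itertools import combinations


def generate_keyword_rules(base_keyword, black_scores, white_scores, threshold):
    """Sketch of steps 120-150 using precomputed similarity scores."""
    # Step 120: keywords similar to the basic keyword in each sample class.
    black_set = {kw for kw, s in black_scores.items() if s > threshold}
    white_set = {kw for kw, s in white_scores.items() if s > threshold}
    # Step 130: keywords shared by both sets.
    intersection = black_set & white_set
    # Step 140: keywords that remain only in the black sample keyword set.
    difference = black_set - intersection
    # Step 150: combine the basic keyword with every non-empty subset.
    elements = sorted(difference)
    return [
        (base_keyword, *subset)
        for size in range(1, len(elements) + 1)
        for subset in combinations(elements, size)
    ]


# Hypothetical similarity scores.
black_scores = {"loan": 0.9, "money": 0.8, "amount": 0.7, "tree": 0.1}
white_scores = {"money": 0.85, "late stage": 0.9, "loan": 0.2}
rules = generate_keyword_rules("early stage", black_scores, white_scores, 0.5)
```

In this toy input, "money" is similar to the basic keyword in both sample classes and is therefore dropped, while "loan" and "amount" survive and produce three rules.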
Although the keyword rules generated by the above embodiment have eliminated the influence of the white samples, their accuracy has been improved only with respect to the black samples, and the generated rules have not yet been applied in practice, so they may still not achieve high accuracy. Therefore, on the basis of the embodiment shown in fig. 1, the method may further include the following steps:
performing keyword retrieval in the black and white sample according to the keyword rule;
counting the number of black samples hit by the keyword rule and the number of white samples hit by the keyword rule;
calculating the accuracy of the keyword rule according to the hit black sample number and the hit white sample number;
deleting the keyword rule if the accuracy does not exceed a threshold.
In this embodiment, keyword retrieval is performed in the black samples and the white samples according to each keyword rule, and the accuracy of the keyword rule is calculated from the distribution of the hit samples. Keyword rules with low accuracy can thus be deleted and only keyword rules with high accuracy retained, which further improves accuracy.
For example, taking the keyword rule { early stage ^ need }, assume that the number of hit black samples is N and the number of hit white samples is M; the accuracy is $X = N/(N + M)$. If X exceeds the threshold, the accuracy of the keyword rule is high and the keyword rule can be used; if X does not exceed the threshold, the accuracy of the keyword rule is not high and the keyword rule is not used.
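The accuracy check can be sketched as follows. The helper names, substring matching, the sample texts, and the choice to score a rule that hits no sample at all as 0.0 are assumptions of this sketch rather than details given in the specification.

```python
def rule_hits(rule, text):
    """A rule hits a sample when every keyword of the rule occurs in it."""
    return all(keyword in text for keyword in rule)


def rule_accuracy(rule, black_samples, white_samples):
    """Accuracy X = N / (N + M), with N black hits and M white hits."""
    n = sum(rule_hits(rule, text) for text in black_samples)
    m = sum(rule_hits(rule, text) for text in white_samples)
    if n + m == 0:
        return 0.0  # assumption: a rule that hits nothing is treated as inaccurate
    return n / (n + m)


def filter_rules(rules, black_samples, white_samples, threshold):
    """Keep only rules whose accuracy exceeds the threshold."""
    return [
        rule for rule in rules
        if rule_accuracy(rule, black_samples, white_samples) > threshold
    ]


# Hypothetical samples and rules.
black_samples = [
    "pay an early stage fee, you need it for the loan",
    "early stage deposit, need it today",
]
white_samples = ["early stage startups need funding", "the weather is nice"]
rules = [("early stage", "need"), ("early stage", "loan")]
kept = filter_rules(rules, black_samples, white_samples, threshold=0.8)
```

Here the rule { early stage ^ need } hits two black samples and one white sample, giving X = 2/3, so it is deleted at a threshold of 0.8, while { early stage ^ loan } hits only a black sample and is kept.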
Corresponding to the embodiment of the keyword rule generation method, the present specification also provides an embodiment of a keyword rule generation apparatus. The apparatus embodiments may be implemented by software, by hardware, or by a combination of hardware and software. Taking a software implementation as an example, the apparatus, as a logical apparatus, is formed by the processor of the device in which it is located reading the corresponding computer program instructions from the nonvolatile memory into the memory and running them. In terms of hardware, as shown in fig. 2, the device in which the keyword rule generating apparatus of this specification is located may include a processor, a network interface, a memory, and a nonvolatile memory; depending on the actual function of keyword rule generation, the device may also include other hardware, which is not described further here.
Referring to fig. 3, a block diagram of a keyword rule generating apparatus according to an embodiment of the present disclosure is shown, where the apparatus includes:
a first determining unit 210 that determines a basic keyword;
a second determining unit 220 that determines, according to the basic keyword, a black sample keyword set and a white sample keyword set similar to the basic keyword from the black samples and the white samples;
a first calculating unit 230 that calculates an intersection of the black sample keyword set and the white sample keyword set;
a second calculating unit 240 that calculates a difference set of the intersection and the black sample keyword set;
a generating unit 250 that generates a keyword rule according to the difference set and the basic keyword.
In an alternative embodiment:
the second determining unit 220 specifically includes:
the first calculating subunit calculates the similarity between the key words extracted from the black samples and the basic key words;
the second calculating subunit calculates the similarity between the key words extracted from the white samples and the basic key words;
the first determining subunit determines a preset number of keywords with the highest similarity value in the black sample as a black sample keyword set;
and the second determining subunit determines the preset number of keywords with the highest similarity value in the white samples as the white sample keyword set.
In an alternative embodiment:
the second determining unit 220 specifically includes:
the first calculating subunit calculates the similarity between the key words extracted from the black samples and the basic key words;
the second calculating subunit calculates the similarity between the key words extracted from the white samples and the basic key words;
the first determining subunit determines the keywords with the similarity values larger than a threshold value in the black sample as a black sample keyword set;
and the second determining subunit determines the keywords with the similarity values larger than the threshold value in the white samples as the white sample keyword set.
In an alternative embodiment:
the generating unit 250 specifically includes:
a third determining subunit that determines a subset corresponding to each combination mode of the elements in the difference set;
and a generating subunit that combines the basic keyword with each subset to obtain a keyword rule.
In an alternative embodiment:
the device further comprises:
the retrieval subunit is used for performing keyword retrieval in the black and white sample according to the keyword rule;
a counting subunit, for counting the number of black samples hit by the keyword rule and the number of white samples hit by the keyword rule;
the calculating subunit is used for calculating the accuracy of the keyword rule according to the hit black sample number and the hit white sample number;
and the deleting subunit deletes the keyword rule under the condition that the accuracy does not exceed a threshold value.
The systems, devices, modules or units illustrated in the above embodiments may be implemented by a computer chip or an entity, or by a product with certain functions. A typical implementation device is a computer, which may take the form of a personal computer, laptop computer, cellular telephone, camera phone, smart phone, personal digital assistant, media player, navigation device, email messaging device, game console, tablet computer, wearable device, or a combination of any of these devices.
The implementation process of the functions and actions of each unit in the above device is specifically described in the implementation process of the corresponding step in the above method, and is not described herein again.
For the device embodiments, since they substantially correspond to the method embodiments, reference may be made to the partial description of the method embodiments for relevant points. The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules can be selected according to actual needs to achieve the purpose of the solution in the specification. One of ordinary skill in the art can understand and implement it without inventive effort.
Fig. 3 above describes the internal functional modules and structure of the keyword rule generating apparatus. The actual execution subject may be an electronic device, including:
a processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to perform:
determining a basic keyword;
determining a black sample keyword set and a white sample keyword set which are similar to the basic keywords from a black sample and a white sample according to the basic keywords;
calculating the intersection of the black sample keyword set and the white sample keyword set;
calculating a difference set of the intersection and the black sample keyword set;
and generating a keyword rule according to the difference set and the basic keywords.
Optionally, the determining, according to the basic keyword, a black sample keyword set and a white sample keyword set similar to the basic keyword from a black sample and a white sample specifically includes:
calculating the similarity between the keywords extracted from the black sample and the basic keywords;
calculating the similarity between the keywords extracted from the white sample and the basic keywords;
determining a preset number of keywords with the highest similarity value in the black sample as a black sample keyword set;
and determining the preset number of keywords with the highest similarity value in the white samples as a white sample keyword set.
Optionally, the determining, according to the basic keyword, a black sample keyword set and a white sample keyword set similar to the basic keyword from a black sample and a white sample specifically includes:
calculating the similarity between the keywords extracted from the black sample and the basic keywords;
calculating the similarity between the keywords extracted from the white sample and the basic keywords;
determining keywords with similarity values larger than a threshold value in the black sample as a black sample keyword set;
and determining the keywords with similarity values larger than a threshold value in the white sample as a white sample keyword set.
Optionally, the generating a keyword rule according to the difference set and the basic keyword specifically includes:
determining a subset corresponding to each combination mode of the elements in the difference set;
and combining the basic keywords with each subset to obtain a keyword rule.
Optionally, the method further includes:
performing keyword retrieval in the black and white sample according to the keyword rule;
counting the number of black samples hit by the keyword rule and the number of white samples hit by the keyword rule;
calculating the accuracy of the keyword rule according to the hit black sample number and the hit white sample number;
deleting the keyword rule if the accuracy does not exceed a threshold.
In the above embodiments of the electronic device, it should be understood that the Processor may be a Central Processing Unit (CPU), other general-purpose processors, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), etc. The general-purpose processor may be a microprocessor, or the processor may be any conventional processor, and the aforementioned memory may be a read-only memory (ROM), a Random Access Memory (RAM), a flash memory, a hard disk, or a solid state disk. The steps of a method disclosed in connection with the embodiments of the present invention may be directly implemented by a hardware processor, or may be implemented by a combination of hardware and software modules in the processor.
The embodiments in the present specification are described in a progressive manner, and the same and similar parts among the embodiments are referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, for the embodiment of the electronic device, since it is substantially similar to the embodiment of the method, the description is simple, and for the relevant points, reference may be made to part of the description of the embodiment of the method.
Other embodiments of the present disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This specification is intended to cover any variations, uses, or adaptations of the specification following, in general, the principles of the specification and including such departures from the present disclosure as come within known or customary practice within the art to which the specification pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the specification being indicated by the following claims.
It will be understood that the present description is not limited to the precise arrangements described above and shown in the drawings, and that various modifications and changes may be made without departing from the scope thereof. The scope of the present description is limited only by the appended claims.