CN108563713B

CN108563713B - Keyword rule generation method and device and electronic equipment

Info

Publication number: CN108563713B
Application number: CN201810268866.9A
Authority: CN
Inventors: 周书恒
Original assignee: Advanced New Technologies Co Ltd
Current assignee: Advanced New Technologies Co Ltd; Advantageous New Technologies Co Ltd
Priority date: 2018-03-29
Filing date: 2018-03-29
Publication date: 2021-08-10
Anticipated expiration: 2038-03-29
Also published as: CN108563713A

Abstract

The embodiment of the specification provides a keyword rule generation method, a keyword rule generation device and electronic equipment, wherein the method comprises the following steps: determining a basic keyword; determining a black sample keyword set and a white sample keyword set which are similar to the basic keywords from a black sample and a white sample according to the basic keywords; calculating the intersection of the black sample keyword set and the white sample keyword set; calculating a difference set of the intersection and the black sample keyword set; and generating a keyword rule according to the difference set and the basic keywords.

Description

Keyword rule generation method and device and electronic equipment

Technical Field

The embodiment of the specification relates to the technical field of internet, in particular to a keyword rule generation method and device and electronic equipment.

Background

Every day the internet generates a huge amount of information in various forms, such as text, pictures, video, and audio. This information is of uneven quality. Some of it may be illegal; some of it may violate platform rules, such as the endless and varied stream of advertisements. Generally, such information may be collectively referred to as bad information.

In order to keep the internet environment clean and improve users' experience on the internet, bad information needs to be identified and processed. Generally, bad information can be handled with keyword rules: when all the keywords in a keyword rule appear in a piece of generated information, that information is treated as bad information and is blocked or deleted. Existing keyword rules are mainly added either by automatically mining keywords or by manually adding keywords. Automatic mining is fast but suffers from low accuracy, whereas manual addition is accurate but inefficient.

It is desirable to provide a keyword rule generation scheme that is both accurate and efficient.

Disclosure of Invention

The embodiment of the specification provides a keyword rule generation method and device and an electronic device:

according to a first aspect of embodiments of the present specification, there is provided a keyword rule generation method, including:

determining a basic keyword;

determining a black sample keyword set and a white sample keyword set which are similar to the basic keywords from a black sample and a white sample according to the basic keywords;

calculating the intersection of the black sample keyword set and the white sample keyword set;

calculating a difference set of the intersection and the black sample keyword set;

and generating a keyword rule according to the difference set and the basic keywords.

According to a second aspect of embodiments of the present specification, there is provided a keyword rule generating apparatus, the apparatus including:

a first determination unit that determines a basic keyword;

the second determining unit is used for determining a black sample keyword set and a white sample keyword set which are similar to the basic keywords from the black samples and the white samples according to the basic keywords;

the first calculation unit is used for calculating the intersection of the black sample keyword set and the white sample keyword set;

the second calculation unit is used for calculating a difference set of the intersection and the black sample keyword set;

and the generating unit is used for generating a keyword rule according to the difference set and the basic keyword.

According to a third aspect of embodiments herein, there is provided an electronic apparatus comprising:

a processor;

a memory for storing processor-executable instructions;

wherein the processor is configured to implement any of the keyword rule generating methods described above.

In the embodiment of the specification, similar keywords are obtained from a black and white sample based on the determined basic keywords, namely a black sample keyword set and a white sample keyword set are determined; by calculating the intersection of the black sample keyword set and the white sample keyword set, elements in the keyword set which are positioned in the black and white sample at the same time can be extracted; then, by calculating a difference set between the intersection and the black sample keyword set, the elements in the keyword set located in the black and white sample at the same time can be deleted from the black sample keyword set, and the elements in the difference set exist only in the black sample. The generated keyword rule is specially used for the black sample, and the accuracy of the keyword rule is improved.

Drawings

Fig. 1 is a flowchart of a keyword rule generation method provided in an embodiment of the present specification;

fig. 2 is a hardware configuration diagram of a keyword rule generating apparatus provided in an embodiment of the present specification;

fig. 3 is a schematic block diagram of a keyword rule generating apparatus according to an embodiment of the present disclosure.

Detailed Description

Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The embodiments described in the following exemplary embodiments do not represent all embodiments consistent with the present specification. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the specification, as detailed in the appended claims.

The terminology used in the description herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the description. As used in this specification and the appended claims, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any and all possible combinations of one or more of the associated listed items.

It should be understood that although the terms first, second, third, etc. may be used herein to describe various information, such information should not be limited by these terms. These terms are only used to distinguish one type of information from another. For example, the first information may also be referred to as second information, and similarly, the second information may also be referred to as first information, without departing from the scope of the present specification. The word "if" as used herein may be interpreted as "upon …", "when …", or "in response to a determination", depending on the context.

As mentioned above, the addition of existing keyword rules is mainly by means of automatically mining keywords or manually adding keywords.

For automatic mining of keywords: keywords are extracted from black samples by a keyword extraction algorithm such as TextRank, TF-IDF (Term Frequency-Inverse Document Frequency), or LDA (Latent Dirichlet Allocation). However, keywords extracted in this way may also occur in large numbers in the white samples.

In other words, when keyword rules are generated from keywords extracted in this way, normal information is easily treated as bad information in practice, which harms the user experience. For example, for cash-out (fund withdrawal) information, suppose the keyword extracted from the black samples is "credit card". Because "credit card" also appears frequently in normal information, a keyword rule built on "credit card" alone would also flag normal information containing "credit card" as bad information.
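The weakness described above can be illustrated with a minimal pure-Python TF-IDF sketch. This is not the specification's implementation: the function name `top_tfidf_keywords`, the smoothed IDF variant, and the toy pre-tokenized samples are all assumptions made for illustration.

```python
import math
from collections import Counter

def top_tfidf_keywords(docs, k=3):
    """Return the k terms with the highest summed TF-IDF score.

    docs is a list of token lists; tokenization is assumed to be done already.
    """
    n_docs = len(docs)
    # Document frequency: the number of documents in which each term occurs.
    df = Counter(term for doc in docs for term in set(doc))
    scores = Counter()
    for doc in docs:
        tf = Counter(doc)
        for term, count in tf.items():
            # Smoothed IDF, so terms present in every document still score above 0.
            idf = math.log((1 + n_docs) / (1 + df[term])) + 1
            scores[term] += (count / len(doc)) * idf
    return [term for term, _ in scores.most_common(k)]

# Hypothetical black (bad) and white (normal) samples.
black_docs = [
    ["credit card", "cash", "out", "fast"],
    ["credit card", "cash", "out", "fee"],
    ["credit card", "limit", "cash"],
]
white_docs = [
    ["pay", "bill", "credit card"],
    ["apply", "credit card", "online"],
]

mined = top_tfidf_keywords(black_docs, k=2)
white_vocab = {term for doc in white_docs for term in doc}
# Mined keywords that would also hit normal information.
also_in_white = [term for term in mined if term in white_vocab]
print(mined, also_in_white)
```

In this toy data, "credit card" is mined from the black samples yet also occurs in the white samples, which is the false-positive risk the paragraph describes.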

For manual addition of keywords: technical or operations staff construct keywords from their professional knowledge and accumulated experience, and then combine the keywords into keyword rules. The manual approach is not only inefficient; because of individual limitations, it also cannot comprehensively perceive and control the bad content of the whole internet. As a result, the keyword rules cover only part of the bad information.

A keyword rule in the present specification consists of at least one keyword. For example, for cash-out (fund withdrawal) information, the following keyword rules may exist: Huabei ^ instant reply, Huabei ^ WeChat, Baitiao ^ bank card. Here, Huabei ^ instant reply means that if the two keywords "Huabei" and "instant reply" both appear in a piece of text information, that text information is high risk.
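The rule semantics described above (a rule hits only when all of its keywords are present) can be sketched as follows. The helper names `parse_rule` and `rule_hits`, the use of plain substring matching, and the sample texts are assumptions for illustration; the specification does not prescribe a matching procedure.

```python
def parse_rule(rule):
    """Split a rule written as 'A ^ B' into its list of keywords."""
    return [kw.strip() for kw in rule.split("^") if kw.strip()]

def rule_hits(rule, text):
    """Return True only when every keyword of the rule occurs in the text."""
    keywords = parse_rule(rule)
    # An empty rule is treated as hitting nothing rather than everything.
    return bool(keywords) and all(kw in text for kw in keywords)

rule = "Baitiao ^ bank card"
print(rule_hits(rule, "Baitiao cash out straight to your bank card"))
print(rule_hits(rule, "my new bank card arrived today"))
```

The first text contains both keywords and is hit; the second contains only one and is not.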

It should be noted that, in this specification, keyword rules are mainly applied to text information; multimedia information such as picture information, video information, and audio information requires preprocessing. Specifically, the text in picture information and video information may be recognized by an image recognition technique such as OCR (Optical Character Recognition), and the keyword rules are then applied to the recognized text. Audio information may be converted into text by a speech recognition technique, and the keyword rules are then applied to the recognized text.

An embodiment of a method for generating a keyword rule according to the present disclosure may be described below with reference to an example shown in fig. 1, where the method may include the following steps:

Step 110: determine a basic keyword.

In one embodiment, the basic keyword may be an existing keyword or a keyword that is manually input.

In one embodiment, the basic keyword may also be automatically determined, for example, the basic keyword may be automatically crawled from a network or obtained from a database storing the basic keyword. For another example, the keywords obtained by automatically mining the keywords may be used as the basic keywords.

Step 120: determine, according to the basic keyword, a black sample keyword set and a white sample keyword set that are similar to the basic keyword from the black samples and the white samples.

In one embodiment, the black sample may refer to bad information that has been identified. The white sample may be normal information (non-objectionable information) that has been identified.

In one embodiment, keywords may be extracted from the black and white samples using a keyword extraction technique, and a similarity algorithm may then be used to calculate the similarity between the basic keyword and each extracted keyword. Specifically, for text information, text similarity algorithms such as the SimHash algorithm, the Jaccard similarity algorithm, and the cosine similarity algorithm may be employed. The keyword extraction techniques may include, for example, syntactic analysis algorithms. Since keyword extraction algorithms and similarity algorithms are in common use in the field, they are not described further in this specification.
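As a rough illustration of two of the similarity measures named above, the following sketch computes Jaccard and cosine similarity between two keywords. Representing each keyword by its characters is an assumption made here for brevity; the specification does not state how keywords are represented, and a production system might use word vectors or n-grams instead.

```python
import math
from collections import Counter

def jaccard_similarity(a, b):
    """Jaccard similarity: |A ∩ B| / |A ∪ B| over the character sets of a and b."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:
        return 1.0  # two empty strings are treated as identical
    return len(set_a & set_b) / len(union)

def cosine_similarity(a, b):
    """Cosine similarity between the character-count vectors of a and b."""
    vec_a, vec_b = Counter(a), Counter(b)
    dot = sum(count * vec_b[ch] for ch, count in vec_a.items())
    norm_a = math.sqrt(sum(v * v for v in vec_a.values()))
    norm_b = math.sqrt(sum(v * v for v in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

print(jaccard_similarity("loan", "loans"))
print(cosine_similarity("loan", "loans"))
```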

In an embodiment, the step 120 determines a black sample keyword set and a white sample keyword set similar to the basic keyword from the black sample and the white sample according to the basic keyword, and specifically includes:

calculating the similarity between the keywords extracted from the black sample and the basic keywords;

calculating the similarity between the keywords extracted from the white sample and the basic keywords;

determining a preset number of keywords with the highest similarity value in the black sample as a black sample keyword set;

and determining the preset number of keywords with the highest similarity value in the white samples as a white sample keyword set.

In this embodiment, the preset number may be an empirical value set in advance. As described above, the similarity value between each keyword extracted from the black samples and the basic keyword can be calculated by the similarity algorithm; likewise, the similarity value between each keyword extracted from the white samples and the basic keyword can be calculated by the similarity algorithm. Generally, the higher the similarity value, the more similar the keyword is to the basic keyword. Therefore, the preset number of keywords with the highest similarity values in the black samples can be determined as the black sample keyword set, and the preset number of keywords with the highest similarity values in the white samples can be determined as the white sample keyword set.

In another embodiment, step 120 specifically includes:

determining keywords with similarity values larger than a threshold value in the black sample as a black sample keyword set;

and determining the keywords with similarity values larger than a threshold value in the white sample as a white sample keyword set.

In this embodiment, as described above, the similarity value between each keyword extracted from the black samples and the basic keyword can be calculated by the similarity algorithm; likewise, the similarity value between each keyword extracted from the white samples and the basic keyword can be calculated by the similarity algorithm. Generally, the higher the similarity value, the more similar the keyword is to the basic keyword. Therefore, the keywords in the black samples whose similarity values are larger than the threshold can be determined as the black sample keyword set, and the keywords in the white samples whose similarity values are larger than the threshold can be determined as the white sample keyword set.
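The two selection strategies described above (a preset number of most similar keywords, or all keywords above a similarity threshold) can be sketched as follows. The function names, the character-level Jaccard stand-in `char_jaccard`, and the toy candidate list are assumptions for illustration only.

```python
import heapq

def char_jaccard(a, b):
    """Character-set Jaccard similarity, used here only as a stand-in measure."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 1.0

def top_k_keywords(base, candidates, k, similarity=char_jaccard):
    """Keep the k candidates most similar to the basic keyword."""
    return set(heapq.nlargest(k, candidates, key=lambda kw: similarity(base, kw)))

def keywords_above_threshold(base, candidates, threshold, similarity=char_jaccard):
    """Keep every candidate whose similarity is strictly greater than the threshold."""
    return {kw for kw in candidates if similarity(base, kw) > threshold}

# Hypothetical keywords extracted from one sample class (black or white).
candidates = ["early", "earl", "late", "xyz"]
by_count = top_k_keywords("early", candidates, k=2)
by_threshold = keywords_above_threshold("early", candidates, threshold=0.5)
print(by_count, by_threshold)
```

The same function would be called once on keywords extracted from the black samples and once on keywords extracted from the white samples to obtain the two keyword sets.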

In one embodiment, the threshold may be manually preset.

With the continuous development of computer technology, and especially the progress of artificial intelligence, the threshold may also be calculated through machine learning. For example, an optimal threshold may be calculated by a machine learning algorithm.

Further, the threshold may be calculated based on big data techniques. For example, if analysis of massive data shows that a threshold of N is used most often and gives the best results, the threshold may be set to N in this embodiment.

Step 130: calculate the intersection of the black sample keyword set and the white sample keyword set.

By calculating the intersection of the black sample keyword set and the white sample keyword set, the elements that appear in both the black and white sample keyword sets can be extracted.

For example, assume that the basic keyword is determined to be "early stage", and that the black sample keyword set similar to "early stage" determined from the black samples is { none, expense, need, loan, borrow, find, down, add, money, credit card, credit, above, amount };

the white sample keyword set similar to "early stage" determined from the white samples is { late stage, none, expense, do, down, high, add, cheat, money, people, want, present, suggestion, credit, investment, above, find, early stage };

then the intersection of the black sample keyword set and the white sample keyword set is calculated as: { none, expense, find, down, add, money, credit, above }.

Step 140: calculate the difference set of the intersection and the black sample keyword set.

By calculating the difference set of the intersection and the black sample keyword set, the elements that appear in both the black and white sample keyword sets can be deleted from the black sample keyword set, so that the elements of the difference set exist only in the black samples.

Continuing with the intersection { none, expense, find, down, add, money, credit, above } obtained in the previous step;

the difference set between the intersection and the black sample keyword set { none, expense, need, loan, borrow, find, down, add, money, credit card, credit, above, amount } is calculated as { need, loan, borrow, credit card, amount }.
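Steps 130 and 140 map directly onto set operations. The sketch below reruns the worked example with Python sets; the variable names are illustrative, and the keyword strings are normalized English renderings of the example terms (for instance, one spelling is used for each term across both sets).

```python
black_set = {
    "none", "expense", "need", "loan", "borrow", "find", "down",
    "add", "money", "credit card", "credit", "above", "amount",
}
white_set = {
    "late stage", "none", "expense", "do", "down", "high", "add", "cheat",
    "money", "people", "want", "present", "suggestion", "credit",
    "investment", "above", "find", "early stage",
}

# Step 130: keywords that occur in both the black and white keyword sets.
intersection = black_set & white_set

# Step 140: remove the shared keywords from the black keyword set.
difference = black_set - intersection

print(sorted(intersection))
print(sorted(difference))
```

Note that `black_set - intersection` yields the same result as `black_set - white_set`; the two-step form is kept here to mirror the steps of the method.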

Step 150: generate a keyword rule according to the difference set and the basic keyword.

In an embodiment, the step 150 generates a keyword rule according to the difference set and the basic keyword, and specifically includes:

determining a subset corresponding to each combination mode of the elements in the difference set;

and combining the basic keywords with each subset to obtain a keyword rule.

The subsets corresponding to each combination mode are obtained by enumerating the combinations of the elements in the difference set; the elements in each subset can then be combined with the basic keyword into a keyword rule.

Continuing with the difference set { need, loan, borrow, credit card, amount } obtained in the previous step, first determine the subsets corresponding to each combination mode of the elements in the difference set:

When the number of elements in the subset is 1, there are $C_5^1 = 5$ combinations, and the subsets are: { need }, { loan }, { borrow }, { credit card }, and { amount }.

When the number of elements in the subset is 2, there are $C_5^2 = 10$ combinations, and the subsets are: { need ^ loan }, { need ^ borrow }, { need ^ credit card }, { need ^ amount }, { loan ^ borrow }, { loan ^ credit card }, { loan ^ amount }, { borrow ^ credit card }, { borrow ^ amount }, and { credit card ^ amount }.

When the number of elements in the subset is 3, there are $C_5^3 = 10$ combinations, and the subsets are: { need ^ loan ^ borrow }, { need ^ loan ^ credit card }, { need ^ loan ^ amount }, { need ^ borrow ^ credit card }, { need ^ borrow ^ amount }, { need ^ credit card ^ amount }, { loan ^ borrow ^ credit card }, { loan ^ borrow ^ amount }, { loan ^ credit card ^ amount }, and { borrow ^ credit card ^ amount }.

When the number of elements in the subset is 4, there are $C_5^4 = 5$ combinations, and the subsets are: { need ^ loan ^ borrow ^ credit card }, { need ^ loan ^ borrow ^ amount }, { need ^ loan ^ credit card ^ amount }, { need ^ borrow ^ credit card ^ amount }, and { loan ^ borrow ^ credit card ^ amount }.

When the number of elements in the subset is 5, there is $C_5^5 = 1$ combination, and the subset is: { need ^ loan ^ borrow ^ credit card ^ amount }.

Finally, combining the basic keyword 'early stage' with the subsets to obtain a keyword rule:

{ early stage ^ need }, { early stage ^ loan }, { early stage ^ borrow }, { early stage ^ credit card }, { early stage ^ amount }, { early stage ^ need ^ loan }, { early stage ^ need ^ borrow }, { early stage ^ need ^ credit card }, { early stage ^ need ^ amount }, { early stage ^ loan ^ borrow }, { early stage ^ loan ^ credit card }, { early stage ^ loan ^ amount }, { early stage ^ borrow ^ credit card }, { early stage ^ borrow ^ amount }, { early stage ^ credit card ^ amount }, { early stage ^ need ^ loan ^ borrow }, { early stage ^ need ^ loan ^ credit card }, { early stage ^ need ^ loan ^ amount }, { early stage ^ need ^ borrow ^ credit card }, { early stage ^ need ^ borrow ^ amount }, { early stage ^ need ^ credit card ^ amount }, { early stage ^ loan ^ borrow ^ credit card }, { early stage ^ loan ^ borrow ^ amount }, { early stage ^ loan ^ credit card ^ amount }, { early stage ^ borrow ^ credit card ^ amount }, { early stage ^ need ^ loan ^ borrow ^ credit card }, { early stage ^ need ^ loan ^ borrow ^ amount }, { early stage ^ need ^ loan ^ credit card ^ amount }, { early stage ^ need ^ borrow ^ credit card ^ amount }, { early stage ^ loan ^ borrow ^ credit card ^ amount }, { early stage ^ need ^ loan ^ borrow ^ credit card ^ amount }.
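As a check on the enumeration, the number of keyword rules equals the number of non-empty subsets of the five-element difference set:

```latex
\sum_{k=1}^{5} C_5^k = 5 + 10 + 10 + 5 + 1 = 2^5 - 1 = 31
```

so this example yields 31 keyword rules in total.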

Taking the keyword rule { early stage ^ need } as an example, when the information to be detected contains both "early stage" and "need", the information can be determined to be bad information.
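Step 150 can be sketched with `itertools.combinations`, which enumerates the subsets of each size. The function name `generate_rules` and the choice to represent a rule as a tuple of keywords are assumptions for illustration, not the specification's data format.

```python
from itertools import combinations

def generate_rules(base_keyword, difference_set):
    """Combine the basic keyword with every non-empty subset of the difference set.

    Each rule is returned as a tuple whose first element is the basic keyword.
    """
    elements = sorted(difference_set)  # sorted only to make the output order stable
    rules = []
    for size in range(1, len(elements) + 1):
        for subset in combinations(elements, size):
            rules.append((base_keyword,) + subset)
    return rules

rules = generate_rules(
    "early stage", {"need", "loan", "borrow", "credit card", "amount"}
)
print(len(rules))
for rule in rules[:3]:
    print(" ^ ".join(rule))
```

Because a difference set with n elements produces 2^n - 1 rules, the rule count grows quickly with n; an implementation might cap the subset size, although the specification itself does not describe such a limit.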

According to the embodiment of the specification, similar keywords are obtained in a black and white sample based on the determined basic keywords, namely a black sample keyword set and a white sample keyword set are determined; by calculating the intersection of the black sample keyword set and the white sample keyword set, elements in the keyword set which are positioned in the black and white sample at the same time can be extracted; then, by calculating a difference set between the intersection and the black sample keyword set, the elements in the keyword set located in the black and white sample at the same time can be deleted from the black sample keyword set, and the elements in the difference set exist only in the black sample. The generated keyword rule is specially used for the black sample, so that the accuracy of the keyword rule is improved; and the efficiency is higher than that of manual work.

Although the keyword rules generated by the above embodiment exclude the influence of the white samples, and their accuracy with respect to the black samples is thereby improved, the generated rules have not yet been applied in practice and may still fall short of high accuracy. Therefore, on the basis of the embodiment shown in fig. 1, the method may further include the following steps:

performing keyword retrieval in the black and white sample according to the keyword rule;

counting the number of black samples hit by the keyword rule and the number of white samples hit by the keyword rule;

calculating the accuracy of the keyword rule according to the hit black sample number and the hit white sample number;

deleting the keyword rule if the accuracy does not exceed a threshold.

In the embodiment, the keyword rules are searched in the black and white samples, and the accuracy of the keyword rules is calculated according to the distribution condition of the hit samples, so that the keyword rules with low accuracy can be deleted, only the keyword rules with high accuracy are reserved, and the accuracy is further improved.

For example, taking the keyword rule { early stage ^ need }, assume that the number of hit black samples is N and the number of hit white samples is M; the accuracy is X = N/(N + M). If X exceeds the threshold, the accuracy of the keyword rule is high and the rule can be used; if X does not exceed the threshold, the accuracy of the keyword rule is low and the rule is not used.
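The accuracy check above can be sketched as follows. The function names, substring matching, the sample texts, and the threshold of 0.8 are assumptions for illustration. The handling of a rule that hits no sample at all (treated here as having no measurable accuracy and therefore dropped) is also an assumption, since the specification does not address that case.

```python
def rule_matches(rule, text):
    """A rule (tuple of keywords) hits a text when every keyword occurs in it."""
    return all(kw in text for kw in rule)

def rule_accuracy(rule, black_samples, white_samples):
    """Return X = N / (N + M), or None when the rule hits no sample."""
    n = sum(rule_matches(rule, text) for text in black_samples)  # hit black samples
    m = sum(rule_matches(rule, text) for text in white_samples)  # hit white samples
    if n + m == 0:
        return None
    return n / (n + m)

def filter_rules(rules, black_samples, white_samples, threshold):
    """Keep only the rules whose accuracy exceeds the threshold."""
    kept = []
    for rule in rules:
        accuracy = rule_accuracy(rule, black_samples, white_samples)
        if accuracy is not None and accuracy > threshold:
            kept.append(rule)
    return kept

# Hypothetical labeled samples.
black_samples = [
    "early stage fee needed before loan",
    "need to pay early stage deposit",
    "loan approved fast",
]
white_samples = [
    "we need an early stage investment",
    "late stage review",
]
rules = [("early stage", "need"), ("early stage", "loan")]

need_accuracy = rule_accuracy(rules[0], black_samples, white_samples)
kept_rules = filter_rules(rules, black_samples, white_samples, threshold=0.8)
print(need_accuracy, kept_rules)
```

In this toy data, the rule { early stage ^ need } hits two black samples and one white sample, giving X = 2/3, so it is deleted at a threshold of 0.8, while { early stage ^ loan } hits only a black sample and is kept.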

Corresponding to the foregoing embodiments of the keyword rule generation method, this specification also provides embodiments of a keyword rule generation apparatus. The apparatus embodiments may be implemented by software, by hardware, or by a combination of hardware and software. Taking a software implementation as an example, the apparatus, as a logical device, is formed when the processor of the device on which it resides reads the corresponding computer program instructions from nonvolatile memory into memory and runs them. In terms of hardware, fig. 2 shows a hardware structure of the device on which the keyword rule generating apparatus of this specification resides. In addition to the processor, network interface, memory, and nonvolatile memory shown in fig. 2, the device may include other hardware depending on the actual function of keyword rule generation, which is not described again.

Referring to fig. 3, a block diagram of a keyword rule generating apparatus according to an embodiment of the present disclosure is shown, where the apparatus includes:

a first determining unit 210 that determines a basic keyword;

a second determining unit 220 for determining a black sample keyword set and a white sample keyword set similar to the basic keyword from the black sample and the white sample according to the basic keyword;

a first calculating unit 230 that calculates an intersection of the black sample keyword set and the white sample keyword set;

a second calculating unit 240 that calculates a difference set between the intersection and the black sample keyword set;

the generating unit 250 generates a keyword rule according to the difference set and the basic keyword.

In an alternative embodiment:

the second determining unit 220 specifically includes:

the first calculating subunit calculates the similarity between the key words extracted from the black samples and the basic key words;

the second calculating subunit calculates the similarity between the key words extracted from the white samples and the basic key words;

the first determining subunit determines a preset number of keywords with the highest similarity value in the black sample as a black sample keyword set;

and the second determining subunit determines the preset number of keywords with the highest similarity value in the white samples as the white sample keyword set.

In an alternative embodiment:

the second determining unit 220 specifically includes:

the first determining subunit determines the keywords with the similarity values larger than a threshold value in the black sample as a black sample keyword set;

and the second determining subunit determines the keywords with the similarity values larger than the threshold value in the white samples as the white sample keyword set.

In an alternative embodiment:

the generating unit 250 specifically includes:

a third determining subunit, configured to determine a subset corresponding to each combination manner of the elements in the difference set;

and a generating subunit, configured to combine the basic keyword with each subset to obtain a keyword rule.

In an alternative embodiment:

the device further comprises:

the retrieval subunit is used for performing keyword retrieval in the black and white sample according to the keyword rule;

a counting subunit, for counting the number of black samples hit by the keyword rule and the number of white samples hit by the keyword rule;

the calculating subunit is used for calculating the accuracy of the keyword rule according to the hit black sample number and the hit white sample number;

and the deleting subunit deletes the keyword rule under the condition that the accuracy does not exceed a threshold value.

The systems, devices, modules or units illustrated in the above embodiments may be implemented by a computer chip or an entity, or by a product with certain functions. A typical implementation device is a computer, which may take the form of a personal computer, laptop computer, cellular telephone, camera phone, smart phone, personal digital assistant, media player, navigation device, email messaging device, game console, tablet computer, wearable device, or a combination of any of these devices.

The implementation process of the functions and actions of each unit in the above device is specifically described in the implementation process of the corresponding step in the above method, and is not described herein again.

For the device embodiments, since they substantially correspond to the method embodiments, reference may be made to the partial description of the method embodiments for relevant points. The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules can be selected according to actual needs to achieve the purpose of the solution in the specification. One of ordinary skill in the art can understand and implement it without inventive effort.

Fig. 3 above describes the internal functional modules and the structural schematic of the keyword rule generating apparatus, and the substantial execution subject may be an electronic device, including:

a processor;

a memory for storing processor-executable instructions;

determining a basic keyword;

Optionally, the determining, according to the basic keyword, a black sample keyword set and a white sample keyword set similar to the basic keyword from a black sample and a white sample specifically includes:

Optionally, the generating a keyword rule according to the difference set and the basic keyword specifically includes:

and combining the basic keywords with each subset to obtain a keyword rule.

Optionally, the method further includes:

deleting the keyword rule if the accuracy does not exceed a threshold.

In the above embodiments of the electronic device, it should be understood that the Processor may be a Central Processing Unit (CPU), other general-purpose processors, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), etc. The general-purpose processor may be a microprocessor, or the processor may be any conventional processor, and the aforementioned memory may be a read-only memory (ROM), a Random Access Memory (RAM), a flash memory, a hard disk, or a solid state disk. The steps of a method disclosed in connection with the embodiments of the present invention may be directly implemented by a hardware processor, or may be implemented by a combination of hardware and software modules in the processor.

The embodiments in the present specification are described in a progressive manner, and the same and similar parts among the embodiments are referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, for the embodiment of the electronic device, since it is substantially similar to the embodiment of the method, the description is simple, and for the relevant points, reference may be made to part of the description of the embodiment of the method.

Other embodiments of the present disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This specification is intended to cover any variations, uses, or adaptations of the specification following, in general, the principles of the specification and including such departures from the present disclosure as come within known or customary practice within the art to which the specification pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the specification being indicated by the following claims.

It will be understood that the present description is not limited to the precise arrangements described above and shown in the drawings, and that various modifications and changes may be made without departing from the scope thereof. The scope of the present description is limited only by the appended claims.

Claims

1. A method of keyword rule generation, the method comprising:

determining a basic keyword;

determining a subset corresponding to each combination mode of the elements in the difference set; and combining the basic keywords with each subset to obtain a keyword rule.

2. The method according to claim 1, wherein the determining a black sample keyword set and a white sample keyword set similar to the basic keyword from a black sample and a white sample according to the basic keyword specifically comprises:

3. The method according to claim 1, wherein the determining a black sample keyword set and a white sample keyword set similar to the basic keyword from a black sample and a white sample according to the basic keyword specifically comprises:

4. The method of claim 1, further comprising:

performing keyword retrieval in a black and white sample according to the keyword rule;

and deleting the keyword rule under the condition that the accuracy does not exceed a threshold value.

5. A keyword rule generating apparatus, the apparatus comprising:

a first determination unit that determines a basic keyword;

the generating unit is used for determining a subset corresponding to each combination mode of the elements in the difference set; and combining the basic keywords with each subset to obtain a keyword rule.

6. The apparatus according to claim 5, wherein the second determining unit specifically includes:

7. The apparatus according to claim 5, wherein the second determining unit specifically includes:

8. The apparatus of claim 5, the apparatus further comprising:

9. An electronic device, comprising:

a processor;

a memory for storing processor-executable instructions;

wherein the processor is configured to implement the method of any of the preceding claims 1-4.