CN113128209A

CN113128209A - Method and device for generating word stock

Info

Publication number: CN113128209A
Application number: CN202110437047.4A
Authority: CN
Inventors: 杨德将; 李原; 郝萌
Original assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Current assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Priority date: 2021-04-22
Filing date: 2021-04-22
Publication date: 2021-07-16
Anticipated expiration: 2041-04-22
Also published as: CN113128209B

Abstract

The present disclosure relates to the field of computers, and in particular to the field of data processing technologies, and provides a method and an apparatus for generating a thesaurus, an electronic device, a computer-readable storage medium, and a computer program product. A specific implementation is as follows: acquiring initial risk words; expanding the initial risk words to obtain expanded risk words; determining keyword information based on the initial risk words and the expanded risk words; and generating a target word bank based on each keyword in the keyword information. This implementation can improve the accuracy of the word bank.

Description

Method and device for generating word stock

Technical Field

The present disclosure relates to the field of computers, and further relates to the field of data processing technologies, and in particular, to a method and an apparatus for generating a thesaurus, an electronic device, a computer-readable storage medium, and a computer program product.

Background

At present, risks such as insurance fraud are common in the insurance industry, and risk control is required to reduce them.

In the risk control process, a risk word bank is usually set up, and the risk words in it are matched against insurance policies so that risky policies can be screened in advance for early risk warning. In practice, existing risk word banks are usually built up through manual accumulation and manual selection of entries, and their accuracy has been found to be poor.

Disclosure of Invention

The present disclosure provides a method and apparatus for generating a thesaurus, an electronic device, a computer-readable storage medium, and a computer program product.

According to a first aspect, there is provided a method for generating a thesaurus, comprising: acquiring an initial risk word; expanding the initial risk words to obtain expanded risk words; determining keyword information based on the initial risk words and the expanded risk words; and generating a target word bank based on each keyword in the keyword information.

According to a second aspect, there is provided a method for risk detection, comprising: generating a target word stock based on any one of the above methods for generating a word stock; and carrying out risk detection on the target object based on the target word bank.

According to a third aspect, there is provided an apparatus for generating a thesaurus, comprising: a risk word obtaining unit configured to obtain an initial risk word; the risk word expansion unit is configured to expand the initial risk words to obtain expanded risk words; an information determination unit configured to determine keyword information based on the initial risk word and the expanded risk word; a thesaurus generating unit configured to generate a target thesaurus based on each keyword in the keyword information.

According to a fourth aspect, a risk detection apparatus is provided, which includes the apparatus for generating a thesaurus and a risk detection unit, wherein the risk detection unit is configured to perform risk detection on a target object based on the target thesaurus generated by the apparatus for generating a thesaurus.

According to a fifth aspect, there is provided an electronic device comprising: one or more processors; and a memory for storing one or more programs which, when executed by the one or more processors, cause the one or more processors to implement any one of the above methods.

According to a sixth aspect, there is provided a non-transitory computer readable storage medium having stored thereon computer instructions for causing a computer to perform the method of any one of the above.

According to a seventh aspect, there is provided a computer program product comprising a computer program which, when executed by a processor, implements a method as any one of the above.

According to the technology of the present disclosure, a method for generating a word stock is provided. The initial risk words are expanded to obtain expanded risk words, which broadens the coverage of the risk words. Keyword information is then determined based on the initial risk words and the expanded risk words, so that further information is extracted from the risk words. Finally, a target word stock is generated based on each keyword in the keyword information. The resulting target word stock reflects each keyword in the initial risk words and the expanded risk words; compared with a manually set word stock, it contains more, and more accurate, information, which can improve the accuracy of the word stock.

It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present disclosure, nor do they limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.

Drawings

The drawings are included to provide a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:

FIG. 1 is an exemplary system architecture diagram in which one embodiment of the present disclosure may be applied;

FIG. 2 is a flow diagram of one embodiment of a method for generating a thesaurus according to the present disclosure;

FIG. 3 is a schematic diagram of one application scenario of a method for generating a thesaurus according to the present disclosure;

FIG. 4 is a flow diagram of another embodiment of a method for generating a thesaurus according to the present disclosure;

FIG. 5 is a flow diagram of one embodiment of a risk detection method according to the present disclosure;

FIG. 6 is a schematic diagram illustrating one embodiment of an apparatus for generating a thesaurus according to the present disclosure;

FIG. 7 is a schematic structural diagram of one embodiment of a risk detection device according to the present disclosure;

FIG. 8 is a block diagram of an electronic device used to implement methods of embodiments of the present disclosure.

Detailed Description

Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings, in which various details of the embodiments of the disclosure are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.

It should be noted that, in the present disclosure, the embodiments and features of the embodiments may be combined with each other without conflict. The present disclosure will be described in detail below with reference to the accompanying drawings in conjunction with embodiments.

Fig. 1 is an exemplary system architecture diagram according to a first embodiment of the present disclosure, illustrating an exemplary system architecture 100 to which an embodiment of the method for generating a thesaurus of the present disclosure may be applied.

As shown in fig. 1, the system architecture 100 may include terminal devices 101, 102, 103, a network 104, and a server 105. The network 104 serves as a medium for providing communication links between the terminal devices 101, 102, 103 and the server 105. The network 104 may include various connection types, such as wired or wireless communication links, or fiber optic cables.

A user may use the terminal devices 101, 102, 103 to interact with the server 105 via the network 104 to receive or send messages and the like. The terminal devices 101, 102, 103 may be electronic devices such as mobile phones, computers, and tablets. Insurance application software may be installed on the terminal devices 101, 102, 103, and this software may implement insurance risk control, for example by identifying users who present a risk and outputting corresponding prompts.

The terminal devices 101, 102, 103 may be hardware or software. When the terminal devices 101, 102, 103 are hardware, they may be various electronic devices including, but not limited to, televisions, smart phones, tablet computers, e-book readers, vehicle-mounted computers, laptop computers, desktop computers, and the like. When the terminal devices 101, 102, 103 are software, they can be installed in the electronic devices listed above, and may be implemented as multiple pieces of software or software modules (e.g., to provide distributed services) or as a single piece of software or software module. No particular limitation is imposed here.

The server 105 may be a server providing various services. For example, it may obtain initial risk words from the terminal devices 101, 102, 103, expand the initial risk words to obtain expanded risk words, determine keyword information based on the initial risk words and the expanded risk words, and generate a target word bank based on each keyword in the keyword information. Thereafter, the server 105 may receive an instruction sent by the terminal devices 101, 102, 103 to identify whether a target user is at risk, determine the risk condition based on how the target user matches the target word bank, and return the risk condition to the terminal devices 101, 102, 103.

The server 105 may be hardware or software. When the server 105 is hardware, it may be implemented as a distributed server cluster composed of a plurality of servers, or as a single server. When the server 105 is software, it may be implemented as multiple pieces of software or software modules (e.g., to provide distributed services), or as a single piece of software or software module. No particular limitation is imposed here.

It should be noted that the method for generating the thesaurus provided by the embodiment of the present disclosure may be executed by the terminal devices 101, 102, 103, or may be executed by the server 105. Accordingly, the apparatus for generating the thesaurus may be provided in the terminal devices 101, 102, 103, or may be provided in the server 105.

It should be understood that the number of terminal devices, networks, and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.

With continued reference to FIG. 2, a flow 200 of one embodiment of a method for generating a thesaurus in accordance with the present disclosure is shown. The method for generating the word stock comprises the following steps:

Step 201, obtaining an initial risk word.

In this embodiment, when the execution subject (such as the server 105 or the terminal devices 101, 102, 103 in fig. 1) needs to generate a thesaurus, it may first obtain the initial risk words. The initial risk words may be provided in advance by the business party, or may be obtained by analyzing historical risk data, which is not limited in this embodiment. An initial risk word is a risk-bearing word that needs attention in the risk control process; there may be one or more initial risk words, which is not limited in this embodiment. In risk control for the insurance industry, an initial risk word can be a word that indicates a risk for a specified insurance. For example, if a certain disease type is not suitable for the specified insurance, that disease type is treated as a risk word. In risk control for the advertising industry, the initial risk words may be words that are not allowed to appear in advertisements. If the initial risk words come from a word packet provided in advance by the business party, the execution subject can perform semantic recognition on the word packet in order to organize it. For example, each initial risk word is mapped to a corresponding risk category, and the risk categories and initial risk words are written to a JSON (JavaScript Object Notation) file.
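The mapping of initial risk words to risk categories described above can be sketched as follows. This is a minimal illustration only: the category names, the grouping of words under their category, and the function name build_risk_word_map are assumptions, since the disclosure does not fix the layout of the JSON file.

```python
import json

# Hypothetical word packet: (initial risk word, risk category) pairs.
word_packet = [
    ("disease A", "health disclosure risk"),
    ("cheat insurance", "fraud risk"),
]


def build_risk_word_map(pairs):
    """Group initial risk words under their risk category."""
    mapping = {}
    for word, category in pairs:
        mapping.setdefault(category, []).append(word)
    return mapping


risk_word_map = build_risk_word_map(word_packet)
# ensure_ascii=False keeps non-ASCII risk words readable in the output.
risk_word_json = json.dumps(risk_word_map, ensure_ascii=False, indent=2)
print(risk_word_json)
```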

Step 202, expanding the initial risk words to obtain expanded risk words.

In this embodiment, after obtaining the initial risk words, the execution subject may expand them to obtain expanded risk words. An expanded risk word is a risk word that has an association relationship with an initial risk word. To determine the expanded risk words, the execution subject may, based on semantic association with the initial risk words, take risk words whose edit distance from an initial risk word is smaller than a threshold as expanded risk words. Alternatively, a risk word expansion database may be preset, and the execution subject directly looks up the risk words corresponding to the initial risk words in that database and uses them as expanded risk words. The risk word expansion database can be built by analyzing historical risk data.
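The edit-distance variant of the expansion can be sketched as follows. The disclosure does not name a specific distance, so the Levenshtein distance is used here as an assumption; the candidate list, the threshold value, and the function names are likewise hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: the minimum number of single-character
    insertions, deletions, or substitutions that turn a into b."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def expand_risk_words(initial_words, candidates, threshold):
    """Keep candidates whose distance to some initial word is below threshold."""
    expanded = []
    for candidate in candidates:
        if candidate in initial_words:
            continue  # already an initial risk word
        if any(edit_distance(candidate, w) < threshold for w in initial_words):
            expanded.append(candidate)
    return expanded


expanded = expand_risk_words(
    ["disease A"],
    ["mild disease A", "unrelated marketing phrase"],
    threshold=6,
)
print(expanded)
```

A character-level distance like this only captures surface similarity; the semantic association mentioned in the text would need a separate similarity model.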

Step 203, determining keyword information based on the initial risk words and the expanded risk words.

In this embodiment, the execution subject can determine the key phrases in the initial risk words and the expanded risk words, and then split each key phrase into a plurality of keywords, which together form the keyword information. The key phrases may be determined according to a preset determination rule. For example, "disease A" can be determined as the key phrase for "mild disease A", "moderate disease A", and "severe disease A". Each individual word in a key phrase is a keyword. Optionally, determining the keyword information based on the initial risk words and the expanded risk words may include: inputting the initial risk words and the expanded risk words into a preset key phrase determination model to obtain a plurality of corresponding key phrases; obtaining a plurality of keywords from the individual words in the key phrases; and determining the keyword information based on these keywords. Further optionally, the execution subject may sort the individual words of the key phrases to obtain the keyword information.
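A rule-based version of this step might look like the sketch below. The severity-modifier rule is a hypothetical stand-in for the preset determination rule, and whitespace splitting stands in for splitting a core term into its individual units; neither is specified by the disclosure.

```python
# Hypothetical determination rule: drop a leading severity modifier.
SEVERITY_MODIFIERS = ("mild", "moderate", "severe")


def core_term(risk_word):
    """Reduce a risk word to its core term, e.g. 'mild disease A' -> 'disease A'."""
    tokens = risk_word.split()
    if tokens and tokens[0] in SEVERITY_MODIFIERS:
        tokens = tokens[1:]
    return " ".join(tokens)


def keyword_information(risk_words):
    """Map each distinct core term to the ordered units it splits into."""
    info = {}
    for word in risk_words:
        term = core_term(word)
        info.setdefault(term, term.split())
    return info


info = keyword_information(
    ["disease A", "mild disease A", "moderate disease A", "cheat insurance"]
)
print(info)
```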

Step 204, generating a target word bank based on each keyword in the keyword information.

In this embodiment, the keywords in the keyword information may be ordered according to semantics, so that adjacent related keywords can jointly form a word. The keyword information can also include word segmentation marks, which are used to divide the keywords in the keyword information into corresponding phrases. The execution subject can identify the word segmentation marks in the keyword information, divide the keywords at these marks to obtain a plurality of phrases, and form the target word stock from them.

For example, in an insurance industry risk control scenario, the execution subject may analyze historical risk data, such as data of historical risky users, to obtain the initial risk words corresponding to a specified insurance. The execution subject may then expand the initial risk words. Specifically, the execution subject may analyze historical risky user data in advance and determine the frequency with which risk occurred for each initial risk word; it then performs semantic association on a specified number of initial risk words, taken in descending order of risk occurrence frequency, to obtain the expanded risk words. The semantic association can be realized by existing technical means and is not described here. Then, the execution subject can determine the key phrases in the initial risk words and the expanded risk words, split the key phrases into keywords, and arrange and combine the keywords in different ways to obtain the keyword information, which includes a keyword sequence for each type of arrangement. The execution subject can also traverse the keyword sequences and, based on semantic analysis, place word segmentation marks at specified keyword positions in each sequence; for example, a mark can be placed at the position of the last keyword that completes a word. Finally, the execution subject can divide the keywords in each keyword sequence at the word segmentation marks to form a plurality of words and generate the target word stock.

With continued reference to fig. 3, a schematic diagram of one application scenario of a method for generating a thesaurus according to the present disclosure is shown. In the application scenario of fig. 3, the execution subject may be used for risk control in the insurance industry. The execution subject may first obtain initial risk words 301 corresponding to the specified insurance, which may include "disease A" and "cheat insurance". The execution subject may then expand the initial risk words 301 to obtain expanded risk words 302, which may include "mild disease A" and "moderate disease A" obtained by expanding "disease A", as well as "insurance fraud" and "work injury cheat insurance" obtained by expanding "cheat insurance". Next, the execution subject may determine key phrases such as "disease A", "cheat insurance", "work injury cheat insurance", and "insurance fraud" in the initial risk words 301 and the expanded risk words 302. The execution subject then splits the key phrases into individual words to obtain keyword information 303, which includes "disease", "illness", "A", "cheat", "insurance", "worker", "injury", "insurance", and "danger". The target word stock 304 can be obtained by arranging and combining these keywords, and may include "disease A", "cheat insurance", "disease A work injury cheat insurance", and the like.

In the method for generating the word stock provided by the above embodiment of the present disclosure, the initial risk words are expanded to obtain expanded risk words, which broadens the coverage of the risk words. Keyword information is then determined based on the initial risk words and the expanded risk words, so that further information is extracted from the risk words. Finally, a target word stock is generated based on each keyword in the keyword information. The resulting target word stock reflects each keyword in the initial risk words and the expanded risk words; compared with a manually set word stock, it contains more, and more accurate, information, which can improve the accuracy of the word stock.

With continued reference to fig. 4, a flow 400 of another embodiment of a method for generating a thesaurus according to the present disclosure is shown. As shown in fig. 4, the method for generating a thesaurus of the present embodiment may include the following steps:

Step 401, obtaining an initial risk word.

In this embodiment, for details of step 401, refer to the description of step 201; they are not repeated here.

Step 402, determining the edit distance and/or semantic similarity between each candidate expansion word in a preset candidate expansion word bank and the initial risk word.

In this embodiment, the preset candidate expansion word bank is a word bank determined in advance by manual setting, data crawling, or the like, and may include a large number of risk words. After obtaining the initial risk words, the execution subject can determine the preliminary category of the initial risk words, then screen a plurality of candidate expansion words from the preset candidate expansion word bank, and then determine the edit distance and/or semantic similarity between each candidate expansion word and the initial risk words. The edit distance is the number of editing operations required to convert one word into the other, and the semantic similarity is the degree of similarity in meaning between the two words obtained through semantic analysis. The shorter the edit distance and the higher the semantic similarity, the stronger the association between the candidate expansion word and the initial risk word.

In some optional implementations of this embodiment, the preset candidate expansion word bank is determined by the following steps: acquiring historical risk information; and determining the preset candidate expansion word bank based on the occurrence frequency of each risk word in the historical risk information.

In this implementation, the historical risk information may be information on risks that occurred in the past, such as the corpora of users who have historically presented a risk. The execution subject can perform word segmentation on the historical risk information to obtain the risk words appearing in it, and then add a preset number of the most frequently occurring risk words to a word bank to obtain the preset candidate expansion word bank. For example, the three most frequent risk words in the historical risk information are added to the word bank to obtain the preset candidate expansion word bank.
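The frequency-based selection can be sketched as follows, assuming the historical risk information has already been segmented into lists of risk words. The sample records and the function name are hypothetical.

```python
from collections import Counter


def build_candidate_word_bank(segmented_records, top_n):
    """Return the top_n most frequent risk words across all records."""
    counts = Counter(word for record in segmented_records for word in record)
    return [word for word, _ in counts.most_common(top_n)]


# Hypothetical, already segmented historical risk records.
records = [
    ["fraud", "disease A"],
    ["fraud", "forged invoice"],
    ["fraud", "disease A"],
    ["late claim"],
]
candidate_word_bank = build_candidate_word_bank(records, top_n=2)
print(candidate_word_bank)
```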

Step 403, determining the expanded risk words among the candidate expansion words based on the edit distance and/or the semantic similarity.

In this embodiment, the shorter the edit distance between a candidate expansion word and the initial risk word and the higher their semantic similarity, the stronger the association between them. Therefore, a preset number of candidate expansion words can be selected as expanded risk words in order of edit distance from short to long and/or semantic similarity from high to low.
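One possible way to combine the two orderings is sketched below. The disclosure does not say how edit distance and semantic similarity are combined, so sorting by distance first and breaking ties by similarity is an assumption. The scores are hypothetical precomputed inputs, since computing semantic similarity would require a separate model.

```python
from typing import NamedTuple


class Candidate(NamedTuple):
    word: str
    edit_distance: int   # smaller means closer to the initial risk word
    similarity: float    # larger means closer in meaning (assumed precomputed)


def select_expanded_words(candidates, top_n):
    """Rank by edit distance ascending, then similarity descending."""
    ranked = sorted(candidates, key=lambda c: (c.edit_distance, -c.similarity))
    return [c.word for c in ranked[:top_n]]


# Hypothetical scored candidates.
scored = [
    Candidate("mild disease A", 5, 0.90),
    Candidate("moderate disease A", 9, 0.88),
    Candidate("disease B", 1, 0.40),
    Candidate("travel delay", 11, 0.10),
]
selected = select_expanded_words(scored, top_n=2)
print(selected)
```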

Step 404, determining keyword information based on the initial risk words and the expanded risk words.

In this embodiment, for details of step 404, refer to the description of step 203; they are not repeated here.

Step 405, determining a keyword set based on each keyword in the keyword information and a preset keyword sequence.

In this embodiment, since each keyword in the keyword information is obtained by splitting the key phrases of the initial risk words and the expanded risk words, the preset keyword order may be the order in which the keywords appear within their key phrase. It should be noted that when an initial risk word or expanded risk word contains two or more key phrases, such as "work injury" and "cheat insurance" in "work injury cheat insurance", the key phrases may be split separately to obtain the keywords of each key phrase, and the keywords of different key phrases may be ordered in several ways, such as "work injury cheat insurance" and "cheat insurance work injury". The execution subject may sort the keywords in the keyword information according to the preset keyword order to obtain a keyword set.

In some optional implementations of this embodiment, determining the keyword set based on each keyword in the keyword information and a preset keyword order includes: traversing each keyword in the keyword information in the initial dictionary tree according to a preset keyword sequence; for each keyword which is not stored in advance, storing the keyword and the next keyword of the keyword in the initial dictionary tree in an associated manner according to a preset keyword sequence to obtain a target dictionary tree; based on each keyword in the target dictionary tree, a set of keywords is determined.

In this implementation, the execution subject may generate the target thesaurus by building a dictionary tree. A dictionary tree is a tree-shaped storage structure in which each node other than the root node contains only one character. Each keyword in the keyword information may be stored in its own node of the dictionary tree other than the root node. Specifically, the execution subject may traverse the keywords against the initial dictionary tree according to the preset keyword order, that is, according to each of the several keyword orderings. The initial dictionary tree is the dictionary tree to which data is to be written, and the execution subject can read it directly from memory. During the traversal, if the current keyword has not been stored in advance, that is, it is a keyword in the keyword information that is not yet present in the initial dictionary tree, the current keyword and the next keyword are stored in association as a key-value pair, where the next keyword is determined by the preset keyword order. If the current keyword has been stored in advance, that is, it is already present in the initial dictionary tree, the value corresponding to the current keyword is determined and the next keyword is traversed based on that value. This continues until the traversal is finished, yielding the target dictionary tree. In this way every keyword in the keyword information can be stored in the target dictionary tree. Optionally, when the current keyword is determined to be the last word of its key phrase, a word segmentation mark may be added at the storage location of the current keyword. Further optionally, after the word segmentation mark is added at the storage location of the current keyword, the corresponding risk category may also be added.
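The dictionary tree construction can be sketched with nested key-value pairs, as below. This is an illustration under stated assumptions, not the disclosed implementation: the tree is a plain nested dict, the reserved key END plays the role of the word segmentation mark, and the sample words and categories are hypothetical. Passing a string stores one character per node.

```python
END = "__end__"  # hypothetical reserved key acting as the word segmentation mark


def insert_keywords(trie, keywords, risk_category=None):
    """Store an ordered keyword sequence, one keyword per node."""
    node = trie
    for keyword in keywords:
        # Reuse the node if the keyword is already stored; otherwise create
        # it, so that it becomes the parent of the keyword that follows.
        node = node.setdefault(keyword, {})
    # Mark the last keyword and attach the risk category at the same place.
    node[END] = risk_category
    return trie


def stored_words(trie, prefix=""):
    """Collect every marked word together with its risk category."""
    words = {}
    for key, child in trie.items():
        if key == END:
            words[prefix] = child
        else:
            words.update(stored_words(child, prefix + key))
    return words


trie = {}
insert_keywords(trie, "fraud", "fraud risk")
insert_keywords(trie, "fraudster", "fraud risk")
insert_keywords(trie, "forgery", "document risk")
print(stored_words(trie))
```

Because shared prefixes are stored once, "fraud" and "fraudster" reuse the same first five nodes, and the mark on the fifth node is what distinguishes the shorter word.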

Step 406, determining the word segmentation positions in the keyword set.

In this embodiment, the execution subject adds a word segmentation mark at the position of the last word of each key phrase. By recognizing the word segmentation marks, the execution subject may determine the positions of the marks as the word segmentation positions in the keyword set. The word segmentation positions are used to divide the keywords into different words. In addition to adding word segmentation marks at the last word of each key phrase, determining the word segmentation positions in the keyword set may optionally further include: identifying the various keyword combinations in the keyword set, judging whether each keyword combination is a normal word, and if so, adding word segmentation marks based on that keyword combination. For example, a key phrase corresponds to a plurality of keywords, and a single one of those keywords may serve as an abbreviation that conveys the meaning of the key phrase. That keyword can itself be treated as a keyword combination: word segmentation marks are added immediately before and after it, and the word segmentation positions are determined based on these marks.

Step 407, dividing the keyword set into at least one target word based on the word segmentation position.

In this embodiment, since the word segmentation positions describe where the keywords are to be split, the keyword set can be split at each word segmentation position, and the keywords between any two adjacent word segmentation positions can be combined in order into one target word, so as to obtain at least one target word.
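The division at word segmentation positions can be sketched as follows. It assumes each position is the index of the last keyword of a target word; the disclosure does not specify how positions are represented, and the function name and separator parameter are hypothetical.

```python
def split_at_marks(keywords, mark_positions, separator=""):
    """Join the keywords between consecutive marks into target words.

    mark_positions holds the index of the last keyword of each word;
    keywords after the final mark are left out.
    """
    words, start = [], 0
    for end in sorted(mark_positions):
        words.append(separator.join(keywords[start:end + 1]))
        start = end + 1
    return words


keyword_set = ["work", "injury", "cheat", "insurance"]
target_words = split_at_marks(keyword_set, [1, 3], separator=" ")
print(target_words)
```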

Step 408, generating a target word bank based on the at least one target word.

In this embodiment, at least one target word and the risk category corresponding to each target word may be stored correspondingly, so as to generate a target word bank.

In some optional implementations of this embodiment, generating the target word library based on the at least one target word includes: for each target word in at least one target word, determining a risk category corresponding to the target word; and generating a target word bank based on at least one target word and the risk category corresponding to each target word.

In this implementation, the risk category corresponding to a target word can be preset at the position of the word segmentation mark. The risk category can be determined by semantic classification: the semantics of the target word are analyzed and the category matching the target word is determined. In the process of generating the target word bank, the target words and their risk categories can be stored in correspondence. For example, the risk categories may include a fraud risk, a cheat insurance risk, and the like; when the target word bank is generated, the fraud risk is stored in association with each target word corresponding to it, and the cheat insurance risk is stored in association with each target word corresponding to it. In this way, when risk detection is carried out based on the target word bank, the risk category corresponding to a risk word can be found quickly, which makes it easier to output targeted risk warnings.

Step 409, storing a snapshot of the target thesaurus.

In this embodiment, the execution subject may store a snapshot of the target thesaurus, so that the target thesaurus can later be loaded from the snapshot.
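A snapshot could, for example, be written and reloaded as a JSON file. The disclosure does not specify a snapshot format, so JSON, the file name, and the function names below are assumptions.

```python
import json
import os
import tempfile


def save_snapshot(word_bank, path):
    """Write the target word bank (word -> risk category) to disk."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(word_bank, handle, ensure_ascii=False, sort_keys=True)


def load_snapshot(path):
    """Load a previously saved target word bank."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


# Hypothetical target word bank.
word_bank = {
    "cheat insurance": "fraud risk",
    "disease A": "health disclosure risk",
}
with tempfile.TemporaryDirectory() as folder:
    snapshot_path = os.path.join(folder, "target_word_bank.json")
    save_snapshot(word_bank, snapshot_path)
    restored = load_snapshot(snapshot_path)
print(restored == word_bank)
```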

Step 410, obtaining a risk positive sample set and/or a risk negative sample set.

In this embodiment, the risk positive sample set is a sample set with risk, and the risk negative sample set is a sample set without risk. For example, the risk positive sample set may be a set of risk users, and the risk negative sample set may be a set of non-risk users.

Step 411, updating the target word bank based on the risk positive sample set and/or the risk negative sample set.

In this embodiment, for the risk positive sample set, the risk positive samples may be traversed in turn, and each risk positive sample is matched against the words in the target word bank to obtain a classification of the sample, such as risk present or no risk. If the resulting classification is risk present, the traversal continues. If the resulting classification is no risk, the target risk words in that risk positive sample are added to the target word bank; for example, a preset number of the most frequently occurring target risk words in the risk positive sample are added. For the risk negative sample set, a preset number of the most frequently occurring risk words in each risk negative sample can be determined, and whether each of them is in the target word bank is judged; any that are in the target word bank are deleted from it, thereby updating the target word bank.
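The update procedure can be sketched as follows. It assumes each sample has already been reduced to a list of candidate risk words and that a sample is classified as risky when any of its words is in the word bank; both are simplifying assumptions, and the sample data is hypothetical.

```python
from collections import Counter


def update_word_bank(word_bank, positive_samples, negative_samples, top_n):
    """Return an updated copy of the target word bank (a set of words)."""
    updated = set(word_bank)
    for sample in positive_samples:
        if any(word in updated for word in sample):
            continue  # already classified as risky, nothing to learn
        # Missed risky sample: add its most frequent words.
        for word, _ in Counter(sample).most_common(top_n):
            updated.add(word)
    for sample in negative_samples:
        # Frequent words of non-risky samples should not trigger warnings.
        for word, _ in Counter(sample).most_common(top_n):
            updated.discard(word)
    return updated


word_bank = {"cheat insurance", "late claim"}
positives = [
    ["cheat insurance", "hospital"],
    ["forged invoice", "forged invoice", "hospital"],
]
negatives = [["late claim", "late claim", "renewal"]]
updated_bank = update_word_bank(word_bank, positives, negatives, top_n=1)
print(sorted(updated_bank))
```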

The method for generating the word stock provided by the above embodiment of the present disclosure can further order the keywords according to a preset keyword order to obtain a keyword set, and then determine the word segmentation positions based on the segmentation position of each key phrase and on the semantic association of the keyword combinations. At least one target word is then obtained by dividing at the word segmentation positions. The target word bank generated from these target words can cover more types of words, such as abbreviations of key phrases, combinations of key phrases in different orders, and individual keywords, which improves the coverage of the target word bank. In addition, the target words and risk categories can be stored in correspondence, so that target words can be looked up directly by risk category and risk words can be determined and searched more efficiently. The edit distance and the semantic similarity can be used as the basis for determining the expanded risk words, so that more accurate expanded risk words are obtained. The occurrence frequency of each risk word in the historical risk information is further used as a basis for determining the expanded risk words, which makes the expanded risk words more accurate and the word bank content richer. Supervised updating of the target word bank can be realized based on the risk positive sample set and/or the risk negative sample set, which keeps the target word bank up to date and makes it more accurate. Finally, storing a snapshot of the target word bank makes it possible to read the snapshot directly when updating the target word bank, which makes updating more convenient.

With continued reference to fig. 5, a flow 500 of one embodiment of a risk detection method according to the present disclosure is shown. As shown in fig. 5, the risk detection method of the present embodiment may include the following steps:

Step 501, generating a target word bank.

In the present embodiment, the execution body may generate the target thesaurus based on the method for generating a thesaurus described above.

Step 502, performing risk detection on the target object based on the target word bank.

In this embodiment, the target object may be a document in any of various forms, such as text or an image, for example, a policy text of a target user. The target object is processed by text analysis, image recognition, or the like to recognize a plurality of keywords contained in it, and the probability that the target object carries risk can be determined by matching these keywords against the target word bank. The higher the degree of matching between the keywords contained in the target object and the words in the target word bank, the higher the probability of risk, thereby realizing risk early warning.
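As a rough illustration of the matching step, the sketch below scores a target object by the fraction of its distinct keywords that appear in the target word bank. The scoring formula, the `threshold` value, and the function names `risk_score` and `needs_warning` are assumptions made for this example; the disclosure does not specify how the matching degree is computed.

```python
from typing import Iterable, Set


def risk_score(keywords: Iterable[str], lexicon: Set[str]) -> float:
    """Fraction of distinct keywords found in the lexicon (0.0 when there are none)."""
    distinct = set(keywords)
    if not distinct:
        return 0.0
    return len(distinct & lexicon) / len(distinct)


def needs_warning(keywords: Iterable[str], lexicon: Set[str], threshold: float = 0.5) -> bool:
    """Raise an early warning when the score reaches the (assumed) threshold."""
    return risk_score(keywords, lexicon) >= threshold
```

The keywords are assumed to have been extracted beforehand by text analysis or image recognition; that extraction is not shown here.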

With further reference to fig. 6, as an implementation of the method shown in the above figures, the present disclosure provides an embodiment of an apparatus for generating a thesaurus, which corresponds to the embodiment of the method shown in fig. 2, and which is particularly applicable to various servers.

As shown in fig. 6, the apparatus 600 for generating a thesaurus of the present embodiment includes: a risk word acquisition unit 601, a risk word expansion unit 602, an information determination unit 603, and a thesaurus generation unit 604.

The risk word acquisition unit 601 is configured to acquire an initial risk word.

The risk word expansion unit 602 is configured to expand the initial risk word to obtain an expanded risk word.

The information determination unit 603 is configured to determine keyword information based on the initial risk word and the expanded risk word.

The thesaurus generation unit 604 is configured to generate a target thesaurus based on each keyword in the keyword information.

In some optional implementations of the present embodiment, the thesaurus generation unit 604 is further configured to: determine a keyword set based on each keyword in the keyword information and a preset keyword sequence; determine word segmentation positions in the keyword set; divide the keyword set into at least one target word based on the word segmentation positions; and generate a target word bank based on the at least one target word.

In some optional implementations of the present embodiment, the thesaurus generation unit 604 is further configured to: traverse each keyword in the keyword information in an initial dictionary tree according to the preset keyword sequence; for each keyword that is not already stored, store the keyword and the next keyword of the keyword in the initial dictionary tree in an associated manner according to the preset keyword sequence, to obtain a target dictionary tree; and determine the keyword set based on each keyword in the target dictionary tree.
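One way to picture the dictionary tree is as a word-level trie in which each node is a keyword and each child is the keyword that follows it in the preset sequence. The sketch below is an assumption about the data structure, using nested Python dictionaries; `insert_sequence` and `keyword_sets` are hypothetical names.

```python
from typing import Dict, Iterator, List, Tuple


def insert_sequence(trie: Dict[str, dict], keywords: List[str]) -> None:
    """Walk the trie along the ordered keywords, creating nodes that are not yet stored."""
    node = trie
    for keyword in keywords:
        # setdefault stores the keyword only if it is missing, then descends into it,
        # so each keyword ends up linked to the keyword that follows it.
        node = node.setdefault(keyword, {})


def keyword_sets(trie: Dict[str, dict], prefix: Tuple[str, ...] = ()) -> Iterator[Tuple[str, ...]]:
    """Yield every root-to-leaf keyword path stored in the trie."""
    for keyword, child in trie.items():
        path = prefix + (keyword,)
        if child:
            yield from keyword_sets(child, path)
        else:
            yield path
```

Note that this simplified version only yields complete root-to-leaf paths; a sequence that is a strict prefix of another would need an explicit end marker to be recovered separately.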

In some optional implementations of the present embodiment, the thesaurus generation unit 604 is further configured to: for each target word in the at least one target word, determine a risk category corresponding to the target word; and generate the target word bank based on the at least one target word and the risk category corresponding to each target word.

In some optional implementations of this embodiment, the risk word expansion unit 602 is further configured to: determine the edit distance and/or semantic similarity between each candidate expansion word in a preset candidate expansion word bank and the initial risk word; and determine the expanded risk word from among the candidate expansion words based on the edit distance and/or the semantic similarity.
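The edit-distance branch of this expansion can be sketched with the standard Levenshtein distance. The cutoff `max_distance` and the function name `expand` are assumptions for illustration; the semantic-similarity branch, which would typically rely on word embeddings, is omitted.

```python
from typing import Iterable, List


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed row by row with dynamic programming."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,         # deletion
                    current[j - 1] + 1,      # insertion
                    previous[j - 1] + cost,  # substitution or match
                )
            )
        previous = current
    return previous[-1]


def expand(initial: str, candidates: Iterable[str], max_distance: int = 1) -> List[str]:
    """Keep candidates within max_distance edits of the initial risk word."""
    return [
        word
        for word in candidates
        if word != initial and edit_distance(initial, word) <= max_distance
    ]
```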

In some optional implementations of this embodiment, the risk word expansion unit 602 is further configured to: acquire historical risk information; and determine the preset candidate expansion word bank based on the occurrence frequency of each risk word in the historical risk information.
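A frequency-based candidate bank could be built along the following lines. The sketch assumes the historical risk information is available as tokenized records and that a word qualifies once it occurs at least `min_count` times; both the representation and the threshold are illustrative assumptions, as is the name `build_candidate_bank`.

```python
from collections import Counter
from typing import Iterable, List, Set


def build_candidate_bank(history: Iterable[List[str]], min_count: int = 2) -> Set[str]:
    """Collect words that occur at least min_count times across all historical records."""
    counts = Counter(word for record in history for word in record)
    return {word for word, count in counts.items() if count >= min_count}
```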

In some optional implementations of this embodiment, the apparatus further includes: a set acquisition unit configured to acquire a risk positive sample set and/or a risk negative sample set; and a thesaurus updating unit configured to update the target word bank based on the risk positive sample set and/or the risk negative sample set.

In some optional implementations of this embodiment, the apparatus further includes: a snapshot storage unit configured to store a snapshot of the target word bank.
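Snapshot storage can be as simple as serializing the word bank to a file and reading it back when an update is needed. The sketch below uses JSON as the snapshot format; the format, the file layout, and the names `save_snapshot` and `load_snapshot` are assumptions, not details from the disclosure.

```python
import json
from pathlib import Path
from typing import Set, Union


def save_snapshot(lexicon: Set[str], path: Union[str, Path]) -> None:
    """Write the lexicon to disk as a sorted JSON array."""
    text = json.dumps(sorted(lexicon), ensure_ascii=False)
    Path(path).write_text(text, encoding="utf-8")


def load_snapshot(path: Union[str, Path]) -> Set[str]:
    """Read a previously saved snapshot back into a set of words."""
    return set(json.loads(Path(path).read_text(encoding="utf-8")))
```

Sorting before writing keeps successive snapshots easy to compare, and `ensure_ascii=False` preserves non-ASCII risk words in readable form.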

It should be understood that units 601 to 604 recited in the apparatus 600 for generating a thesaurus correspond to respective steps in the method described with reference to fig. 2, respectively. Thus, the operations and features described above with respect to the method for generating a thesaurus are equally applicable to the apparatus 600 and the units included therein and will not be described in detail here.

With continuing reference to fig. 7, as an implementation of the methods shown in the above figures, the present disclosure provides an embodiment of a risk detection apparatus, which corresponds to the embodiment of the method shown in fig. 5, and which is particularly applicable to various servers. The risk detection apparatus 700 of the present embodiment includes the apparatus 600 for generating a thesaurus and a risk detection unit 701, wherein

the risk detection unit 701 is configured to perform risk detection on the target object based on the target word bank generated by the apparatus 600 for generating a thesaurus.

In this embodiment, the risk detection unit 701 corresponds to step 502, and the operations and features described above with respect to step 502 are also applicable to the risk detection unit 701, and are not described herein again.

The present disclosure also provides an electronic device, a readable storage medium, and a computer program product according to embodiments of the present disclosure. The electronic device includes: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to cause the at least one processor to implement the method for generating a thesaurus or the risk detection method described above. The readable storage medium stores computer instructions for causing a computer to execute the above-described method for generating a thesaurus or risk detection method. The computer program product comprises a computer program which, when executed by a processor, implements the above-described method for generating a thesaurus or risk detection method.

Fig. 8 shows a block diagram of an electronic device 800 for implementing a method for generating a thesaurus according to an embodiment of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not meant to limit implementations of the disclosure described and/or claimed herein.

As shown in fig. 8, the device 800 includes a processor 801 which may perform various appropriate actions and processes in accordance with a computer program stored in a Read Only Memory (ROM) 802 or a computer program loaded from a storage unit 808 into a Random Access Memory (RAM) 803. In the RAM 803, various programs and data required for the operation of the device 800 can also be stored. The processor 801, the ROM 802, and the RAM 803 are connected to each other by a bus 804. An input/output (I/O) interface 805 is also connected to bus 804.

A number of components in the device 800 are connected to the I/O interface 805, including: an input unit 806, such as a keyboard, a mouse, or the like; an output unit 807 such as various types of displays, speakers, and the like; a storage unit 808, such as a magnetic disk, optical disk, or the like; and a communication unit 809 such as a network card, modem, wireless communication transceiver, etc. The communication unit 809 allows the device 800 to exchange information/data with other devices via a computer network such as the internet and/or various telecommunication networks.

The processor 801 may be various general purpose and/or special purpose processing components having processing and computing capabilities. Some examples of processor 801 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various processors running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, or the like. The processor 801 performs the various methods and processes described above, such as the method for generating a thesaurus. For example, in some embodiments, the method for generating a thesaurus may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as storage unit 808. In some embodiments, part or all of the computer program can be loaded and/or installed onto device 800 via ROM 802 and/or communication unit 809. When loaded into RAM 803 and executed by the processor 801, a computer program may perform one or more steps of the method for generating a thesaurus described above. Alternatively, in other embodiments, the processor 801 may be configured to perform the method for generating the thesaurus by any other suitable means (e.g., by means of firmware).

Various implementations of the systems and techniques described here above may be implemented in digital electronic circuitry, integrated circuitry, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), systems on a chip (SOCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various implementations may include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.

Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. These program codes may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program codes, when executed by the processor or controller, cause the functions/operations specified in the flowchart and/or block diagram to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine or entirely on the remote machine or server.

In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.

To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.

The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: Local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.

The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.

It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present disclosure may be executed in parallel or sequentially or in different orders, and are not limited herein as long as the desired results of the technical solutions disclosed in the present disclosure can be achieved.

The above detailed description should not be construed as limiting the scope of the disclosure. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present disclosure should be included in the scope of protection of the present disclosure.

Claims

1. A method for generating a thesaurus, comprising:

acquiring an initial risk word;

expanding the initial risk words to obtain expanded risk words;

determining keyword information based on the initial risk words and the expanded risk words;

and generating a target word bank based on each keyword in the keyword information.

2. The method of claim 1, wherein the generating a target thesaurus based on each keyword in the keyword information comprises:

determining a keyword set based on each keyword in the keyword information and a preset keyword sequence;

determining word segmentation positions in the keyword set;

dividing the keyword set into at least one target word based on the word segmentation position;

and generating the target word bank based on the at least one target word.

3. The method of claim 2, wherein the determining a keyword set based on each keyword in the keyword information and a preset keyword order comprises:

traversing each keyword in the keyword information in the initial dictionary tree according to the preset keyword sequence;

for each keyword which is not stored in advance, storing the keyword and the next keyword of the keyword in the initial dictionary tree in an associated manner according to the preset keyword sequence to obtain a target dictionary tree;

determining the keyword set based on each keyword in the target dictionary tree.

4. The method of claim 2, wherein the generating the target thesaurus based on the at least one target word comprises:

for each target word in the at least one target word, determining a risk category corresponding to the target word;

and generating the target word bank based on the at least one target word and the risk category corresponding to each target word.

5. The method of claim 1, wherein the expanding the initial risk word to obtain an expanded risk word comprises:

determining the editing distance and/or semantic similarity between each candidate expansion word and the initial risk word in a preset candidate expansion word bank;

and determining the expansion risk word in the candidate expansion words based on the editing distance and/or the semantic similarity.

6. The method of claim 5, wherein the preset candidate expansion word bank is determined by the following steps:

acquiring historical risk information;

and determining the preset candidate expansion word bank based on the occurrence frequency of each risk word in the historical risk information.

7. The method of claim 1, wherein the method further comprises:

acquiring a risk positive sample set and/or a risk negative sample set;

updating the target thesaurus based on the risk positive sample set and/or the risk negative sample set.

8. The method of any of claims 1 to 7, wherein the method further comprises:

and storing the snapshot of the target word bank.

9. A method of risk detection, wherein the method comprises:

generating a target thesaurus based on the method of any one of claims 1-8;

and carrying out risk detection on the target object based on the target word bank.

10. An apparatus for generating a thesaurus, comprising:

a risk word obtaining unit configured to obtain an initial risk word;

a risk word expansion unit configured to expand the initial risk word to obtain an expanded risk word;

an information determination unit configured to determine keyword information based on the initial risk word and the expanded risk word;

a thesaurus generating unit configured to generate a target thesaurus based on each keyword in the keyword information.

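11. The apparatus of claim 10, wherein the thesaurus generation unit is further configured to:

determine a keyword set based on each keyword in the keyword information and a preset keyword sequence;

determine word segmentation positions in the keyword set;

divide the keyword set into at least one target word based on the word segmentation positions;

and generate the target word bank based on the at least one target word.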

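12. The apparatus of claim 11, wherein the thesaurus generation unit is further configured to:

traverse each keyword in the keyword information in an initial dictionary tree according to the preset keyword sequence;

for each keyword that is not already stored, store the keyword and the next keyword of the keyword in the initial dictionary tree in an associated manner according to the preset keyword sequence, to obtain a target dictionary tree;

and determine the keyword set based on each keyword in the target dictionary tree.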

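13. The apparatus of claim 11, wherein the thesaurus generation unit is further configured to:

for each target word in the at least one target word, determine a risk category corresponding to the target word;

and generate the target word bank based on the at least one target word and the risk category corresponding to each target word.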

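14. The apparatus of claim 10, wherein the risk word expansion unit is further configured to:

determine the edit distance and/or semantic similarity between each candidate expansion word in a preset candidate expansion word bank and the initial risk word;

and determine the expanded risk word from among the candidate expansion words based on the edit distance and/or the semantic similarity.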

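15. The apparatus of claim 14, wherein the risk word expansion unit is further configured to:

acquire historical risk information;

and determine the preset candidate expansion word bank based on the occurrence frequency of each risk word in the historical risk information.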

16. The apparatus of claim 10, wherein the apparatus further comprises:

a set acquiring unit configured to acquire a risk positive sample set and/or a risk negative sample set;

a thesaurus updating unit configured to update the target thesaurus based on the risk positive sample set and/or the risk negative sample set.

17. The apparatus of any of claims 10 to 16, wherein the apparatus further comprises:

a snapshot storage unit configured to store a snapshot of the target thesaurus.

18. A risk detection apparatus, wherein the apparatus comprises the apparatus for generating a thesaurus according to any one of claims 10-17 and a risk detection unit;

the risk detection unit is configured to perform risk detection on a target object based on the target thesaurus generated by the apparatus for generating a thesaurus.

19. An electronic device, comprising:

at least one processor; and

a memory communicatively coupled to the at least one processor; wherein

the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-9.

20. A non-transitory computer readable storage medium having stored thereon computer instructions for causing a computer to perform the method of any one of claims 1-9.

21. A computer program product comprising a computer program which, when executed by a processor, implements the method according to any one of claims 1-9.