CN114330331B - Method and device for determining importance of word segmentation in link - Google Patents

Publication number
CN114330331B
Authority
CN
China
Prior art keywords
participle
participles
link
neighborhood information
updated
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202111616516.5A
Other languages
Chinese (zh)
Other versions
CN114330331A (en
Inventor
李雪莹
鲍青波
万卉
张楠
王煜
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Topsec Technology Co Ltd
Beijing Topsec Network Security Technology Co Ltd
Beijing Topsec Software Co Ltd
Original Assignee
Beijing Topsec Technology Co Ltd
Beijing Topsec Network Security Technology Co Ltd
Beijing Topsec Software Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Topsec Technology Co Ltd, Beijing Topsec Network Security Technology Co Ltd, Beijing Topsec Software Co Ltd filed Critical Beijing Topsec Technology Co Ltd
Priority to CN202111616516.5A priority Critical patent/CN114330331B/en
Publication of CN114330331A publication Critical patent/CN114330331A/en
Application granted granted Critical
Publication of CN114330331B publication Critical patent/CN114330331B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Abstract

The application provides a method and a device for determining the importance of participles in a link. The method comprises: performing word segmentation on a target link text to obtain a participle sequence; generating neighborhood information for each participle, where the neighborhood information of a participle consists of the participle itself, the N participles ordered before it, and the N participles ordered after it; generating a plurality of pieces of residual neighborhood information for each participle from its neighborhood information; generating a plurality of updated link texts for each participle by combining each piece of residual neighborhood information with the participles outside the corresponding neighborhood information, where the updated link texts include updated link texts that contain the participle and updated link texts that do not; and performing malicious link detection on the updated link texts of each participle and determining the importance of each participle in the target link text from the detection results.

Description

Method and device for determining importance of word segmentation in link
Technical Field
The application relates to the technical field of network security, in particular to a method and a device for determining word segmentation importance in a link.
Background
With the development of internet technology, people also run into problems when using a network. For example, when browsing a web page, the whole screen may fill with a succession of web pages; or, when browsing information online, a user clicks a web page link and, as a result, a virus or several pieces of rogue software are installed. Such links degrade the user's internet experience and expose the user to security risks.
When detecting malicious links, more and more machine learning algorithms are applied in addition to the usual approach of matching characteristic strings. Algorithms such as logistic regression or natural language processing detect malicious links considerably more accurately than feature matching. However, because a deep model is a black box, it is impossible to tell which features of a link are the main features that make it malicious (the neural network model outputs only the probabilities of malicious and non-malicious), so the characteristics of the attacker cannot be analyzed further to raise a targeted alarm.
Disclosure of Invention
An object of the embodiments of the present application is to provide a method and an apparatus for determining importance of word segmentation in a link, so as to solve the above problem.
In a first aspect, the present invention provides a method for determining importance of word segmentation in a link, including: performing word segmentation processing on the target link text to obtain a word segmentation sequence, wherein the word segmentation sequence comprises a plurality of words which are sequentially sequenced; generating neighborhood information corresponding to each participle according to the participles, wherein the neighborhood information corresponding to each participle is formed by the preceding N participles and the subsequent N participles of the participles and the participles; generating a plurality of residual neighborhood information corresponding to each participle according to the neighborhood information corresponding to each participle, wherein the residual neighborhood information corresponding to each participle is obtained by deleting preset participles in the corresponding neighborhood information; generating a plurality of updated link texts corresponding to each participle according to the combination of each residual neighborhood information and other participles except the corresponding neighborhood information, wherein the plurality of updated link texts comprise link texts with participles and updated link texts without participles; and carrying out malicious link detection on a plurality of updated link texts corresponding to each participle, and determining the importance of the participle in the target link text according to the detection result.
In the method for determining participle importance in a link, word segmentation is first performed on the target link text to obtain a participle sequence. Neighborhood information is then generated for each participle from the participles of the sequence, and a plurality of pieces of residual neighborhood information are generated from the neighborhood information of each participle. Next, for each participle, updated link texts that contain the participle and updated link texts that do not are generated from each piece of residual neighborhood information and the participles outside the corresponding neighborhood information. Malicious link detection is performed on these updated link texts, and the importance of the participle in the target link text is determined from the detection results. Because the neighborhood information represents the context of a participle, the updated link texts take each participle's context into account. Because the updated link texts include both texts with the participle and texts without it, malicious link detection on them reveals how important the participle is to the target link text, so the malicious features of a link are extracted and explained automatically. If the target link text is a malicious link, the scheme yields the important malicious features in that link, from which the attacker's characteristic attack techniques can be analyzed.
In an optional implementation manner of the first aspect, performing malicious link detection on the plurality of updated link texts corresponding to each participle, and determining the importance of the participle in the target link text according to the detection result, includes: performing malicious link detection on the plurality of updated link texts corresponding to each participle to obtain an updated detection result for each of those updated link texts; calculating a first number, which is the number of updated link texts containing the participle whose detection result is the same as the original detection result of the target link text; calculating a second number, which is the number of updated link texts not containing the participle whose detection result differs from the original detection result of the target link text; and calculating the importance of the corresponding participle to the target link text according to the first number and the second number.
In an optional implementation manner of the first aspect, performing malicious link detection on a plurality of updated link texts corresponding to each participle to obtain an update detection result of each updated link text in the plurality of updated link texts corresponding to each participle includes: and inputting each updated link text corresponding to each participle into a preset malicious link detection model, and obtaining an updated detection result of each updated link text in a plurality of updated link texts corresponding to each participle output by the malicious link detection model.
In an optional implementation manner of the first aspect, before calculating the first number of updated link texts with participles which is the same as the original detection result of the target link text, the method further comprises: and inputting the target link text into a preset malicious link detection model, obtaining a detection result of the target link text output by the malicious link detection model, and obtaining an original detection result.
In an optional implementation manner of the first aspect, generating neighborhood information corresponding to each participle according to a plurality of participles includes: and aiming at each participle in the participle sequence, extracting the participle, N participles before the participle is sequenced and N participles after the participle is sequenced to form neighborhood information corresponding to the participle, and obtaining the neighborhood information corresponding to each participle in the plurality of participles.
In an optional implementation manner of the first aspect, the method further comprises: if the number of participles ordered before or after the participle is less than N, calculating the difference between N and that number; and adding that many preset characters before or after the participle, respectively.
In an optional implementation manner of the first aspect, generating, according to the neighborhood information corresponding to each participle, a plurality of remaining neighborhood information corresponding to each participle includes: for the neighborhood information of each participle, randomly deleting participles from the neighborhood information according to a plurality of preset deletion counts, one count at a time, to obtain a plurality of pieces of remaining neighborhood information for the participle.
In an optional implementation manner of the first aspect, after determining the importance of the word in the target link text according to the detection result, the method further includes: and sequencing the multiple participles according to the importance degrees of the multiple participles from large to small, and displaying the multiple participles with the importance degrees sequenced from large to small.
In a second aspect, the present invention provides an apparatus for determining importance of a word segmentation in a link, the apparatus comprising: the word segmentation processing module is used for carrying out word segmentation processing on the target link text to obtain a word segmentation sequence, and the word segmentation sequence comprises a plurality of words which are sequentially sequenced; the generating module is used for generating neighborhood information corresponding to each participle according to the participles, wherein the neighborhood information corresponding to each participle is formed by the participles and N participles before and N participles after the participles are sequenced; generating a plurality of residual neighborhood information corresponding to each participle according to the neighborhood information corresponding to each participle, wherein the residual neighborhood information corresponding to each participle is obtained by deleting preset participles in the corresponding neighborhood information; generating a plurality of updated link texts corresponding to each participle according to each residual neighborhood information and other participles except the corresponding neighborhood information, wherein the updated link texts comprise link texts with the participles and updated link texts without the participles; and the detection determining module is used for carrying out malicious link detection on the plurality of updated link texts corresponding to each participle and determining the participle importance degree in the target link text according to the detection result.
In the device for determining participle importance in a link, word segmentation is first performed on the target link text to obtain a participle sequence. Neighborhood information is then generated for each participle from the participles of the sequence, and a plurality of pieces of residual neighborhood information are generated from the neighborhood information of each participle. Next, for each participle, updated link texts that contain the participle and updated link texts that do not are generated from each piece of residual neighborhood information and the participles outside the corresponding neighborhood information. Malicious link detection is performed on these updated link texts, and the importance of the participle in the target link text is determined from the detection results. Because the neighborhood information represents the context of a participle, the updated link texts take each participle's context into account. Because the updated link texts include both texts with the participle and texts without it, malicious link detection on them reveals how important the participle is to the target link text, so the malicious features of a link are extracted and explained automatically.
In an optional implementation manner of the second aspect, the detection determining module is specifically configured to: perform malicious link detection on the plurality of updated link texts corresponding to each participle to obtain an updated detection result for each of those updated link texts; calculate a first number, which is the number of updated link texts containing the participle whose detection result is the same as the original detection result of the target link text; calculate a second number, which is the number of updated link texts not containing the participle whose detection result differs from the original detection result of the target link text; and calculate the importance of the corresponding participle to the target link text according to the first number and the second number.
In an optional implementation manner of the second aspect, the detection determining module is further specifically configured to input each updated link text corresponding to each participle into a preset malicious link detection model, and obtain an updated detection result of each updated link text in a plurality of updated link texts corresponding to each participle output by the malicious link detection model.
In an optional implementation manner of the second aspect, the detection determining module is further configured to input the target link text into a preset malicious link detection model, obtain a detection result of the target link text output by the malicious link detection model, and obtain an original detection result.
In an optional implementation manner of the second aspect, the generating module is specifically configured to, for each participle in the participle sequence, extract the participle, the N participles ordered before it, and the N participles ordered after it to form the neighborhood information corresponding to the participle, thereby obtaining the neighborhood information corresponding to each of the plurality of participles.
In an optional implementation manner of the second aspect, the apparatus further includes a calculating module, configured to calculate the difference between N and the number of participles ordered before or after the participle if that number is less than N; and an adding module, configured to add that many preset characters before or after the participle, respectively.
In an optional implementation manner of the second aspect, the generating module is further specifically configured to, for the neighborhood information of each participle, randomly delete participles from the neighborhood information according to a plurality of preset deletion counts, one count at a time, to obtain a plurality of pieces of remaining neighborhood information for the participle.
In an optional implementation manner of the second aspect, the apparatus further includes a presentation module, configured to rank the importance of the multiple participles from high to low, and present the multiple participles ranked from high to low.
In a third aspect, the present application provides an electronic device, which includes a memory and a processor, where the memory stores a computer program, and the processor executes the computer program to perform the method in the first aspect or any optional implementation manner of the first aspect.
In a fourth aspect, the present application provides a computer-readable storage medium, on which a computer program is stored, which, when executed by a processor, performs the method of the first aspect or any optional implementation manner of the first aspect.
In a fifth aspect, the present application provides a computer program product which, when run on a computer, causes the computer to perform the method of the first aspect or any optional implementation manner of the first aspect.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are required to be used in the embodiments of the present application will be briefly described below, it should be understood that the following drawings only illustrate some embodiments of the present application and therefore should not be considered as limiting the scope, and that those skilled in the art can also obtain other related drawings based on the drawings without inventive efforts.
Fig. 1 is a first flowchart of a method for determining importance of a word segmentation in a link according to an embodiment of the present application;
Fig. 2 is a second flowchart of a method for determining importance of a word segmentation in a link according to an embodiment of the present application;
Fig. 3 is a schematic diagram of neighborhood information provided in the embodiment of the present application;
Fig. 4 is a third flowchart of a method for determining importance of word segmentation in a link according to an embodiment of the present application;
Fig. 5 is a fourth flowchart of a method for determining importance of word segmentation in a link according to an embodiment of the present application;
Fig. 6 is a fifth flowchart of a method for determining importance of word segmentation in a link according to an embodiment of the present application;
Fig. 7 is a schematic structural diagram of a device for determining importance of word segmentation in a link according to an embodiment of the present application;
Fig. 8 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Reference numerals: 700 - word segmentation processing module; 710 - generation module; 720 - detection determination module; 730 - calculation module; 740 - adding module; 750 - display module; 8 - electronic device; 801 - processor; 802 - memory; 803 - communication bus.
Detailed Description
The technical solutions in the embodiments of the present application will be described below with reference to the drawings in the embodiments of the present application.
The embodiment of the application provides a method for determining the importance of participles in a link. When the link is a malicious link, the method allows the malicious features in it to be extracted, analyzed, and explained, which addresses the problem that the black-box nature of a malicious link detection model hides the main features of a malicious link. As shown in Fig. 1, the method for determining participle importance in a link includes:
Step S100: performing word segmentation on the target link text to obtain a participle sequence.
Step S110: generating neighborhood information corresponding to each participle according to the plurality of participles.
Step S120: generating a plurality of remaining neighborhood information corresponding to each participle according to the neighborhood information corresponding to each participle.
Step S130: generating a plurality of updated link texts corresponding to each participle by combining each piece of remaining neighborhood information with the participles outside the corresponding neighborhood information.
Step S140: performing malicious link detection on the plurality of updated link texts corresponding to each participle, and determining the importance of the participle in the target link text according to the detection result.
In step S100, the present solution may acquire a target link text, which may be a link text input through a user operation, such as Uniform Resource Locator (URL) link text data. The participle sequence may include a plurality of participles in sequential order, where sequential order means the left-to-right order in which they appear in the original link text. As a possible implementation manner, the word segmentation in step S100 may treat each word separated by the "/" character in the target link text as a participle, so as to obtain the participle sequence. For example, if a URL link text is http://W1/W2/W3/W4/……/Wn, the resulting participle sequence is (W1, W2, …, Wn).
As a possible implementation manner, the target link text may be encoded, so the scheme can also decode the target link text before performing word segmentation on it.
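The decoding and "/"-based segmentation described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function name `segment_link` is hypothetical, percent-decoding via `urllib.parse.unquote` is assumed as the decoding step, and dropping the scheme prefix is an assumption made to match the example sequence (W1, W2, …, Wn).

```python
from urllib.parse import unquote


def segment_link(link_text: str) -> list[str]:
    """Decode a link text and split it into participles on '/'."""
    # Assumption: the "codes" in the link are percent-encoding.
    decoded = unquote(link_text)
    # Assumption: the scheme (e.g. "http://") is not itself a participle.
    if "://" in decoded:
        decoded = decoded.split("://", 1)[1]
    # Keep the non-empty pieces in their original left-to-right order.
    return [piece for piece in decoded.split("/") if piece]


participles = segment_link("http://example.com/a%20b/login.php/x")
print(participles)
```

Empty pieces (for instance from a trailing "/") are discarded here; whether they should count as participles is not specified in the text.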
After the participle sequence is obtained in the above manner, step S110 generates neighborhood information corresponding to each participle according to the plurality of participles. The neighborhood information of a participle is formed by the participle itself, the N participles ordered before it, and the N participles ordered after it. Because it includes the N participles on either side, the neighborhood information can represent the context of the participle. Specifically, as shown in Fig. 2, the neighborhood information corresponding to each participle may be obtained in the following manner:
Step S200: for each participle in the participle sequence, extracting the participle, the N participles ordered before it, and the N participles ordered after it to form the neighborhood information corresponding to the participle, thereby obtaining the neighborhood information corresponding to each of the plurality of participles.
In the above embodiment, for each participle, the method extracts the participle, the N participles ordered before it, and the N participles ordered after it to form the neighborhood information corresponding to the participle. In the example of Fig. 3, with N taken as 3, the participle W4 in the URL link text, the 3 participles before it (W1, W2, W3), and the 3 participles after it (W5, W6, W7) constitute the context information of W4, that is, its neighborhood information, which contains 7 participles.
It should be noted that, when generating neighborhood information, some participles have fewer than N participles before or after them. In that case, the scheme first calculates the difference between N and the number of participles before or after the participle, and then adds that many preset characters before or after the participle. For example, taking N as 3, when fewer than 3 participles precede a participle, the scheme calculates the difference between 3 and the number of preceding participles, say 2, and adds 2 preset characters before the participle, so that the neighborhood information of every participle contains the same number of participles.
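The neighborhood construction with padding can be sketched as below. The function name `neighborhood` and the padding value `"<PAD>"` are hypothetical; the text only says that "preset characters" are added, without naming them.

```python
PAD = "<PAD>"  # hypothetical preset character; the text does not name one


def neighborhood(tokens: list[str], index: int, n: int) -> list[str]:
    """Return the participle at `index` with its N neighbors on each side."""
    before = tokens[max(0, index - n):index]
    after = tokens[index + 1:index + 1 + n]
    # Pad each side up to N entries so every neighborhood has 2N + 1 entries.
    before = [PAD] * (n - len(before)) + before
    after = after + [PAD] * (n - len(after))
    return before + [tokens[index]] + after


tokens = ["W1", "W2", "W3", "W4", "W5", "W6", "W7", "W8"]
print(neighborhood(tokens, 3, 3))  # neighborhood of W4
print(neighborhood(tokens, 0, 3))  # W1 has no predecessors, so 3 pads are added
```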
After the neighborhood information of each participle is obtained in the above manner, in the present scheme, step S120 is executed to generate a plurality of remaining neighborhood information corresponding to each participle according to the neighborhood information corresponding to each participle. As a possible implementation manner, as shown in fig. 4, the present solution may generate a plurality of remaining neighborhood information corresponding to each participle by the following manners, including:
Step S400: for the neighborhood information of each participle, randomly deleting participles from the neighborhood information according to a plurality of preset deletion counts, one count at a time, to obtain a plurality of pieces of remaining neighborhood information for the participle.
In the above embodiment, for the neighborhood information of each participle, the present solution may define a plurality of deletion counts and randomly delete the corresponding number of participles from the neighborhood information, thereby obtaining a plurality of pieces of remaining neighborhood information. Following the foregoing example, the neighborhood information of each participle includes 2N+1 participles, so the scheme may define deletion counts of 1, 2, 3, ……, N participles and obtain several pieces of remaining neighborhood information from each neighborhood. In addition, for a given deletion count, the scheme may delete different participles, so several pieces of remaining neighborhood information can also be obtained from a single deletion count. For example, when deleting 1 participle from the 2N+1 participles in the neighborhood information, the scheme may delete the first participle, or the second participle, and so on up to the last participle, which yields 2N+1 pieces of remaining neighborhood information for a deletion count of 1.
Because the remaining neighborhood information is generated by deleting participles from the neighborhood information, the participle itself may also be deleted. For example, when generating the pieces of remaining neighborhood information for the participle W4, W4 may be among the deleted participles, so some pieces of remaining neighborhood information contain W4 and others do not. The pieces of remaining neighborhood information corresponding to each participle can therefore be divided into those that include the corresponding participle and those that do not.
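A minimal sketch of the random deletion step follows. The function name, the `samples_per_count` parameter, and the seeded random generator are assumptions added for illustration; the text specifies only that participles are deleted at random for each preset deletion count. Each result records whether the center participle survived, which is tracked by position so that repeated participles do not cause confusion.

```python
import random


def remaining_neighborhoods(
    neigh: list[str],
    delete_counts: list[int],
    samples_per_count: int,
    seed: int = 0,
) -> list[tuple[list[str], bool]]:
    """Randomly delete k entries from `neigh` for each k in `delete_counts`.

    Returns (remaining_neighborhood, contains_center) pairs.
    """
    rng = random.Random(seed)  # seeded only to make the sketch reproducible
    center = len(neigh) // 2   # position of the participle under study
    results = []
    for k in delete_counts:
        for _ in range(samples_per_count):
            dropped = set(rng.sample(range(len(neigh)), k))
            kept = [tok for pos, tok in enumerate(neigh) if pos not in dropped]
            results.append((kept, center not in dropped))
    return results


neigh = ["W1", "W2", "W3", "W4", "W5", "W6", "W7"]
for kept, has_center in remaining_neighborhoods(neigh, [1, 2, 3], 2):
    print(kept, has_center)
```

Sampling a fixed number of deletions per count keeps the number of updated link texts bounded; enumerating every combination is an alternative when 2N+1 is small.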
After obtaining the remaining neighborhood information corresponding to each participle in the above manner, the method may execute step S130 to generate a plurality of updated link texts corresponding to each participle by combining each piece of remaining neighborhood information with the participles outside the corresponding neighborhood information.
Step S130 is explained by way of example as follows. As described above, the neighborhood information corresponding to the participle W4 is (W1, W2, …, W7). All participles in the participle sequence other than (W1, W2, …, W7) are the participles outside the corresponding neighborhood information referred to in step S130; for example, they may be (W8, W9, …, Wn). Some of the pieces of remaining neighborhood information obtained by deleting participles from the neighborhood information of W4 contain W4 and some do not. Each piece of remaining neighborhood information is combined with the other participles (W8, W9, …, Wn) to form the plurality of updated link texts, which include updated link texts with the participle and updated link texts without the participle.
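The combination in step S130 can be sketched as rebuilding the link from every participle except the ones deleted inside the neighborhood window. The function name `build_updated_link`, the representation of deletions as offsets within the window (0 to 2N, with offset N being the participle itself), and re-joining with "/" are all assumptions for illustration.

```python
def build_updated_link(
    tokens: list[str], index: int, n: int, dropped_offsets: set[int]
) -> tuple[str, bool]:
    """Rebuild a link text after deleting some entries of one neighborhood.

    `dropped_offsets` are positions inside the window around `index`.
    Returns (updated_link_text, contains_target_participle).
    """
    window_start = index - n  # may be negative near the left edge (padding)
    kept = []
    for pos, tok in enumerate(tokens):
        offset = pos - window_start
        in_window = 0 <= offset <= 2 * n
        if in_window and offset in dropped_offsets:
            continue  # deleted from the neighborhood
        kept.append(tok)  # surviving neighbor or participle outside the window
    return "/".join(kept), n not in dropped_offsets


tokens = [f"W{i}" for i in range(1, 10)]
print(build_updated_link(tokens, 3, 3, {3}))     # W4 itself deleted
print(build_updated_link(tokens, 3, 3, {0, 6}))  # W1 and W7 deleted, W4 kept
```

Offsets that fall on padding positions simply match no real participle, so padded neighborhoods need no special handling in this sketch.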
After obtaining the updated link texts, in the present scheme, step S140 is executed to perform malicious link detection on the updated link texts corresponding to each participle, and determine the importance of the participle in the target link text according to the detection result.
As a possible implementation, as shown in fig. 5, the present solution may specifically determine the word segmentation importance degree by the following ways, including:
Step S500: performing malicious link detection on the plurality of updated link texts corresponding to each participle to obtain an updated detection result for each of those updated link texts.
Step S510: calculating a first number, which is the number of updated link texts containing the participle whose detection result is the same as the original detection result of the target link text.
Step S520: calculating a second number, which is the number of updated link texts not containing the participle whose detection result differs from the original detection result of the target link text.
Step S530: calculating the importance of the corresponding participle to the target link text according to the first number and the second number.
In step S500, the present solution performs malicious link detection on each of the updated link texts obtained for each participle, so as to obtain an updated detection result for each updated link text. Specifically, each updated link text corresponding to each participle may be input into a preset malicious link detection model, and the updated detection result output by the model for that updated link text is obtained. The preset malicious link detection model may be any mature malicious link detection model currently available.
After the updated detection results are obtained, the present solution may obtain the original detection result of the target link text. The original detection result may be obtained by inputting the target link text into the preset malicious link detection model and taking the model's output; it indicates whether the target link text is a malicious link.
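A minimal sketch of this detection step is given below. The detector is modeled as any callable that maps a link text to a boolean; `toy_detector` is a purely hypothetical stand-in for the preset malicious link detection model, and `run_detection` is an illustrative name, not one used by the source.

```python
from typing import Callable, List, Sequence, Tuple

# A detector maps a link text to True (malicious) or False (benign).
Detector = Callable[[str], bool]


def toy_detector(text: str) -> bool:
    """Hypothetical stand-in model: flags any text containing 'evil'."""
    return "evil" in text


def run_detection(
    detector: Detector, target_text: str, updated_texts: Sequence[str]
) -> Tuple[bool, List[bool]]:
    """Return the original detection result and one updated result per text."""
    # Original detection result of the unmodified target link text.
    original = detector(target_text)
    # Updated detection results, using the same preset model.
    updated = [detector(text) for text in updated_texts]
    return original, updated


original, updated = run_detection(
    toy_detector,
    "http://evil.example/login",
    ["http://.example/login", "http://evil.example/"],
)
print(original, updated)
```

Passing the detector as a parameter reflects the statement that any mature detection model may be plugged in.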
On the basis of the above, the present solution may perform step S510 of calculating a first number of updated link texts with participles that are the same as the original detection result of the target link text.
In step S510, each participle corresponds to a plurality of updated link texts, and each updated link text has its own detection result. Because these updated link texts include both updated link texts with the participle and updated link texts without the participle, once the detection result of each updated link text and the original detection result of the target link text are available, the present solution may first find the updated link texts whose detection result is the same as the original detection result, and then count how many of them contain the participle. This count is the first number.
Meanwhile, step S520 may be executed to obtain the second number, namely the number of updated link texts without the participle whose detection result is different from the original detection result of the target link text. Step S530 is then executed to calculate the importance of the corresponding participle to the target link text according to the first number and the second number.
In step S530, the present solution may calculate the ratio between the first number and the second number of each participle to obtain the importance of that participle to the target link text. The first number is the number of updated link texts with the participle whose detection result is the same as the original detection result of the target link text. If the original detection result is a malicious link, then the larger the first number, the more the participle contributes to the link being judged malicious, that is, the participle is an important feature of the malicious link. The second number is the number of updated link texts without the participle whose detection result is different from the original detection result of the target link text; where the original detection result is a malicious link, the second number can likewise indicate how important the participle is to the target link text being a malicious link. The two may also be combined by calculating their ratio: the larger the ratio, the more important the participle is to the target link text.
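The counting in steps S510 to S530 can be sketched as below. The source only states that a ratio of the two numbers is calculated; taking the ratio as first number divided by second number, and the handling of a zero second number, are assumptions made for this illustration, as is the function name `participle_importance`.

```python
import math
from typing import Iterable, Tuple


def participle_importance(
    original: bool, outcomes: Iterable[Tuple[bool, bool]]
) -> Tuple[int, int, float]:
    """Compute (first number, second number, ratio) for one participle.

    original -- original detection result of the target link text
    outcomes -- (has_participle, updated_detection_result) per updated link text
    """
    first = 0   # texts WITH the participle whose result equals the original
    second = 0  # texts WITHOUT the participle whose result differs from it
    for has_participle, result in outcomes:
        if has_participle and result == original:
            first += 1
        elif not has_participle and result != original:
            second += 1
    if second == 0:
        # Assumption: the source does not say how a zero denominator is handled.
        ratio = math.inf if first else 0.0
    else:
        # Assumption: the ratio is taken as first number / second number.
        ratio = first / second
    return first, second, ratio


outcomes = [(True, True), (True, True), (True, False), (False, False), (False, True)]
print(participle_importance(True, outcomes))
```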
In an optional implementation manner of this embodiment, after performing step S140 to determine the importance of the word in the target link text, as shown in fig. 6, the present solution may further perform the following steps:
Step S600: sort the plurality of participles in descending order of importance, and display the sorted participles.
In the above embodiment, the importance calculated for each participle in the target link text is sorted in descending order and displayed, which explains how important each feature of interest in the target link text is.
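Step S600 amounts to a descending sort over the per-participle importance values. The sketch below assumes the importance values are held in a dictionary keyed by participle; the name `rank_participles` and the sample values are hypothetical.

```python
from typing import Dict, List, Tuple


def rank_participles(importance: Dict[str, float]) -> List[Tuple[str, float]]:
    """Return (participle, importance) pairs sorted from largest to smallest."""
    return sorted(importance.items(), key=lambda item: item[1], reverse=True)


# Illustrative values only; they are not taken from the source.
ranking = rank_participles({"login": 0.5, "evil": 3.0, "com": 1.0})
for participle, score in ranking:
    print(f"{participle}\t{score}")
```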
According to the above method for determining the importance of participles in a link, word segmentation processing is first performed on the target link text to obtain a participle sequence; neighborhood information of each participle is then generated from the plurality of participles of the participle sequence; a plurality of pieces of residual neighborhood information corresponding to each participle are generated from that neighborhood information; updated link texts with the participle and updated link texts without the participle are generated for each participle from each piece of residual neighborhood information and the other participles except the corresponding neighborhood information; malicious link detection is performed on the updated link texts corresponding to each participle; and the importance of the participle in the target link text is determined according to the detection result. Because the neighborhood information represents the context features of a participle, the updated link texts take the context of each participle into account. Because the updated link texts include both those with the participle and those without it, the importance of the participle to the target link text can be learned after malicious link detection is performed on them, so that the malicious features of the link are extracted and explained automatically.
Fig. 7 shows a schematic block diagram of the apparatus for determining the importance of participles in a link provided by the present application. It should be understood that the apparatus corresponds to the method embodiments of fig. 1 to 6 and can execute the steps of the foregoing method. The apparatus includes at least one software functional module that can be stored in a memory in the form of software or firmware, or solidified in the operating system (OS) of the apparatus. Specifically, the apparatus includes: a word segmentation processing module 700, configured to perform word segmentation processing on the target link text to obtain a participle sequence, where the participle sequence includes a plurality of sequentially ordered participles; a generating module 710, configured to generate neighborhood information corresponding to each participle according to the plurality of participles, where the neighborhood information corresponding to each participle is formed by the participle together with the N participles ranked before it and the N participles ranked after it; to generate a plurality of pieces of residual neighborhood information corresponding to each participle according to the neighborhood information corresponding to that participle, where the residual neighborhood information is obtained by deleting preset participles from the corresponding neighborhood information; and to generate a plurality of updated link texts corresponding to each participle according to each piece of residual neighborhood information and the other participles except the corresponding neighborhood information, where the updated link texts include updated link texts with the participle and updated link texts without the participle; and a detection determining module 720, configured to perform malicious link detection on the plurality of updated link texts corresponding to each participle and to determine the importance of the participle in the target link text according to the detection result.
According to the device for determining the importance of the participles in the link, word segmentation processing is firstly carried out on a target link text to obtain a participle sequence, then neighborhood information of each participle is generated according to a plurality of participles of the participle sequence, a plurality of residual neighborhood information corresponding to each participle is generated according to the neighborhood information corresponding to each participle, an updated link text with the participle and an updated link text without the participle corresponding to each participle are generated according to each residual neighborhood information and other participles except the corresponding neighborhood information, so that malicious link detection is carried out on the updated link texts corresponding to each participle, and the importance of the participles in the target link text is determined according to a detection result. The neighborhood information generated by the scheme can represent the context characteristics of the participle, so that the obtained updated link text considers the context information of each participle, and the obtained updated link texts comprise the updated link text with the participle and the updated link text without the participle, so that the importance degree of the participle on the target link text can be known after malicious link detection is carried out on the updated link text, and further, the link malicious characteristics are automatically extracted and explained.
In an optional implementation manner of this embodiment, the detection determining module 720 is specifically configured to: perform malicious link detection on the plurality of updated link texts corresponding to each participle to obtain an updated detection result for each of those updated link texts; calculate a first number of updated link texts with the participle whose detection result is the same as the original detection result of the target link text; calculate a second number of updated link texts without the participle whose detection result is different from the original detection result of the target link text; and calculate the importance of the corresponding participle to the target link text according to the first number and the second number.
In an optional implementation manner of this embodiment, the detection determining module 720 is further specifically configured to input each updated link text corresponding to each participle into a preset malicious link detection model, and obtain an updated detection result of each updated link text in a plurality of updated link texts corresponding to each participle output by the malicious link detection model.
In an optional implementation manner of this embodiment, the generating module 710 is specifically configured to, for each participle in the participle sequence, extract the N participles ranked before the participle and the N participles ranked after it to form the neighborhood information corresponding to the participle, thereby obtaining the neighborhood information corresponding to each of the plurality of participles.
In an optional implementation manner of this embodiment, the apparatus further includes a calculating module 730, configured to calculate the difference between N and the number of participles ranked before or after a participle if that number is less than N; and an adding module 740, configured to add a number of preset characters equal to the difference before or after the participle.
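The neighborhood extraction and padding performed by modules 710, 730 and 740 can be sketched as follows. The function name `neighborhood`, the `<PAD>` placeholder used as the preset character, and the choice to place the padding at the outer edge of the window are assumptions made for the example.

```python
from typing import List, Sequence


def neighborhood(
    tokens: Sequence[str], index: int, n: int, pad: str = "<PAD>"
) -> List[str]:
    """Return the participle at `index` with N neighbors on each side.

    When fewer than N participles exist before or after it, the shortfall
    is filled with the preset character `pad` (hypothetical placeholder).
    """
    before = list(tokens[max(0, index - n):index])
    after = list(tokens[index + 1:index + 1 + n])
    # Difference between N and the number of available participles.
    missing_before = n - len(before)
    missing_after = n - len(after)
    # Assumption: padding goes on the outer side of the existing neighbors.
    before = [pad] * missing_before + before
    after = after + [pad] * missing_after
    return before + [tokens[index]] + after


tokens = ["a", "b", "c", "d", "e"]
print(neighborhood(tokens, 1, 3))
print(neighborhood(tokens, 4, 3))
```

With this padding every neighborhood has the same length, 2N + 1, regardless of where the participle sits in the sequence.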
In an optional implementation manner of this embodiment, the generating module 710 is further specifically configured to randomly delete, according to a plurality of preset deletion numbers, a corresponding number of segmented words in the neighborhood information for the neighborhood information of each segmented word, and obtain a plurality of remaining neighborhood information of the segmented words.
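A minimal sketch of the random deletion is shown below, assuming that each preset deletion number yields one piece of remaining neighborhood information. The function name `remaining_neighborhoods` and the optional `seed` parameter (added only to make runs repeatable) are not part of the source.

```python
import random
from typing import List, Optional, Sequence


def remaining_neighborhoods(
    neighborhood: Sequence[str],
    deletion_counts: Sequence[int],
    seed: Optional[int] = None,
) -> List[List[str]]:
    """Randomly delete k participles from the neighborhood for each preset k."""
    rng = random.Random(seed)
    results = []
    for k in deletion_counts:
        # Choose k distinct positions to delete; raises ValueError if k is
        # larger than the neighborhood.
        dropped = set(rng.sample(range(len(neighborhood)), k))
        results.append(
            [tok for i, tok in enumerate(neighborhood) if i not in dropped]
        )
    return results


window = ["W1", "W2", "W3", "W4", "W5", "W6", "W7"]
for remaining in remaining_neighborhoods(window, [1, 2, 3], seed=0):
    print(remaining)
```

Because the deleted positions are random, some results keep the center participle and others drop it, which is what later allows updated link texts with and without the participle to be compared.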
In an optional implementation manner of this embodiment, the apparatus further includes a presentation module 750, configured to sort the plurality of participles according to the importance degrees of the plurality of participles from large to small, and present the plurality of participles sorted according to the importance degrees from large to small.
As shown in fig. 8, the present application provides an electronic device 8 comprising a processor 801 and a memory 802. The processor 801 and the memory 802 are interconnected and communicate with each other via a communication bus 803 and/or another form of connection mechanism (not shown). The memory 802 stores a computer program executable by the processor 801; when the electronic device runs, the processor 801 executes the computer program to perform the method of any of the optional implementations, for example steps S100 to S140: performing word segmentation processing on the target link text to obtain a participle sequence; generating neighborhood information corresponding to each participle according to the participles; generating a plurality of pieces of residual neighborhood information corresponding to each participle according to the neighborhood information corresponding to each participle; generating a plurality of updated link texts corresponding to each participle according to each piece of residual neighborhood information in combination with the other participles except the corresponding neighborhood information; and performing malicious link detection on the plurality of updated link texts corresponding to each participle, and determining the importance of the participle in the target link text according to the detection result.
The present application provides a computer-readable storage medium, where a computer program is stored on the computer-readable storage medium, and when the computer program is executed by a processor, the computer program performs the method executed by the terminal device or the method executed by the cloud server in any of the foregoing optional implementation manners.
The computer-readable storage medium may be implemented by any type of volatile or nonvolatile Memory device or combination thereof, such as a Static Random Access Memory (SRAM), an Electrically Erasable Programmable Read-Only Memory (EEPROM), an Erasable Programmable Read-Only Memory (EPROM), a Programmable Read-Only Memory (PROM), a Read-Only Memory (ROM), a magnetic Memory, a flash Memory, a magnetic disk, or an optical disk.
The present application provides a computer program product, which when run on a computer, causes the computer to execute a method executed by a terminal device or a method executed by a cloud server in any optional implementation manner.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. The above-described embodiments of the apparatus are merely illustrative, and for example, the division of the units is only one logical division, and there may be other divisions when actually implemented, and for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection of devices or units through some communication interfaces, and may be in an electrical, mechanical or other form.
In addition, units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
Furthermore, the functional modules in the embodiments of the present application may be integrated together to form an independent part, or each module may exist separately, or two or more modules may be integrated to form an independent part.
In this document, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions.
The above description is only an example of the present application and is not intended to limit the scope of the present application, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present application shall be included in the protection scope of the present application.

Claims (10)

1. A method for determining importance of word segmentation in a link, the method comprising:
performing word segmentation processing on a target link text to obtain a participle sequence, wherein the participle sequence comprises a plurality of sequentially ordered participles;
generating neighborhood information corresponding to each participle according to the plurality of participles, wherein the neighborhood information corresponding to each participle is formed by the participle together with the N participles ranked before it and the N participles ranked after it;
generating a plurality of residual neighborhood information corresponding to each participle according to the neighborhood information corresponding to each participle, wherein the residual neighborhood information corresponding to each participle is obtained by deleting preset participles in the corresponding neighborhood information;
generating a plurality of updated link texts corresponding to each participle according to each residual neighborhood information and other participles except the corresponding neighborhood information, wherein the updated link texts comprise the updated link texts with the participles and the updated link texts without the participles;
performing malicious link detection on the plurality of updated link texts corresponding to each participle, and determining the importance of the participle in the target link text according to the detection result.
2. The method according to claim 1, wherein the performing malicious link detection on a plurality of updated link texts corresponding to each participle and determining the importance of the participle in the target link text according to a detection result comprises:
performing malicious link detection on the plurality of updated link texts corresponding to each participle to obtain an updated detection result of each updated link text in the plurality of updated link texts corresponding to each participle;
calculating a first number of updated link texts with the participle whose detection result is the same as the original detection result of the target link text; and calculating a second number of updated link texts without the participle whose detection result is different from the original detection result of the target link text;
calculating the importance of the corresponding participle to the target link text according to the first number and the second number.
3. The method according to claim 2, wherein the performing malicious link detection on the plurality of updated linked texts corresponding to each participle to obtain an updated detection result of each updated linked text in the plurality of updated linked texts corresponding to each participle comprises:
inputting each updated link text corresponding to each participle into a preset malicious link detection model, and obtaining an updated detection result of each updated link text in a plurality of updated link texts corresponding to each participle output by the malicious link detection model.
4. The method of claim 3, wherein prior to said calculating a first number of updated link texts with said participles that are the same as the original detection result of the target link text, the method further comprises:
inputting the target link text into the preset malicious link detection model, and obtaining the detection result of the target link text output by the malicious link detection model as the original detection result.
5. The method of claim 1, wherein generating neighborhood information corresponding to each participle from the plurality of participles comprises:
for each participle in the participle sequence, extracting the N participles ranked before the participle and the N participles ranked after the participle to form the neighborhood information corresponding to the participle, thereby obtaining the neighborhood information corresponding to each of the plurality of participles.
6. The method of claim 5, further comprising:
if the number of participles ranked before or after the participle is less than N, calculating the difference between N and that number;
adding a number of preset characters equal to the difference before or after the participle.
7. The method of claim 1, wherein generating a plurality of remaining neighborhood information corresponding to each segmented word according to the neighborhood information corresponding to each segmented word comprises:
for the neighborhood information of each participle, randomly deleting a corresponding number of participles from the neighborhood information according to a plurality of preset deletion numbers, to obtain a plurality of pieces of remaining neighborhood information of the participle.
8. The method according to claim 1, wherein after determining the importance of the word in the target link text according to the detection result, the method further comprises:
sorting the plurality of participles in descending order of importance, and displaying the plurality of participles sorted in descending order of importance.
9. An apparatus for determining importance of a participle in a link, the apparatus comprising:
a word segmentation processing module, configured to perform word segmentation processing on a target link text to obtain a participle sequence, wherein the participle sequence comprises a plurality of sequentially ordered participles;
a generating module, configured to generate neighborhood information corresponding to each participle according to the plurality of participles, wherein the neighborhood information corresponding to each participle is formed by the participle together with the N participles ranked before it and the N participles ranked after it; to generate a plurality of pieces of residual neighborhood information corresponding to each participle according to the neighborhood information corresponding to each participle, wherein the residual neighborhood information corresponding to each participle is obtained by deleting preset participles from the corresponding neighborhood information; and to generate a plurality of updated link texts corresponding to each participle according to each piece of residual neighborhood information and the other participles except the corresponding neighborhood information, wherein the updated link texts comprise updated link texts with the participle and updated link texts without the participle;
a detection determining module, configured to perform malicious link detection on the plurality of updated link texts corresponding to each participle and to determine the importance of the participle in the target link text according to a detection result.
10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the method of any one of claims 1 to 8.
CN202111616516.5A 2021-12-27 2021-12-27 Method and device for determining importance of word segmentation in link Active CN114330331B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111616516.5A CN114330331B (en) 2021-12-27 2021-12-27 Method and device for determining importance of word segmentation in link

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111616516.5A CN114330331B (en) 2021-12-27 2021-12-27 Method and device for determining importance of word segmentation in link

Publications (2)

Publication Number Publication Date
CN114330331A CN114330331A (en) 2022-04-12
CN114330331B true CN114330331B (en) 2022-09-16

Family

ID=81014641

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111616516.5A Active CN114330331B (en) 2021-12-27 2021-12-27 Method and device for determining importance of word segmentation in link

Country Status (1)

Country Link
CN (1) CN114330331B (en)

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108881138A (en) * 2017-10-26 2018-11-23 新华三信息安全技术有限公司 A kind of web-page requests recognition methods and device
CN110569496A (en) * 2018-06-06 2019-12-13 腾讯科技(深圳)有限公司 Entity linking method, device and storage medium
CN111400705A (en) * 2020-03-04 2020-07-10 支付宝(杭州)信息技术有限公司 Application program detection method, device and equipment
CN112765428A (en) * 2021-01-15 2021-05-07 济南大学 Malicious software family clustering and identifying method and system
CN113051876A (en) * 2021-04-02 2021-06-29 网易(杭州)网络有限公司 Malicious website identification method and device, storage medium and electronic equipment
CN113221032A (en) * 2021-04-08 2021-08-06 北京智奇数美科技有限公司 Link risk detection method, device and storage medium
CN113468315A (en) * 2021-09-02 2021-10-01 北京华云安信息技术有限公司 Vulnerability vendor name matching method
CN113688240A (en) * 2021-08-25 2021-11-23 南京中孚信息技术有限公司 Threat element extraction method, device, equipment and storage medium
CN113761218A (en) * 2021-04-27 2021-12-07 腾讯科技(深圳)有限公司 Entity linking method, device, equipment and storage medium

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103902889A (en) * 2012-12-26 2014-07-02 腾讯科技(深圳)有限公司 Malicious message cloud detection method and server
US10880330B2 (en) * 2017-05-19 2020-12-29 Indiana University Research & Technology Corporation Systems and methods for detection of infected websites

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Deep learning feature exploration for Android malware detection;Nan Zhang 等;《Applied Soft Computing Journal》;20210102;第1-7页 *
基于半监督学习的恶意URL检测方法;麻瓯勃 等;《计算机系统应用》;20201111;第1-10页 *

Also Published As

Publication number Publication date
CN114330331A (en) 2022-04-12

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant