CN113111147A

CN113111147A - Text type identification method and device, electronic equipment and storage medium

Info

Publication number: CN113111147A
Application number: CN202010032858.1A
Authority: CN
Inventors: 艾江俊; 罗杰
Original assignee: Sangfor Technologies Co Ltd
Current assignee: Sangfor Technologies Co Ltd
Priority date: 2020-01-13
Filing date: 2020-01-13
Publication date: 2021-07-13

Abstract

The application discloses a text type identification method, a text type identification device, an electronic device and a computer readable storage medium, wherein the method comprises the following steps: acquiring a file to be detected; identifying the suffix name of the file to be detected to obtain a first identification result; identifying keywords in the text content of the file to be detected to obtain a second identification result; and determining the text type of the file to be detected by combining the first recognition result and the second recognition result. Therefore, when the text type is identified, not only the suffix name of the file to be detected is identified, but also the keyword in the text content of the file is identified, and the text type of the file to be detected is obtained by combining the identification result of the suffix name and the identification result of the keyword, so that the problem of misjudgment possibly caused by identification only through the suffix name is avoided, and the identification accuracy is improved.

Description

Text type identification method and device, electronic equipment and storage medium

Technical Field

The present application relates to the field of computer technologies, and in particular, to a text type recognition method and apparatus, an electronic device, and a computer-readable storage medium.

Background

With the development of the internet, more and more programming languages have appeared, some similar in style and others completely different. When one detection engine needs to process several languages at once, for example in malicious Webshell backdoor detection, running detection without distinguishing the language type produces false positives, because the attack techniques used in two scripting languages can differ completely. Therefore, a text type identification method is needed to identify the language type of the file to be detected, so that the file can then be detected according to the characteristics of each language.

Conventional techniques generally determine the language type of a file to be detected directly from the suffix name of the file. However, this approach presupposes that the file suffix can be trusted. If an attacker changes the suffix of a script to one belonging to another language, the text type is misjudged. Therefore, how to provide a text type recognition method that solves the above problem is a matter of concern for those skilled in the art.

Disclosure of Invention

The application aims to provide a text type identification method and device, an electronic device and a computer readable storage medium, and identification accuracy is improved.

In order to achieve the above object, the present application provides a text type recognition method, including:

acquiring a file to be detected;

identifying the suffix name of the file to be detected to obtain a first identification result;

identifying keywords in the text content of the file to be detected to obtain a second identification result;

and determining the text type of the file to be detected by combining the first recognition result and the second recognition result.

Optionally, identifying the suffix name of the file to be detected to obtain a first identification result, including:

acquiring the file name of the file to be detected;

extracting a suffix name of the file to be detected from the file name;

and searching whether a character string identical to the suffix name exists in a preset character string set or not to obtain a first identification result aiming at the suffix name.

Optionally, recognizing the keywords in the text content of the file to be detected to obtain a second recognition result, including:

reading the text content of the file to be detected, and extracting keywords from the text content by utilizing multi-pattern matching;

analyzing and counting each keyword to obtain a score corresponding to each language type;

and determining the language type with the highest score as the second recognition result.

Optionally, the method further includes:

collecting keywords corresponding to each language type to obtain a preset keyword set;

and obtaining the corresponding preset weight of each preset keyword under each language type by counting the occurrence probability of each preset keyword in the preset keyword set or performing weighted calculation on each preset keyword by utilizing machine learning.

Optionally, analyzing and counting each keyword to obtain a score corresponding to each language type, including:

counting the occurrence frequency of each keyword, and acquiring a corresponding preset weight of each keyword under each language type;

and calculating a weighting score corresponding to each language type according to the occurrence times and the preset weight.

Optionally, the determining the text type of the file to be detected by combining the first recognition result and the second recognition result includes:

if the first recognition result is the same as the second recognition result, directly determining the first recognition result or the second recognition result as the text type of the file to be detected;

if the first recognition result is different from the second recognition result, judging whether the weighted score is greater than or equal to a preset threshold value;

if so, determining the second recognition result as the text type of the file to be detected;

and if not, determining the first recognition result as the text type of the file to be detected.

To achieve the above object, the present application provides a text type recognition apparatus, comprising:

the file acquisition module is used for acquiring a file to be detected;

the first identification module is used for identifying the suffix name of the file to be detected to obtain a first identification result;

the second identification module is used for identifying keywords in the text content of the file to be detected to obtain a second identification result;

and the type determining module is used for determining the text type of the file to be detected by combining the first recognition result and the second recognition result.

Optionally, the second identification module includes:

the keyword extraction unit is used for reading the text content of the file to be detected and extracting keywords from the text content by utilizing multi-pattern matching;

the score counting unit is used for analyzing and counting each keyword to obtain a score corresponding to each language type;

and the result determining unit is used for determining the language type with the highest score as the second recognition result.

To achieve the above object, the present application provides an electronic device including:

a memory for storing a computer program;

a processor for implementing the steps of any of the text type recognition methods disclosed above when executing the computer program.

To achieve the above object, the present application provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of any of the text-type recognition methods disclosed in the foregoing.

According to the scheme, the text type identification method provided by the application comprises the following steps: acquiring a file to be detected; identifying the suffix name of the file to be detected to obtain a first identification result; identifying keywords in the text content of the file to be detected to obtain a second identification result; and determining the text type of the file to be detected by combining the first recognition result and the second recognition result. Therefore, when the text type is identified, not only the suffix name of the file to be detected is identified, but also the keyword in the text content of the file is identified, and the text type of the file to be detected is obtained by combining the identification result of the suffix name and the identification result of the keyword, so that the problem of misjudgment possibly caused by identification only through the suffix name is avoided, and the identification accuracy is improved.

The application also discloses a text type recognition device, an electronic device and a computer readable storage medium, which can also realize the technical effects.

It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the application.

Drawings

In order to illustrate the embodiments of the present application or the technical solutions in the prior art more clearly, the drawings used in the description of the embodiments or the prior art are briefly described below. The drawings in the following description show only some embodiments of the present application; those skilled in the art can obtain other drawings from them without creative effort.

FIG. 1 is a flowchart of a text type identification method disclosed in an embodiment of the present application;

FIG. 2 is a flowchart of another text type identification method disclosed in an embodiment of the present application;

FIG. 3 is a block diagram of a text type recognition apparatus disclosed in an embodiment of the present application;

FIG. 4 is a block diagram of an electronic device disclosed in an embodiment of the present application;

FIG. 5 is a block diagram of another electronic device disclosed in an embodiment of the present application.

Detailed Description

The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.

In the prior art, the language type of a file to be detected is usually determined directly from the suffix name of the file. However, this approach presupposes that the file suffix can be trusted. If an attacker changes the suffix of a script to one belonging to another language, the text type is misjudged.

Therefore, the embodiment of the application discloses a text type identification method, and the identification accuracy is improved.

Referring to fig. 1, a text type identification method disclosed in an embodiment of the present application includes:

S101: acquiring a file to be detected;

In the embodiment of the application, the file to be detected needs to be acquired first. The file to be detected is a file whose text type needs to be detected, and may specifically be a script file or the like. The file may be received from a user through a preset receiving interface, obtained as a user upload through a preset import interface, or downloaded over a network. The process of acquiring the file to be detected is not specifically limited.

S102: identifying the suffix name of the file to be detected to obtain a first identification result;

In this step, a first recognition result for the suffix name is obtained by recognizing the suffix name of the file to be detected. Specifically, the process of identifying the suffix name of the file to be detected to obtain the first identification result may include: acquiring the file name of the file to be detected; extracting the suffix name of the file to be detected from the file name; and searching a preset character string set for a character string identical to the suffix name, to obtain the first identification result for the suffix name.

It will be appreciated that the suffix name of a file is typically appended to the end of the main file name, the two being separated by a "." character. After the file name of the file to be detected is acquired, the suffix name following the separator can be extracted by locating the separator.

It should be noted that, in the present application, all possible file types may be collected in advance, and the suffix names corresponding to these file types are taken as the preset character string set. After the suffix name of the file is extracted, it is matched against the preset character string set. If the match succeeds, the suffix name of the current file has been recognized, and the file type corresponding to the suffix name is the recognition result for the suffix name.
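The suffix lookup described above can be sketched as follows. This is a minimal illustration rather than the implementation of the application: the contents of `PRESET_SUFFIXES`, the function name `identify_by_suffix`, and the decision to compare suffixes case-insensitively are all assumptions made for the example.

```python
import os

# Hypothetical preset character string set: suffix -> language type.
# The entries are illustrative only and are not taken from the application.
PRESET_SUFFIXES = {
    "php": "PHP",
    "jsp": "JSP",
    "asp": "ASP",
    "py": "Python",
}


def identify_by_suffix(file_name):
    """Return the language type for the file's suffix, or None if unknown."""
    # os.path.splitext splits at the last "." of the final path component.
    _, ext = os.path.splitext(file_name)
    # Assumption: suffixes are compared case-insensitively.
    suffix = ext.lstrip(".").lower()
    if not suffix:
        return None  # no separator found, so there is no suffix to look up
    return PRESET_SUFFIXES.get(suffix)
```

A return value of `None` stands for the case in which no recognition result can be obtained from the suffix name.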

S103: identifying keywords in the text content of the file to be detected to obtain a second identification result;

In the embodiment of the application, the text content of the file to be detected is read, and the keywords in the text content are identified to obtain a second identification result for the keywords.

It should be noted that, in the embodiment of the present application, the execution order between the step S102 and the step S103 is not limited, that is, the step S102 may be executed first, the step S103 may also be executed first, and the two steps may also be executed simultaneously, so as to implement the concurrent identification processing.
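One possible way to run the two recognition steps at the same time is a small thread pool. The sketch below is only an illustration: the function `recognize_concurrently` and its two callable parameters `by_suffix` and `by_keywords` are hypothetical names standing in for steps S102 and S103.

```python
from concurrent.futures import ThreadPoolExecutor


def recognize_concurrently(file_name, text, by_suffix, by_keywords):
    """Run the two recognition steps in parallel and return both results."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        # by_suffix stands in for S102, by_keywords for S103.
        first = pool.submit(by_suffix, file_name)
        second = pool.submit(by_keywords, text)
        # result() blocks until the corresponding step has finished.
        return first.result(), second.result()
```

Because neither step depends on the output of the other, the order in which the two futures complete does not affect the returned pair.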

S104: and determining the text type of the file to be detected by combining the first recognition result and the second recognition result.

It can be understood that after obtaining the first recognition result for the suffix name and the second recognition result for the keyword, the application can obtain a more accurate text type recognition result through comprehensive analysis of the two results.

In a specific implementation, if no recognition result can be obtained from the suffix name, the second recognition result for the keywords can be directly determined as the text type of the file to be detected. If no recognition result can be obtained from either the suffix name or the keywords, recognition of the current file to be detected can be judged to have failed, and prompt information can be output to remind the user to identify the file manually.

The embodiment of the application discloses another text type identification method, and compared with the previous embodiment, the embodiment further explains and optimizes the technical scheme. Referring to fig. 2, specifically:

S201: acquiring a file to be detected;

S202: identifying the suffix name of the file to be detected to obtain a first identification result;

S203: reading the text content of the file to be detected, and extracting keywords from the text content by utilizing multi-pattern matching;

In the embodiment of the application, the text content of the file to be detected is read, and the keywords are then extracted from the text content using a multi-pattern matching algorithm. A multi-pattern matching algorithm is an algorithm that matches a plurality of keywords against a text. The AC (Aho-Corasick) automaton is a common multi-pattern matching algorithm, and multi-pattern matching with an AC automaton may specifically proceed as follows: first, a dictionary tree (trie) is built from all preset keywords, and fail pointers are built on the dictionary tree, where a fail pointer determines the position from which matching should continue after a match fails on the dictionary tree. The text content is then run through the dictionary tree to match keywords.
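The construction described above (dictionary tree plus fail pointers, followed by a scan of the text) can be sketched in a few lines. This is a compact textbook-style Aho-Corasick automaton written for illustration, not the code of the application; the function names `build_automaton` and `find_keywords` and the node layout are assumptions.

```python
from collections import deque


def build_automaton(keywords):
    """Build the dictionary tree plus fail pointers for a list of keywords."""
    # Each node: outgoing edges, fail pointer, and keywords ending here.
    nodes = [{"next": {}, "fail": 0, "out": []}]
    for word in keywords:
        state = 0
        for ch in word:
            if ch not in nodes[state]["next"]:
                nodes.append({"next": {}, "fail": 0, "out": []})
                nodes[state]["next"][ch] = len(nodes) - 1
            state = nodes[state]["next"][ch]
        nodes[state]["out"].append(word)

    # Breadth-first pass: a node's fail pointer is the longest proper suffix
    # of its path that is also a path starting at the root.
    queue = deque(nodes[0]["next"].values())
    while queue:
        state = queue.popleft()
        for ch, child in nodes[state]["next"].items():
            fallback = nodes[state]["fail"]
            while fallback and ch not in nodes[fallback]["next"]:
                fallback = nodes[fallback]["fail"]
            nodes[child]["fail"] = nodes[fallback]["next"].get(ch, 0)
            # Inherit the keywords that end at the fail target.
            nodes[child]["out"] += nodes[nodes[child]["fail"]]["out"]
            queue.append(child)
    return nodes


def find_keywords(text, nodes):
    """Return a dict mapping each matched keyword to its occurrence count."""
    counts = {}
    state = 0
    for ch in text:
        # On a mismatch, follow fail pointers instead of restarting the scan.
        while state and ch not in nodes[state]["next"]:
            state = nodes[state]["fail"]
        state = nodes[state]["next"].get(ch, 0)
        for word in nodes[state]["out"]:
            counts[word] = counts.get(word, 0) + 1
    return counts
```

The occurrence counts returned by `find_keywords` are the kind of input the subsequent scoring step needs; keywords that never occur are simply absent from the dictionary.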

S204: analyzing and counting each keyword to obtain a score corresponding to each language type;

In this step, after the keywords are extracted, the occurrence count of each keyword may be counted, the preset weight of each keyword under each language type is obtained, and the weighted score corresponding to each language type is calculated from the occurrence counts and the preset weights. It is understood that the same keyword may exist in a plurality of languages, but the weight it carries in each language may be different; the same keyword may therefore correspond to a plurality of weights, one under each language type.

It should be noted that, in the embodiment of the present application, keywords corresponding to each language type may be collected in advance to obtain a preset keyword set, and each preset keyword in the set is weighted to determine its preset weight under each language type. The preset weight may be obtained either by counting the occurrence probability of each preset keyword in the preset keyword set, or by using machine learning to perform a weighting calculation on each preset keyword. As a specific implementation, sample data is first collected for each language type to form a sample set for machine learning training. The sample set is then preprocessed: each sample is converted into a feature string, where each keyword is one feature dimension and the feature string is the sequence formed by the occurrence counts of the keywords. A machine learning algorithm is then selected and trained on the generated feature strings to obtain the corresponding machine learning model. The model can itself be used to predict the type of a script, and feature importance scores, which serve as the keyword weights, are produced together with the model; common algorithms such as XGBoost, GBDT, Random Forest and decision trees can output them. During training, the importance index of each feature can be obtained by calling the feature_importance_() function and used as the weight of the corresponding keyword.
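For the statistical alternative (counting occurrence probability), one simple reading is sketched below. The application does not spell out how the probability is computed, so the choice made here, namely the fraction of a language's sample files that contain the keyword, is an assumption, as are the function name `estimate_weights` and the shape of its inputs.

```python
def estimate_weights(samples_by_language, preset_keywords):
    """Estimate a per-language weight for every preset keyword.

    Assumption: the weight is the fraction of a language's samples that
    contain the keyword. Other normalizations are equally conceivable.
    """
    weights = {}
    for language, samples in samples_by_language.items():
        weights[language] = {}
        for keyword in preset_keywords:
            # Count the samples of this language that contain the keyword.
            hits = sum(1 for text in samples if keyword in text)
            weights[language][keyword] = hits / len(samples) if samples else 0.0
    return weights
```

The result maps each language type to its own keyword-to-weight table, which matches the statement that one keyword may carry different weights under different language types.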

After the occurrence count of each keyword is counted and the preset weight of each keyword under each language type is obtained, the weighted score corresponding to each language type is calculated by a preset score calculation formula:

$$\mathrm{Score} = \sum_{word \in all\_word} num(word) \times weight(word)$$

where num() is the occurrence count of the keyword, weight() is the preset weight of the keyword, all_word is the set of all keywords in the current language type, and Score is the calculated weighted score.

S205: determining the language type with the highest score as the second recognition result;

It is to be understood that, after the score corresponding to each language type is obtained, the language type with the highest score may be determined as the second recognition result for the keywords.
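Steps S204 and S205 can be illustrated together. The sketch assumes that the weighted score of a language type is the sum, over its keywords, of occurrence count times preset weight; the function names `score_languages` and `second_recognition_result` and the treatment of an all-zero score as "no result" are assumptions made for the example.

```python
def score_languages(keyword_counts, weights):
    """Compute sum(num(word) * weight(word)) for each language type."""
    scores = {}
    for language, keyword_weights in weights.items():
        # Keywords without a weight under this language contribute nothing.
        scores[language] = sum(
            count * keyword_weights.get(word, 0.0)
            for word, count in keyword_counts.items()
        )
    return scores


def second_recognition_result(scores):
    """Return (language, score) with the highest score, or (None, 0.0)."""
    # Assumption: if no keyword scored at all, there is no keyword result.
    if not scores or max(scores.values()) <= 0:
        return None, 0.0
    best = max(scores, key=scores.get)
    return best, scores[best]
```

Returning the score together with the language type keeps the value available for the later comparison against the preset threshold.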

S206: and determining the text type of the file to be detected by combining the first recognition result and the second recognition result.

In the embodiment of the application, if the first recognition result is the same as the second recognition result, the text type corresponding to either result can be directly determined as the text type of the file to be detected. If the first recognition result differs from the second recognition result, it is judged whether the weighted score obtained for the second recognition result is greater than or equal to a preset threshold. If the weighted score is greater than or equal to the preset threshold, the text type corresponding to the second recognition result is determined as the text type of the file to be detected; if the weighted score is smaller than the preset threshold, the text type corresponding to the first recognition result is determined as the text type of the file to be detected. The preset threshold can be set according to the specific scenario during implementation and is not limited herein. A weighted score below the preset threshold indicates that the keyword-based recognition result has low accuracy and may be a misjudgment, so the recognition result for the suffix name is used as the final recognition result. A weighted score greater than or equal to the preset threshold indicates that the keyword-based recognition result has high accuracy, so the recognition result for the keywords can be used directly as the final recognition result.
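The decision rule above, together with the fallback described for the case in which the suffix name yields no result, can be sketched as a single function. The name `determine_text_type` and the use of `None` for a missing result are assumptions; returning the suffix result when only the keyword result is missing is also an assumption, since the application does not address that case explicitly.

```python
def determine_text_type(first_result, second_result, weighted_score, threshold):
    """Combine the suffix-based and keyword-based recognition results."""
    if first_result is None:
        # Suffix not recognized: fall back to the keyword result. If that is
        # None as well, recognition has failed and manual review is needed.
        return second_result
    if second_result is None or first_result == second_result:
        # Assumption: with no keyword result, the suffix result is kept.
        return first_result
    # Results differ: trust the keywords only when their score is high enough.
    return second_result if weighted_score >= threshold else first_result
```

For example, a PHP script renamed to carry an ".asp" suffix is still reported as PHP provided its keyword score reaches the threshold.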

A text type recognition apparatus provided by an embodiment of the present application is introduced below. The text type recognition apparatus described below and the text type recognition method described above may be read with reference to each other.

Referring to fig. 3, a text type recognition apparatus provided in an embodiment of the present application includes:

the file acquisition module 301 is used for acquiring a file to be detected;

the first identification module 302 is configured to identify a suffix name of the file to be detected, so as to obtain a first identification result;

the second identification module 303 is configured to identify keywords in the text content of the file to be detected, so as to obtain a second identification result;

and a type determining module 304, configured to determine a text type of the file to be detected by combining the first recognition result and the second recognition result.

On the basis of the foregoing embodiment, as a preferred implementation manner, the first identification module 302 includes:

the name acquisition unit is used for acquiring the file name of the file to be detected;

a suffix name unit, which is used for extracting the suffix name of the file to be detected from the file name;

and the character string searching unit is used for searching whether a character string which is the same as the suffix name exists in a preset character string set or not to obtain a first identification result aiming at the suffix name.

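On the basis of the foregoing embodiment, as a preferred implementation manner, the second identification module 303 includes:

the keyword extraction unit is used for reading the text content of the file to be detected and extracting keywords from the text content by utilizing multi-pattern matching;

the score statistic unit is used for analyzing and counting each keyword to obtain a score corresponding to each language type;

and the result determining unit is used for determining the language type with the highest score as the second recognition result.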

On the basis of the foregoing embodiment, as a preferred implementation, the text type recognition apparatus may further include:

the keyword collection module is used for collecting keywords corresponding to each language type to obtain a preset keyword set;

and the weighting calculation module is used for obtaining the corresponding preset weight of each preset keyword under each language type by counting the occurrence probability of each preset keyword in the preset keyword set or performing weighting calculation on each preset keyword by utilizing machine learning.

On the basis of the foregoing embodiment, as a preferred implementation manner, the score statistic unit includes:

the weight obtaining subunit is used for counting the occurrence frequency of each keyword and obtaining the corresponding preset weight of each keyword under each language type;

and the score calculating subunit is used for calculating a weighted score corresponding to each language type according to the occurrence times and the preset weight.

On the basis of the foregoing embodiment, as a preferred implementation manner, the type determining module 304 includes:

the first determining unit is used for directly determining the first recognition result or the second recognition result as the text type of the file to be detected if the first recognition result is the same as the second recognition result;

the score judging module is used for judging whether the weighted score is greater than or equal to a preset threshold value or not if the first recognition result is different from the second recognition result;

the second determining unit is used for determining the second recognition result as the text type of the file to be detected if the weighted score is greater than or equal to a preset threshold value;

and the third determining unit is used for determining the first recognition result as the text type of the file to be detected if the weighted score is smaller than a preset threshold value.

The present application further provides an electronic device, and as shown in fig. 4, an electronic device provided in an embodiment of the present application includes:

a memory 100 for storing a computer program;

the processor 200, when executing the computer program, may implement the steps provided by the above embodiments.

Specifically, the memory 100 includes a nonvolatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and computer-readable instructions, and the internal memory provides an environment for the operating system and the computer-readable instructions in the non-volatile storage medium to run. The processor 200 may be a Central Processing Unit (CPU), a controller, a microcontroller, a microprocessor or other data Processing chip in some embodiments, and provides computing and controlling capability for the electronic device, and when executing the computer program stored in the memory 100, the steps of the text type identification method disclosed in any of the foregoing embodiments may be implemented.

On the basis of the above embodiment, as a preferred implementation, referring to fig. 5, the electronic device further includes:

and an input interface 300 connected to the processor 200, for acquiring computer programs, parameters and instructions imported from the outside, and storing the computer programs, parameters and instructions into the memory 100 under the control of the processor 200. The input interface 300 may be connected to an input device for receiving parameters or instructions manually input by a user. The input device may be a touch layer covered on a display screen, or a button, a track ball or a touch pad arranged on a terminal shell, or a keyboard, a touch pad or a mouse, etc.

And a display unit 400 connected to the processor 200 for displaying data processed by the processor 200 and for displaying a visualized user interface. The display unit 400 may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an OLED (Organic Light-Emitting Diode) touch panel, or the like.

And a network port 500, connected to the processor 200, for establishing communication connections with external terminal devices. The communication technology used may be a wired or a wireless communication technology, such as Mobile High-Definition Link (MHL), Universal Serial Bus (USB), High-Definition Multimedia Interface (HDMI), Wireless Fidelity (WiFi), Bluetooth, Bluetooth Low Energy, or an IEEE 802.11s-based communication technology.

While FIG. 5 shows only an electronic device having the components 100 to 500, those skilled in the art will appreciate that the configuration shown in FIG. 5 does not constitute a limitation of the electronic device, which may include fewer or more components than shown, combine some components, or use a different arrangement of components.

The present application also provides a computer-readable storage medium, which may include various media capable of storing program code, such as a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk. The storage medium has stored thereon a computer program which, when executed by a processor, carries out the steps of the text type recognition method disclosed in any one of the preceding embodiments.

When the text type is identified, not only the suffix name of the file to be detected is identified, but also the keyword in the text content of the file is identified, and the text type of the file to be detected is obtained by combining the identification result of the suffix name and the identification result of the keyword, so that the problem of misjudgment possibly caused by identification only through the suffix name is avoided, and the identification accuracy is improved.

The embodiments are described in a progressive manner in the specification, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other. For the system disclosed by the embodiment, the description is relatively simple because the system corresponds to the method disclosed by the embodiment, and the relevant points can be referred to the method part for description. It should be noted that, for those skilled in the art, it is possible to make several improvements and modifications to the present application without departing from the principle of the present application, and such improvements and modifications also fall within the scope of the claims of the present application.

It is further noted that, in the present specification, relational terms such as first and second, and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.

Claims

1. A text type recognition method, comprising:

acquiring a file to be detected;

2. The text type recognition method according to claim 1, wherein recognizing the suffix name of the file to be detected to obtain a first recognition result comprises:

acquiring the file name of the file to be detected;

extracting a suffix name of the file to be detected from the file name;

3. The text type recognition method according to claim 1, wherein recognizing the keywords in the text content of the file to be detected to obtain a second recognition result comprises:

4. The text type recognition method according to claim 3, further comprising:

5. The text type recognition method of claim 4, wherein analyzing and counting each keyword to obtain a score corresponding to each language type comprises:

6. The text type recognition method according to claim 5, wherein the determining the text type of the file to be detected by combining the first recognition result and the second recognition result comprises:

7. A text type recognition apparatus, comprising:

the file acquisition module is used for acquiring a file to be detected;

8. The text type recognition apparatus according to claim 7, wherein the second recognition module comprises:

9. An electronic device, comprising:

a memory for storing a computer program;

a processor for implementing the steps of the text type recognition method according to any one of claims 1 to 6 when executing the computer program.

10. A computer-readable storage medium, characterized in that a computer program is stored on the computer-readable storage medium, which computer program, when being executed by a processor, carries out the steps of the text type recognition method according to one of claims 1 to 6.