CN113869041A - Keyword combination extraction method and device and electronic equipment

Info

Publication number
CN113869041A
CN113869041A CN202010619049.0A CN202010619049A CN113869041A CN 113869041 A CN113869041 A CN 113869041A CN 202010619049 A CN202010619049 A CN 202010619049A CN 113869041 A CN113869041 A CN 113869041A
Authority
CN
China
Prior art keywords
text
recognized
row
dimensional
matrix
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010619049.0A
Other languages
Chinese (zh)
Inventor
杜雪涛
杜刚
朱艳云
张晨
胡入祯
叶剑飞
戴晶
周宇飞
邵妍
常潇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Mobile Communications Group Co Ltd
China Mobile Group Design Institute Co Ltd
Original Assignee
China Mobile Communications Group Co Ltd
China Mobile Group Design Institute Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Mobile Communications Group Co Ltd and China Mobile Group Design Institute Co Ltd
Priority to CN202010619049.0A
Publication of CN113869041A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G06F16/355 Class or cluster creation or modification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

An embodiment of the invention provides a keyword combination extraction method, a keyword combination extraction device and electronic equipment. The method comprises: determining a text to be recognized; performing word segmentation on the text to be recognized and cyclically shifting the word segmentation result to obtain a two-dimensional augmentation matrix of the text to be recognized; and determining the keyword combination of the text to be recognized based on that two-dimensional augmentation matrix. With the method, device and electronic equipment provided by the embodiment, the extracted keyword combination can comprehensively reflect the content characteristics of the text to be recognized, which improves the accuracy of spam (junk information) recognition.

Description

Keyword combination extraction method and device and electronic equipment
Technical Field
The invention relates to the technical field of information security, in particular to a keyword combination extraction method and device and electronic equipment.
Background
The rapid development of mobile communication technology and internet technology has brought convenient services to users. At the same time, spam (junk information) delivered through short messages, multimedia messages, application push notifications and other channels causes users considerable trouble.
In the prior art, spam is usually identified by extracting keyword combinations from the information text. Existing keyword combination extraction methods can only extract combinations of keywords that lie close to one another in the text, so the extracted keyword combinations rarely reflect the content characteristics of the text comprehensively, and the accuracy of spam identification is low.
Disclosure of Invention
The embodiments of the invention provide a keyword combination extraction method, a keyword combination extraction device and electronic equipment, to solve the problems that keyword combinations extracted by existing methods rarely reflect the content characteristics of an information text comprehensively and that the identification accuracy of spam is consequently low.
In a first aspect, an embodiment of the present invention provides a keyword combination extraction method, including:
determining a text to be recognized;
performing word segmentation on the text to be recognized, and performing cyclic shift on a word segmentation result to obtain a two-dimensional augmentation matrix of the text to be recognized;
and determining the keyword combination of the text to be recognized based on the two-dimensional augmentation matrix of the text to be recognized.
Optionally, the performing cyclic shift on the word segmentation result to obtain a two-dimensional augmentation matrix of the text to be recognized includes:
cyclically shifting the row elements of the row preceding the current row, based on the row sequence number of the current row of the two-dimensional augmentation matrix, to obtain the row elements of the current row;
taking the row following the current row as the new current row;
wherein the word segmentation result forms the row elements of any one row of the two-dimensional augmentation matrix.
Optionally, the cyclically shifting of the row elements of the preceding row based on the row sequence number of the current row of the two-dimensional augmentation matrix to obtain the row elements of the current row includes:
determining a shift direction and an offset corresponding to the current row of the two-dimensional augmentation matrix based on the row sequence number of the current row;
and cyclically shifting the row elements of the row preceding the current row based on the shift direction and the offset to determine the row elements of the current row.
Optionally, the determining a keyword combination of the text to be recognized based on the two-dimensional augmentation matrix of the text to be recognized includes:
inputting the two-dimensional augmentation matrix of the text to be recognized into a text classification model to obtain a text classification result output by the text classification model; the text classification model is obtained by training based on a two-dimensional augmentation matrix of a sample text and a sample text classification result corresponding to the two-dimensional augmentation matrix;
and determining the keyword combination of the text to be recognized based on the text classification result.
Optionally, the inputting the two-dimensional augmentation matrix of the text to be recognized into a text classification model to obtain a text classification result output by the text classification model specifically includes:
inputting the two-dimensional augmentation matrix of the text to be recognized into a combination extraction layer of the text classification model to obtain a plurality of word segmentation combinations output by the combination extraction layer;
and inputting each word segmentation combination to an identification classification layer of the text classification model to obtain a text classification result output by the identification classification layer.
Optionally, the inputting the two-dimensional augmentation matrix of the text to be recognized into a combination extraction layer of the text classification model to obtain a plurality of word segmentation combinations output by the combination extraction layer specifically includes:
inputting the two-dimensional augmentation matrix of the text to be recognized into the combination extraction layer, which samples the matrix based on a combination length to obtain the plurality of word segmentation combinations output by the combination extraction layer.
Optionally, the determining, based on the text classification result, a keyword combination of the text to be recognized specifically includes:
and if the text classification result is abnormal, determining the keyword combination of the text to be recognized based on the activation value of each word segmentation combination in the text classification model.
Optionally, the number of rows of the two-dimensional augmentation matrix is:
[Formula image BDA0002562390340000031, not reproduced in this text]
where R is the number of rows of the two-dimensional augmentation matrix and C is the preset number of columns of the two-dimensional augmentation matrix.
In a second aspect, an embodiment of the present invention provides a keyword combination extraction apparatus, including:
the text determining unit is used for determining a text to be recognized;
the matrix determining unit is used for segmenting the text to be recognized and circularly shifting a segmentation result to obtain a two-dimensional augmentation matrix of the text to be recognized;
and the combination extraction unit is used for determining the keyword combination of the text to be recognized based on the two-dimensional augmentation matrix of the text to be recognized.
In a third aspect, an embodiment of the present invention provides an electronic device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor implements the steps of the keyword combination extraction method according to the first aspect when executing the computer program.
In a fourth aspect, an embodiment of the present invention provides a non-transitory computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements the steps of the keyword combination extraction method according to the first aspect.
According to the keyword combination extraction method, device and electronic equipment provided by the embodiments of the invention, the text to be recognized is segmented and the segmentation result is cyclically shifted to obtain the two-dimensional augmentation matrix of the text to be recognized. All word segmentation combinations can be extracted from this matrix, so the possible combinations are exhausted. On this basis, the extracted keyword combinations can comprehensively reflect the content characteristics of the text to be recognized, which improves the accuracy of spam recognition.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and those skilled in the art can also obtain other drawings according to the drawings without creative efforts.
Fig. 1 is a schematic flow chart of a keyword combination extraction method according to an embodiment of the present invention;
Fig. 2 is a schematic flow chart of determining a text classification result according to an embodiment of the present invention;
Fig. 3 is a schematic flow chart of generating a two-dimensional augmentation matrix according to an embodiment of the present invention;
Fig. 4 is a schematic structural diagram of a keyword combination extraction apparatus according to an embodiment of the present invention;
Fig. 5 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
A keyword combination is a combination of several keywords, also called a keyword combination strategy: when an information text contains every keyword of a given keyword combination at the same time, the text is considered to match that strategy. When the keywords are words chosen to identify spam, an information text that matches the keyword combination is treated as spam. For example, suppose the keyword combination for identifying spam is "telephone" and "fixed price", and a short message reads: "Hello! I am a sales consultant for a certain property development. Telephone me to ask about the property and enjoy a fixed price. Calls are welcome." The short message contains both keywords, "telephone" and "fixed price", so it matches the keyword combination and is identified as spam.
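The matching rule described above can be illustrated with a minimal sketch. It assumes the information text has already been segmented into words. The function name `matches_combination`, the English token list standing in for the segmented short message, and the gloss "fixed price" for the second keyword are illustrative assumptions, not part of the embodiment.

```python
def matches_combination(tokens, combination):
    """Return True when every keyword of the combination occurs in the tokens."""
    return set(combination) <= set(tokens)


# Hypothetical segmentation of the example short message (English glosses).
sms_tokens = ["hello", "sales consultant", "property", "telephone",
              "enjoy", "fixed price", "calls", "welcome"]
spam_combination = ("telephone", "fixed price")

# The message contains both keywords, so it matches the combination.
print(matches_combination(sms_tokens, spam_combination))
# A text with only one of the keywords does not match.
print(matches_combination(["hello", "telephone"], spam_combination))
```

Because the check is a subset test, the distance between the keywords in the text plays no role; only their joint presence matters.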
In the prior art, a one-dimensional convolutional neural network or a recurrent neural network is generally used to extract keyword combinations. Take the message "Cosmetics storewide 20% off, only on the 1st" as an example. When this message is treated as a word sequence, a one-dimensional convolutional neural network can only capture combinations of adjacent words, such as "cosmetics & storewide", and a recurrent neural network can only capture combinations of words that are close together, such as "cosmetics & 20% off". Neither network can effectively extract combinations of words that are far apart, such as "cosmetics & the 1st". The longer the information text, the more pronounced this defect becomes: the extracted keyword combinations fail to reflect the content characteristics of the text comprehensively, which seriously affects the accuracy of spam identification.
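To make this limitation concrete, the sketch below compares the word pairs visible to a window of two adjacent words, used here only as a stand-in for the one-dimensional convolution, with all word pairs in the example message. The English glosses of the five words and the window width of 2 are assumptions chosen for illustration.

```python
from itertools import combinations

# English glosses of the five words of the example message, in order.
tokens = ["cosmetics", "storewide", "20% off", "only", "the 1st"]

# Pairs seen by a width-2 window sliding over the word sequence.
adjacent_pairs = {frozenset(pair) for pair in zip(tokens, tokens[1:])}
# Every unordered pair of words in the message.
all_pairs = {frozenset(pair) for pair in combinations(tokens, 2)}

# Pairs that the adjacent-word window never sees.
missed = all_pairs - adjacent_pairs
print(len(adjacent_pairs), len(all_pairs), len(missed))
print(frozenset({"cosmetics", "the 1st"}) in missed)
```

With five words, the window sees 4 of the 10 possible pairs, and the pair formed by the first and last words is among those it misses.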
To overcome the above defects of the prior art, an embodiment of the present invention provides a keyword combination extraction method. Fig. 1 is a schematic flow chart of the method; as shown in Fig. 1, the method includes:
Step 110, determining a text to be recognized;
Specifically, the text to be recognized is an information text that needs to be recognized and classified, so that a keyword combination for identifying spam can be extracted from it.
The text to be recognized can be a text directly acquired from a short message, a multimedia message, a mail, a microblog or an application program (App) push notification and the like, and can also be a text obtained by recognizing and converting voice information in voice chat software. The text to be recognized can be obtained through a mobile terminal or a Personal Computer (PC) terminal. The embodiment of the invention does not specifically limit the acquisition source and the acquisition path of the text to be recognized.
The text to be recognized can be a short message text with a small number of characters or a mail text with a large number of characters.
For example, the text to be recognized may be the short message text "Cosmetics storewide 20% off, only on the 1st" acquired by a mobile terminal.
Step 120, performing word segmentation on the text to be recognized, and performing cyclic shift on a word segmentation result to obtain a two-dimensional augmentation matrix of the text to be recognized;
Specifically, the text to be recognized may be segmented by a dictionary-based word segmentation method, a statistics-based word segmentation method, a machine-learning word segmentation method, or the like; the embodiment of the present invention does not specifically limit the method.
For example, segmenting the text to be recognized "Cosmetics storewide 20% off, only on the 1st" yields 5 words: "cosmetics", "storewide", "20% off", "only" and "the 1st".
Cyclically shifting the word segmentation result yields the same words in different orders. For example, cyclically shifting the segmentation result of "Cosmetics storewide 20% off, only on the 1st" gives several shifted results, such as "cosmetics, storewide, 20% off, only, the 1st" and "the 1st, cosmetics, storewide, 20% off, only".
Each cyclically shifted word segmentation result is then used as one row of a matrix, giving the two-dimensional augmentation matrix of the text to be recognized. The words appear in a different order in each row of the matrix.
And step 130, determining a keyword combination of the text to be recognized based on the two-dimensional augmentation matrix of the text to be recognized.
Specifically, every row element or column element of the two-dimensional augmentation matrix of the text to be recognized is a word from the segmented text. Because the words appear in a different order in each row of the matrix, the sequence of words along each column also differs from column to column.
In the two-dimensional augmentation matrix of the text to be recognized, several adjacent row elements or several adjacent column elements can be combined to obtain different word segmentation combinations. That is, all word segmentation combinations of the text to be recognized can be extracted from the two-dimensional augmentation matrix, and by screening the obtained combinations it can be judged whether the text to be recognized contains a preset keyword combination.
According to the keyword combination extraction method provided by the embodiment of the invention, the text to be recognized is segmented and the segmentation result is cyclically shifted to obtain the two-dimensional augmentation matrix of the text to be recognized. All word segmentation combinations can be extracted from this matrix, so the possible combinations are exhausted. On this basis, the extracted keyword combinations can comprehensively reflect the content characteristics of the text to be recognized, which improves the accuracy of spam recognition.
Based on the above embodiment, step 120 includes:
cyclically shifting the row elements of the row preceding the current row, based on the row sequence number of the current row of the two-dimensional augmentation matrix, to obtain the row elements of the current row;
taking the row following the current row as the new current row;
wherein the word segmentation result forms the row elements of any one row of the two-dimensional augmentation matrix.
Specifically, the word segmentation result is used as the row elements of any one row of the two-dimensional augmentation matrix, so that every element of every row of the matrix is a word from the segmented text to be recognized.
Preferably, the word segmentation result of the text to be recognized is used as the row elements of the first row of the two-dimensional augmentation matrix.
For the current row of the two-dimensional augmentation matrix, the cyclic shift mode is determined from the row sequence number of the current row, and the row elements of the preceding row are cyclically shifted in that mode to obtain the row elements of the current row.
After the row elements of the current row are determined, the next row becomes the current row and the cyclic shift is repeated, so that the row elements of the two-dimensional augmentation matrix are determined row by row.
Based on any of the above embodiments, performing cyclic shift on the row elements of the previous row of the current row based on the row sequence number of the current row of the two-dimensional augmented matrix to obtain the row elements of the current row, including:
determining a shift direction and an offset corresponding to a current row based on a row sequence number of the current row of the two-dimensional augmentation matrix;
and cyclically shifting the row elements of the row preceding the current row based on the shift direction and the offset to determine the row elements of the current row.
Specifically, the cyclic shift mode, that is, the shift direction and the offset corresponding to the current row, is determined from the row sequence number of the current row of the two-dimensional augmentation matrix.
Suppose the row sequence number of the current row of the two-dimensional augmentation matrix is i. When i is even, the shift direction of the current row is left and the offset is i-1, that is, the row elements of the preceding row are cyclically shifted i-1 positions to the left; when i is odd, the shift direction is right and the offset is i-1, that is, the row elements of the preceding row are cyclically shifted i-1 positions to the right. The cyclic shift is repeated until all rows of the two-dimensional augmentation matrix have been generated.
For example, with the number of rows and the number of columns both set to 5, cyclically shifting the word segmentation result of the text to be recognized "Cosmetics storewide 20% off, only on the 1st" by row sequence number in this way gives the two-dimensional augmentation matrix shown in Table 1:
Table 1. Two-dimensional augmentation matrix
         Column 1    Column 2    Column 3    Column 4    Column 5
Row 1    cosmetics   storewide   20% off     only        the 1st
Row 2    storewide   20% off     only        the 1st     cosmetics
Row 3    the 1st     cosmetics   storewide   20% off     only
Row 4    20% off     only        the 1st     cosmetics   storewide
Row 5    only        the 1st     cosmetics   storewide   20% off
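The row-by-row shifting rule can be sketched as follows. This is a minimal illustration, assuming the segmentation result is placed in row 1 and each later row i is obtained from the preceding row by shifting i-1 positions, to the left when i is even and to the right when i is odd. The function name `build_augmentation_matrix` and the English glosses of the five words are assumptions of this sketch.

```python
from collections import deque


def build_augmentation_matrix(tokens, num_rows):
    """Build the matrix row by row; row 1 is the segmentation result itself."""
    rows = [list(tokens)]
    for i in range(2, num_rows + 1):  # i is the 1-based row sequence number
        offset = i - 1
        row = deque(rows[-1])  # start from the preceding row
        # deque.rotate(n) shifts right for n > 0 and left for n < 0.
        row.rotate(-offset if i % 2 == 0 else offset)
        rows.append(list(row))
    return rows


tokens = ["cosmetics", "storewide", "20% off", "only", "the 1st"]
matrix = build_augmentation_matrix(tokens, num_rows=5)
for row in matrix:
    print(row)
```

For the five-word example with 5 rows, the rows produced by this sketch follow the same order as the rows of Table 1.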
Based on any of the above embodiments, step 130 includes:
inputting a two-dimensional augmentation matrix of a text to be recognized into a text classification model to obtain a text classification result output by the text classification model; the text classification model is obtained by training based on a two-dimensional augmentation matrix of the sample text and a corresponding sample text classification result;
and determining the keyword combination of the text to be recognized based on the text classification result.
Specifically, Fig. 2 is a schematic flow chart of determining a text classification result according to an embodiment of the present invention. As shown in Fig. 2, the text classification model classifies the input two-dimensional augmentation matrix of the text to be recognized, and the text classification result it outputs indicates whether the text to be recognized is spam. The text classification result is either normal or abnormal: if it is normal, the text to be recognized is not spam; if it is abnormal, the text to be recognized is spam.
If the text to be recognized is not spam, the process ends and no word segmentation combination is evaluated or extracted. If the text to be recognized is spam, each word segmentation combination in the text is evaluated, and the combinations most likely to represent the content characteristics of the spam are extracted as keyword combinations for identifying spam.
Before step 130 is executed, the text classification model may be obtained by training in advance, for example as follows: first, the two-dimensional augmentation matrices of a large number of sample texts are collected, and the sample text classification result corresponding to each sample text is labeled manually; then, an initial model is trained on the two-dimensional augmentation matrices of the sample texts and the corresponding sample text classification results to obtain the text classification model.
The initial model may be a convolutional neural network, and the selection of the initial model is not particularly limited in the embodiment of the present invention.
Based on any of the above embodiments, the two-dimensional augmentation matrix of the text to be recognized is input to the text classification model, and a text classification result output by the text classification model is obtained, which specifically includes:
inputting the two-dimensional augmentation matrix of the text to be recognized into a combination extraction layer of the text classification model to obtain a plurality of word segmentation combinations output by the combination extraction layer;
and inputting each word segmentation combination into a recognition classification layer of the text classification model to obtain the text classification result output by the recognition classification layer.
Specifically, the text classification model comprises a combination extraction layer and a recognition classification layer.
The combination extraction layer extracts a plurality of word segmentation combinations from the two-dimensional augmentation matrix of the text to be recognized. Inputting the matrix into the combination extraction layer of the text classification model yields the word segmentation combinations output by that layer, for example "cosmetics & storewide", "cosmetics & storewide & 20% off" and "storewide & 20% off".
The recognition classification layer recognizes each word segmentation combination, makes a classification judgment on the text to be recognized, and outputs the corresponding text classification result. For example, if the word segmentation combination "storewide & 20% off" is input to the recognition classification layer and the text classification result output is abnormal, the text to be recognized "Cosmetics storewide 20% off, only on the 1st" is spam.
The recognition classification layer can adopt a weighted average method to perform classification judgment on the text to be recognized, and the embodiment of the invention does not specifically limit the recognition judgment method of the recognition classification layer.
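As a rough illustration of a weighted-average judgment, the sketch below averages one score per word segmentation combination and compares the result with a threshold. The text only states that a weighted average method may be used; the per-combination scores, the weights, the threshold of 0.5 and the function name `classify_by_weighted_average` are all hypothetical.

```python
def classify_by_weighted_average(scores, weights, threshold=0.5):
    """Return ("abnormal" or "normal", weighted average of the scores)."""
    total_weight = sum(weights)
    average = sum(s * w for s, w in zip(scores, weights)) / total_weight
    label = "abnormal" if average >= threshold else "normal"
    return label, average


# Hypothetical spam scores of three word segmentation combinations
# and hypothetical weights expressing how much each one counts.
scores = [0.9, 0.2, 0.7]
weights = [2.0, 1.0, 1.0]

label, average = classify_by_weighted_average(scores, weights)
print(label, round(average, 3))
```

In a trained model the scores and weights would come from learned parameters rather than from fixed numbers as in this sketch.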
Based on any of the above embodiments, inputting the two-dimensional augmentation matrix of the text to be recognized into the combination extraction layer of the text classification model to obtain a plurality of word segmentation combinations output by the combination extraction layer specifically includes:
inputting the two-dimensional augmentation matrix of the text to be recognized into the combination extraction layer, which samples the matrix based on a combination length to obtain the plurality of word segmentation combinations output by the combination extraction layer.
Specifically, the combination length is the number of words in each combination.
For example, suppose the text classification model uses a convolutional neural network and the combination length is m, so that the convolution window is a rectangle of m rows and one column. The two-dimensional augmentation matrix of the text to be recognized is input into the combination extraction layer, which samples it with combination length m = 2; the word segmentation combinations output by the combination extraction layer are shown in Table 2:
Table 2. Word segmentation combinations
No.   Word segmentation combination    No.   Word segmentation combination
1     cosmetics & storewide            6     storewide & the 1st
2     storewide & 20% off              7     20% off & cosmetics
3     20% off & only                   8     only & storewide
4     only & the 1st                   9     the 1st & 20% off
5     the 1st & cosmetics              10    cosmetics & only
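The sampling with a window of m rows and one column can be sketched as follows. The sketch slides the window down every column of the example matrix and keeps each combination once. Treating a combination as unordered and removing duplicates are assumptions of this sketch, chosen because they give the 10 entries listed in Table 2; the function name `sample_combinations` and the English glosses are also illustrative.

```python
from itertools import combinations

# The 5 x 5 example matrix, written with English glosses of the five words.
matrix = [
    ["cosmetics", "storewide", "20% off", "only", "the 1st"],
    ["storewide", "20% off", "only", "the 1st", "cosmetics"],
    ["the 1st", "cosmetics", "storewide", "20% off", "only"],
    ["20% off", "only", "the 1st", "cosmetics", "storewide"],
    ["only", "the 1st", "cosmetics", "storewide", "20% off"],
]


def sample_combinations(matrix, m):
    """Slide an m-row, one-column window down every column; keep unique windows."""
    unique = {}
    for col in range(len(matrix[0])):
        for top in range(len(matrix) - m + 1):
            window = tuple(matrix[top + k][col] for k in range(m))
            # Assumption: word order inside a combination does not matter.
            unique.setdefault(frozenset(window), window)
    return list(unique.values())


pairs = sample_combinations(matrix, m=2)
for pair in pairs:
    print(" & ".join(pair))

# For this example, every possible pair of the five words is found.
every_pair = {frozenset(p) for p in combinations(matrix[0], 2)}
print({frozenset(p) for p in pairs} == every_pair)
```

Although the window only ever covers vertically adjacent cells, the differing shifts of the rows bring distant words of the original text next to each other, for example "the 1st" directly above "cosmetics" in rows 3 and 4 of column 3.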
When storage space is allocated for the word segmentation combinations, combinations generated from the matrix share the storage of the words they have in common. For example, the two combinations "cosmetics & storewide" and "storewide & the 1st" share the storage space of "storewide" in the matrix.
The keyword combination extraction method provided by the embodiment of the invention allocates storage space for the words of the text to be recognized based on the two-dimensional augmentation matrix, thereby compressing the storage space required.
Based on any of the above embodiments, determining a keyword combination of a text to be recognized based on a text classification result specifically includes:
and if the text classification result is abnormal, determining the keyword combination of the text to be recognized based on the activation value of each word segmentation combination in the text classification model.
Specifically, if the text classification result is abnormal, it indicates that the text to be recognized is spam, and it is necessary to determine and extract each segmentation combination in the text to be recognized, extract the segmentation combination that most probably represents spam content features in the segmentation combination, and use the extracted segmentation combination as a keyword combination for recognizing spam.
The activation value of each participle combination in the text classification model can be used for judging whether the participle combination represents the content characteristics of the spam information. The higher the activation value of each participle combination is, the higher the possibility that the participle combination represents the content characteristics of the spam is; the lower the activation value of each participle combination, the less likely that the participle combination represents a spam content feature.
When the text classification model is a convolutional neural network, the activation value may be selected from activation values of neurons in the convolutional neural network.
Each participle combination can be arranged according to the size of the corresponding activation value in a descending order, and a plurality of participle combinations with the top rank are selected as keyword combinations.
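The selection rule above can be sketched as follows. This is only an illustration under stated assumptions: the function name `extract_keyword_combinations`, the label string `"abnormal"`, the parameter `top_k` and the activation values are all hypothetical and are not taken from the embodiment.

```python
from typing import Dict, List, Tuple

Combination = Tuple[str, str]


def extract_keyword_combinations(
    classification: str,
    activations: Dict[Combination, float],
    top_k: int = 2,
) -> List[Combination]:
    """Return the top_k combinations by activation value, for abnormal texts only."""
    if classification != "abnormal":
        return []  # normal texts yield no keyword combinations
    # Sort the combinations by activation value in descending order.
    ranked = sorted(activations, key=activations.get, reverse=True)
    return ranked[:top_k]


# Illustrative activation values only; they are not taken from the patent.
example_activations = {
    ("Cosmetic preparation", "All over the field"): 0.12,
    ("All over the field", "Eight-fold"): 0.91,
    ("Eight-fold", "Only limit"): 0.77,
    ("Only limit", "Number one"): 0.30,
}

keywords = extract_keyword_combinations("abnormal", example_activations, top_k=2)
print(keywords)
```

How many top-ranked combinations to keep is left open by the description, so `top_k` is exposed as a parameter here.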
Further, for a plurality of selected keyword combinations, the optimal keyword combination can be selected according to a certain principle.
According to the keyword combination extraction method provided by the embodiment of the invention, the text classification result is divided into normal and abnormal texts, and the keyword combination extraction is performed on the abnormal texts in a targeted manner, so that the keyword combination of the normal texts is not extracted, and the interference of normal keywords is reduced.
Based on any of the above embodiments, the number of rows of the two-dimensional augmented matrix is:
Figure BDA0002562390340000101
where R is the number of rows of the two-dimensional augmentation matrix and C is the preset number of columns of the two-dimensional augmentation matrix.
Specifically, the size of the two-dimensional augmentation matrix of the text to be recognized is set according to actual conditions. Different two-dimensional augmentation matrix sizes can be set for different texts to be recognized, and the same two-dimensional augmentation matrix size can also be set for different texts to be recognized.
Depending on the actual situation, the number of columns of the two-dimensional augmentation matrix is preset to C, where C represents the maximum number of words of the text to be recognized. The number of rows of the two-dimensional augmentation matrix is then determined from the maximum number of words:
Figure BDA0002562390340000111
where $\lfloor \cdot \rfloor$ is the rounding-down (floor) operator.
For example, for the text to be recognized "eight full-field cosmetic folds, only limited to one", the word segmentation result contains 5 participles, so the number of columns of the two-dimensional augmentation matrix can be set to 5, and the number of rows is calculated to be 3 according to the above formula. The resulting two-dimensional augmentation matrix is shown in Table 3:
Table 3. Two-dimensional augmentation matrix

| | Column 1 | Column 2 | Column 3 | Column 4 | Column 5 |
|---|---|---|---|---|---|
| Row 1 | Cosmetic preparation | All over the field | Eight-fold | Only limit | Number one |
| Row 2 | All over the field | Eight-fold | Only limit | Number one | Cosmetic preparation |
| Row 3 | Number one | Cosmetic preparation | All over the field | Eight-fold | Only limit |
Sampling the two-dimensional augmentation matrix of the text to be recognized with the combination length m = 2 exhausts all possible two-participle combinations. This can be verified as follows.
If the text to be recognized is composed of n participles, it is denoted as the participle sequence $\{w_1, \dots, w_i, \dots, w_n\}$, where each participle is denoted $w_i$ and $i$ is the index of each participle, $i = 1, 2, \dots, n$. Extracting m participles from this participle set and combining them yields a participle combination sequence
Figure BDA0002562390340000114
where each participle combination is expressed as
Figure BDA0002562390340000115
and $j$ is the index of each participle within a combination, $j = 1, 2, \dots, m$.
According to the principle of permutations and combinations, the number of ways N in which m participles can be selected from the n participles of a text to be recognized is:
$$N = C_n^m = \frac{n!}{m!\,(n-m)!}$$
It can be seen that when m is 2, the number of possible word segmentation combinations is limited to
$$\frac{n(n-1)}{2}$$
As m continues to grow, the number of possible combinations grows explosively.
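To make this growth concrete, the number of ways to choose m participles from n can be evaluated with `math.comb` from the Python standard library (Python 3.8 or later). The value n = 20 is an arbitrary assumption chosen for illustration and does not come from the embodiment.

```python
from math import comb

n = 20  # assumed number of participles, for illustration only

# Number of m-participle combinations for a few combination lengths.
counts = {m: comb(n, m) for m in range(2, 6)}
for m, count in counts.items():
    print(f"m={m}: {count} combinations")
```

For this n, moving from m = 2 to m = 5 raises the count from 190 to 15504, which is why the combination length is kept small.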
As can be seen from the above formulas, Table 2 and Table 3, when the number of participles of the text is n = 5 and the combination length is m = 2, 10 different participle combinations can be extracted from the text to be recognized. Therefore, the constructed two-dimensional augmentation matrix exhausts all word segmentation combinations in the text to be recognized.
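As a worked check of this example, based only on Table 2, Table 3 and the combination count above: a window of 2 rows and one column can be placed on R - 1 = 2 pairs of adjacent rows and on C = 5 columns, which gives

```latex
(R - 1) \cdot C = 2 \cdot 5 = 10 = \binom{5}{2} = \frac{5 \cdot 4}{2}
```

Because the 10 sampled combinations in Table 2 are pairwise distinct, they coincide with all 10 possible two-participle combinations.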
In practical applications, for ease of operation, if the actual number of participles of a text to be recognized is greater than the maximum number of words, only as many participles as the maximum number of words are kept; if the actual number of participles is less than the maximum number of words, the word segmentation result is padded with vacancies until the number of participles equals the maximum number of words.
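A minimal sketch of this truncate-or-pad step is given below. The helper name `fit_to_max_words`, the empty string used as the vacancy placeholder, and the choice to keep the leading participles when truncating are all assumptions; the description does not say which participles are kept.

```python
from typing import List

PAD = ""  # assumed placeholder for a vacant position


def fit_to_max_words(tokens: List[str], max_words: int, pad: str = PAD) -> List[str]:
    """Truncate or pad a word segmentation result to exactly max_words entries."""
    kept = tokens[:max_words]                      # assumption: keep the leading participles
    return kept + [pad] * (max_words - len(kept))  # fill vacancies if the text is short


short = fit_to_max_words(["a", "b", "c"], 5)
long = fit_to_max_words(["a", "b", "c", "d", "e", "f"], 5)
print(short)
print(long)
```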
Fig. 3 is a schematic flow chart of generating a two-dimensional augmentation matrix according to an embodiment of the present invention. As shown in Fig. 3, first, the number of columns and the number of rows of the two-dimensional augmentation matrix are set according to the actual application conditions, and the word segmentation result of the text to be recognized is used as the row elements of the first row of the two-dimensional augmentation matrix. Second, the shift direction and the offset of the cyclic shift are determined according to the row sequence number of the current row of the two-dimensional augmentation matrix, and the row elements of the current row are determined by cyclically shifting the previous row. The next row is then updated to be the current row, and the row elements of each row are determined row by row until the two-dimensional augmentation matrix is generated.
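The flow above can be sketched in Python as follows. Two points are assumptions rather than statements of this section: the shift rule (even-numbered rows shift the previous row to the left, odd-numbered rows shift it to the right, with an offset equal to the row number minus one) and the row count `columns // 2 + 1`. Both were chosen only because they reproduce the worked example in Table 3 (5 columns, 3 rows); the row-count formula itself appears in the source as an image and is not reproduced here. The function name `build_augmentation_matrix` is hypothetical.

```python
from collections import deque
from typing import List


def build_augmentation_matrix(tokens: List[str]) -> List[List[str]]:
    """Build the matrix row by row via cyclic shifts (assumed shift rule, see lead-in)."""
    columns = len(tokens)
    rows = columns // 2 + 1           # assumed row count; gives 3 rows for 5 columns
    current = deque(tokens)
    matrix = [list(current)]          # row 1 is the word segmentation result itself
    for row_number in range(2, rows + 1):
        offset = row_number - 1       # assumed offset for this row
        # deque.rotate(k) shifts right for k > 0 and left for k < 0.
        current.rotate(-offset if row_number % 2 == 0 else offset)
        matrix.append(list(current))  # the shifted previous row becomes the current row
    return matrix


tokens = ["Cosmetic preparation", "All over the field", "Eight-fold", "Only limit", "Number one"]
matrix = build_augmentation_matrix(tokens)
for row in matrix:
    print(row)
```

Each row is derived from the previous row, which mirrors the "update the next row into the current row" loop of Fig. 3.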
According to the keyword combination extraction method provided by the embodiment of the invention, the text information can be retained to the maximum extent and the matrix space can be minimized by reasonably setting the column number and the row number of the two-dimensional augmentation matrix, the data storage space and the algorithm complexity are reduced, and the extraction speed of the keyword combination is greatly improved.
Based on any of the above embodiments, fig. 4 is a schematic structural diagram of a keyword combination extraction apparatus provided in an embodiment of the present invention, and as shown in fig. 4, the apparatus includes:
a text determining unit 410, configured to determine a text to be recognized;
the matrix determining unit 420 is configured to perform word segmentation on the text to be recognized, and perform cyclic shift on a segmentation result to obtain a two-dimensional augmentation matrix of the text to be recognized;
and the combination extraction unit 430 is configured to determine a keyword combination of the text to be recognized based on the two-dimensional augmentation matrix of the text to be recognized.
Specifically, the text determination unit 410 is configured to determine a text to be recognized. The text to be recognized is an information text which needs to be recognized and classified, so that a keyword combination is extracted from the text to be recognized and used for recognizing the junk information. The matrix determining unit 420 is configured to perform word segmentation on the text to be recognized, and perform cyclic shift on the word segmentation result to obtain a two-dimensional augmentation matrix of the text to be recognized. And circularly shifting the word segmentation result to obtain different word segmentation results. And taking each word segmentation result after cyclic shift as a row of elements of the matrix respectively to obtain a two-dimensional augmentation matrix of the text to be recognized. The combination extraction unit 430 is configured to determine a keyword combination of the text to be recognized based on the two-dimensional augmentation matrix of the text to be recognized.
The keyword combination extraction device provided by the embodiment of the invention can extract all the word combinations by performing word segmentation on the text to be recognized and performing cyclic shift on the word segmentation result to obtain the two-dimensional augmentation matrix of the text to be recognized, and can exhaust the possibility of various word combinations.
Based on the above embodiment, the matrix determination unit 420 includes:
a current row determining subunit, configured to perform cyclic shift on row elements in a previous row of a current row based on a row sequence number of the current row of the two-dimensional augmented matrix to obtain row elements of the current row;
the updating subunit is used for updating the next row of the current row to be the current row;
wherein, the word segmentation result is the row element of any row in the two-dimensional augmentation matrix.
Based on any of the embodiments above, the current row determination subunit is specifically configured to:
determining a shift direction and an offset corresponding to a current row based on a row sequence number of the current row of the two-dimensional augmentation matrix;
and cyclically shifting the row elements of the previous row of the current row based on the shift direction and the offset, and determining the row elements of the current row.
Based on any of the above embodiments, the combination extraction unit 430 includes:
the text classification subunit is used for inputting the two-dimensional augmentation matrix of the text to be recognized into the text classification model to obtain a text classification result output by the text classification model; the text classification model is obtained by training based on a two-dimensional augmentation matrix of the sample text and a corresponding sample text classification result;
and the keyword combination extraction subunit is used for determining the keyword combination of the text to be recognized based on the text classification result.
Based on any of the above embodiments, the text classification subunit includes:
the combination extraction module is used for inputting the two-dimensional augmentation matrix of the text to be recognized into a combination extraction layer of the text classification model to obtain a plurality of word segmentation combinations output by the combination extraction layer;
and the recognition and classification module is used for inputting each word segmentation combination to a recognition and classification layer of the text classification model to obtain a text classification result output by the recognition and classification layer.
Based on any of the above embodiments, the combination extraction module is specifically configured to:
and inputting the two-dimensional augmentation matrix of the text to be recognized into the combination extraction layer, and sampling the two-dimensional augmentation matrix of the text to be recognized by the combination extraction layer based on the combination length to obtain a plurality of word segmentation combinations output by the combination extraction layer.
Based on any of the above embodiments, the keyword combination extraction subunit is specifically configured to:
and if the text classification result is abnormal, determining the keyword combination of the text to be recognized based on the activation value of each word segmentation combination in the text classification model.
Based on any of the above embodiments, the number of rows of the two-dimensional augmented matrix is:
Figure BDA0002562390340000141
where R is the number of rows of the two-dimensional augmentation matrix and C is the preset number of columns of the two-dimensional augmentation matrix.
Fig. 5 is a schematic structural diagram of an electronic device according to an embodiment of the present invention, and as shown in fig. 5, the electronic device may include: a processor (processor)510, a communication Interface (Communications Interface)520, a memory (memory)530 and a communication bus 540, wherein the processor 510, the communication Interface 520 and the memory 530 communicate with each other via the communication bus 540. Processor 510 may call logical commands in memory 530 to perform the following method:
determining a text to be recognized; performing word segmentation on the text to be recognized, and performing cyclic shift on a word segmentation result to obtain a two-dimensional augmentation matrix of the text to be recognized; and determining the keyword combination of the text to be recognized based on the two-dimensional augmentation matrix of the text to be recognized.
In addition, the logic commands in the memory 530 may be implemented in the form of software functional units and stored in a computer readable storage medium when the logic commands are sold or used as independent products. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes a plurality of commands for enabling a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
Embodiments of the present invention further provide a non-transitory computer-readable storage medium, on which a computer program is stored, where the computer program is implemented to perform the method provided in the foregoing embodiments when executed by a processor, and the method includes:
determining a text to be recognized; performing word segmentation on the text to be recognized, and performing cyclic shift on a word segmentation result to obtain a two-dimensional augmentation matrix of the text to be recognized; and determining the keyword combination of the text to be recognized based on the two-dimensional augmentation matrix of the text to be recognized.
The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. With this understanding in mind, the above technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium, such as ROM/RAM, magnetic disk, optical disk, etc., and includes commands for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method according to the embodiments or some parts of the embodiments.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (10)

1. A keyword combination extraction method is characterized by comprising the following steps:
determining a text to be recognized;
performing word segmentation on the text to be recognized, and performing cyclic shift on a word segmentation result to obtain a two-dimensional augmentation matrix of the text to be recognized;
and determining the keyword combination of the text to be recognized based on the two-dimensional augmentation matrix of the text to be recognized.
2. The keyword combination extraction method according to claim 1, wherein the performing cyclic shift on the word segmentation result to obtain the two-dimensional augmentation matrix of the text to be recognized comprises:
performing cyclic shift on row elements of a previous row of the current row based on a row sequence number of the current row of the two-dimensional augmentation matrix to obtain row elements of the current row;
updating the next row of the current row as the current row;
and the word segmentation result is a row element of any row in the two-dimensional augmentation matrix.
3. The method of claim 2, wherein the cyclically shifting row elements of a row previous to a current row of the two-dimensional augmented matrix based on a row sequence number of the current row to obtain row elements of the current row comprises:
determining a shifting direction and an offset corresponding to a current row of the two-dimensional augmentation matrix based on a row sequence number of the current row;
and cyclically shifting the row elements of the previous row of the current row based on the shifting direction and the offset, and determining the row elements of the current row.
4. The method for extracting keyword combinations according to claim 1, wherein the determining the keyword combinations of the text to be recognized based on the two-dimensional augmentation matrix of the text to be recognized comprises:
inputting the two-dimensional augmentation matrix of the text to be recognized into a text classification model to obtain a text classification result output by the text classification model; the text classification model is obtained by training based on a two-dimensional augmentation matrix of a sample text and a sample text classification result corresponding to the two-dimensional augmentation matrix;
and determining the keyword combination of the text to be recognized based on the text classification result.
5. The keyword combination extraction method according to claim 4, wherein the step of inputting the two-dimensional augmentation matrix of the text to be recognized into a text classification model to obtain a text classification result output by the text classification model specifically comprises:
inputting the two-dimensional augmentation matrix of the text to be recognized into a combination extraction layer of the text classification model to obtain a plurality of word segmentation combinations output by the combination extraction layer;
and inputting each word segmentation combination to an identification classification layer of the text classification model to obtain a text classification result output by the identification classification layer.
6. The keyword combination extraction method according to claim 5, wherein the step of inputting the two-dimensional augmentation matrix of the text to be recognized into a combination extraction layer of the text classification model to obtain a plurality of word segmentation combinations output by the combination extraction layer specifically comprises:
and inputting the two-dimensional augmentation matrix of the text to be recognized into the combination extraction layer, and sampling the two-dimensional augmentation matrix of the text to be recognized by the combination extraction layer based on the combination length to obtain a plurality of word segmentation combinations output by the combination extraction layer.
7. The method for extracting a keyword combination according to claim 5, wherein the determining a keyword combination of the text to be recognized based on the text classification result specifically includes:
and if the text classification result is abnormal, determining the keyword combination of the text to be recognized based on the activation value of each word segmentation combination in the text classification model.
8. The keyword combination extraction method according to any one of claims 1 to 7, wherein the number of rows of the two-dimensional augmentation matrix is:
Figure FDA0002562390330000021
where R is the number of rows of the two-dimensional augmentation matrix and C is the preset number of columns of the two-dimensional augmentation matrix.
9. A keyword combination extraction device is characterized by comprising:
the text determining unit is used for determining a text to be recognized;
the matrix determining unit is used for segmenting the text to be recognized and circularly shifting a segmentation result to obtain a two-dimensional augmentation matrix of the text to be recognized;
and the combination extraction unit is used for determining the keyword combination of the text to be recognized based on the two-dimensional augmentation matrix of the text to be recognized.
10. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the steps of the keyword combination extraction method according to any one of claims 1 to 8 when executing the computer program.
CN202010619049.0A 2020-06-30 2020-06-30 Keyword combination extraction method and device and electronic equipment Pending CN113869041A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010619049.0A CN113869041A (en) 2020-06-30 2020-06-30 Keyword combination extraction method and device and electronic equipment

Publications (1)

Publication Number Publication Date
CN113869041A true CN113869041A (en) 2021-12-31

Family

ID=78981652

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010619049.0A Pending CN113869041A (en) 2020-06-30 2020-06-30 Keyword combination extraction method and device and electronic equipment

Country Status (1)

Country Link
CN (1) CN113869041A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116471344A (en) * 2023-04-27 2023-07-21 无锡沐创集成电路设计有限公司 Keyword extraction method, device and medium for data message
CN116471344B (en) * 2023-04-27 2023-11-21 无锡沐创集成电路设计有限公司 Keyword extraction method, device and medium for data message

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination