CN117493493A

CN117493493A - Keyword definition method, keyword definition device, computer equipment and storage medium

Info

Publication number: CN117493493A
Application number: CN202311615082.6A
Authority: CN
Inventors: 邓维; 杨恺; 杨念梓; 贾唯秦
Original assignee: Industrial and Commercial Bank of China Ltd ICBC
Current assignee: Industrial and Commercial Bank of China Ltd ICBC
Priority date: 2023-11-29
Filing date: 2023-11-29
Publication date: 2024-02-02

Abstract

The application relates to a keyword definition method, a keyword definition device, computer equipment and a storage medium, in the technical field of prompt learning and artificial intelligence. The method comprises: acquiring a plurality of pieces of text information requiring classification definition, and extracting the keywords of each piece of text information; for each piece of text information, analyzing the distribution probability of its keywords and screening the target keywords of the text information accordingly; for each piece of text information, generating feature vectors based on the target keywords and their distribution probabilities, and calculating the similarity between the feature vectors to determine text characterization information; for each text category, calculating tag characterization information of the text category; and screening the text characterization information having the maximum distribution probability with respect to the tag characterization information as the keyword definition information of the text category. The method can improve the accuracy of keyword definition for text classification in the professional field.

Description

Keyword definition method, keyword definition device, computer equipment and storage medium

Technical Field

The present application relates to the field of prompt learning and artificial intelligence technologies, and in particular, to a keyword definition method, apparatus, computer device, and storage medium.

Background

In a traditional prompt learning model, defining keywords (the verbalizer) for text classification in the general field is relatively simple. Text classification of data in a professional field (such as the financial field), however, generally requires more domain knowledge, which makes defining keywords for text classification in the professional field very difficult. How to improve the accuracy of keyword definition for text classification in the professional field is therefore a current research focus.

The traditional technical scheme defines the keywords for text classification in the professional field manually. This approach not only consumes a large amount of manpower, but the manually defined keywords also fail to summarize the data characteristics of the data in a text category, so the accuracy of keyword definition for text classification in the professional field is low.

Disclosure of Invention

In view of the foregoing, it is desirable to provide a keyword definition method, apparatus, computer device, computer readable storage medium, and computer program product.

In a first aspect, the present application provides a keyword definition method. The method comprises the following steps:

acquiring a plurality of text information needing classification definition, and extracting keywords of each text information;

for each piece of text information, analyzing the distribution probability of each of its keywords in the text information, and screening target keywords of the text information based on the distribution probability;

for each piece of text information, generating a feature vector of the text information based on a target keyword of the text information and the distribution probability of each target keyword, and respectively calculating the average similarity between each feature vector and other feature vectors;

taking the feature vector corresponding to the maximum average similarity as text characterization information of the text information, and identifying the text category corresponding to each text information;

for each text category, calculating tag characterization information of the text category based on text characterization information of each text information corresponding to the text category, and respectively calculating distribution probability between each text characterization information and the tag characterization information;

and screening text characterization information with the maximum distribution probability as keyword definition information of the text category.

Optionally, the extracting the keyword of each text message includes:

dividing the text information according to the language segments contained in the text information aiming at each text information to obtain a plurality of text segments, and identifying semantic features of each text segment through a text feature identification network;

calculating the similarity between each word information of each text segment and the semantic feature of each text segment, and screening the word information corresponding to the maximum similarity in each text segment as a sub-keyword of the text segment;

and taking the sub-keywords of all the text segments as keywords of the text information.

Optionally, the analyzing the distribution probability of the keywords of each text message in the text message includes:

identifying, for each keyword, the number of the keywords in the text information and identifying the number of all word information of the text information;

and calculating the ratio of the number of the keywords to the number of all word information to obtain the distribution probability of the keywords in the text information.

Optionally, the calculating the average similarity between each feature vector and other feature vectors includes:

the cosine similarity between every two feature vectors and the loss function between every two feature vectors are calculated respectively, and the similarity between every two feature vectors is calculated based on the cosine similarity between every two feature vectors and the loss function between every two feature vectors;

for each feature vector, calculating an average similarity between the feature vector and the other feature vectors based on the similarity between the feature vector and each of the other feature vectors.

Optionally, the identifying the text category corresponding to each text message includes:

acquiring sample text categories, and respectively extracting category characteristic information of each sample text category;

identifying text content corresponding to each category characteristic information, and respectively calculating similarity distances between target keywords of each text information and the text content corresponding to each category characteristic information;

and screening a sample text category to which the category characteristic information corresponding to the minimum similarity distance belongs as the text category of the text information aiming at each text information.

Optionally, the calculating the distribution probability between each text characterization information and the tag characterization information includes:

identifying each piece of text characterization information, positioning information in the text information corresponding to each piece of text characterization information, and calculating a similarity grading value between each piece of text characterization information and each piece of tag characterization information through a grading function based on the positioning information corresponding to each piece of text characterization information and each piece of tag characterization information;

calculating, through a distribution probability algorithm, the distribution probability between each text characterization information and the tag characterization information of the text category of its corresponding text information, based on the similarity score value between that text characterization information and that tag characterization information and on the similarity score values between that text characterization information and each tag characterization information.

Optionally, after the text characterization information with the maximum distribution probability is used as the keyword definition information of the text category, the method further includes:

acquiring actual keyword definition information of each text category in response to a keyword definition information uploading operation of a user, and identifying keyword definition information of the text category and deviation information between the actual keyword definition information of the text category for each text category;

and calculating a proportion value between the deviation information and the keyword definition information, and adjusting a scoring parameter of the scoring function to the text category based on the deviation information under the condition that the proportion value is larger than a preset proportion threshold value to obtain a new scoring function.

In a second aspect, the present application further provides a keyword definition apparatus. The device comprises:

the acquisition module is used for acquiring a plurality of text information needing classification definition and extracting keywords of each text information;

the analysis module is used for analyzing the distribution probability of the keywords of each text message in the text message aiming at each text message, and screening target keywords of the text message based on the distribution probability;

the generation module is used for generating a feature vector of each text message based on the target keyword of the text message and the distribution probability of each target keyword, and respectively calculating the average similarity between each feature vector and other feature vectors;

the recognition module is used for taking the feature vector corresponding to the maximum average similarity as text characterization information of the text information and recognizing the text category corresponding to each text information;

the computing module is used for computing label characterization information of each text category based on the text characterization information of each text information corresponding to the text category and respectively computing the distribution probability between each text characterization information and the label characterization information;

and the screening module is used for screening text characterization information with the maximum distribution probability and taking the text characterization information as keyword definition information of the text category.

Optionally, the acquiring module is specifically configured to:

Optionally, the analysis module is specifically configured to:

Optionally, the generating module is specifically configured to:

Optionally, the identification module is specifically configured to:

Optionally, the computing module is specifically configured to:

Optionally, the apparatus further includes:

the response module is used for responding to the keyword definition information uploading operation of the user, acquiring the actual keyword definition information of each text category, and identifying the keyword definition information of the text category and the deviation information between the actual keyword definition information of the text category for each text category;

and the adjustment module is used for calculating the ratio value between the deviation information and the keyword definition information, and adjusting the scoring parameters of the scoring function on the text category based on the deviation information under the condition that the ratio value is larger than a preset ratio threshold value to obtain a new scoring function.

In a third aspect, the present application provides a computer device. The computer device comprises a memory storing a computer program and a processor implementing the steps of the method of any of the first aspects when the processor executes the computer program.

In a fourth aspect, the present application provides a computer-readable storage medium. The computer-readable storage medium has stored thereon a computer program which, when executed by a processor, implements the steps of the method of any of the first aspects.

In a fifth aspect, the present application provides a computer program product. The computer program product comprising a computer program which, when executed by a processor, implements the steps of the method of any of the first aspects.

The keyword definition method, the keyword definition device, the computer equipment, the storage medium and the computer program product are used for acquiring a plurality of text information needing classification definition and extracting keywords of each text information; analyzing the distribution probability of keywords of each text message in the text message aiming at each text message, and screening target keywords of the text message based on the distribution probability; for each piece of text information, generating a feature vector of the text information based on a target keyword of the text information and the distribution probability of each target keyword, and respectively calculating the average similarity between each feature vector and other feature vectors; taking the feature vector corresponding to the maximum average similarity as text characterization information of the text information, and identifying the text category corresponding to each text information; for each text category, calculating tag characterization information of the text category based on text characterization information of each text information corresponding to the text category, and respectively calculating distribution probability between each text characterization information and the tag characterization information; and screening text characterization information with the maximum distribution probability as keyword definition information of the text category. 
According to this scheme, target keywords are screened for each text information, and the similarity between the resulting feature vectors is used to calculate the text characterization information of each text information. The text category of each text information is then identified, and the tag characterization information corresponding to the text information of the same text category is calculated. Finally, the text characterization information having the maximum distribution probability with respect to the tag characterization information is screened as the keyword definition information of the text category. Low-precision manual definition is thereby replaced by similarity identification between characterization information. Because the keyword definition information of a text category is screened through semantic refinement of the tag characterization information corresponding to the text characterization information of each text information, it summarizes the text features of the text information in the text category more comprehensively, which improves the accuracy of keyword definition for text classification in the professional field.

Drawings

FIG. 1 is a flow chart of a keyword definition method in one embodiment;

FIG. 2 is a flow diagram of an example of keyword definition in one embodiment;

FIG. 3 is a block diagram of a keyword definition apparatus in one embodiment;

FIG. 4 is an internal structural diagram of a computer device in one embodiment.

Detailed Description

In order to make the objects, technical solutions and advantages of the present application more apparent, the present application will be further described in detail with reference to the accompanying drawings and examples. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the present application.

The keyword definition method provided by the embodiment of the application can be applied to an application environment for prompt-based text learning. The method can be applied to a terminal, to a server, or to a system comprising the terminal and the server, in which case it is realized through interaction between the terminal and the server. The terminal may be, but is not limited to, various personal computers, notebook computers, smart phones, tablet computers, and the like. The terminal screens the target keywords of each text information and uses the similarity between the resulting feature vectors to calculate the text characterization information of each text information. It then identifies the text category of each text information, calculates the tag characterization information corresponding to the text information of the same text category, and screens the text characterization information having the maximum distribution probability with respect to the tag characterization information as the keyword definition information of the text category. Low-precision manual definition is thereby replaced by similarity identification between characterization information. Because the keyword definition information of a text category is screened through semantic refinement of the tag characterization information corresponding to the text characterization information of each text information, it summarizes the text features of the text information in the text category more comprehensively, which improves the accuracy of keyword definition for text classification in the professional field.

In one embodiment, as shown in fig. 1, a keyword definition method is provided, and the method is applied to a terminal for illustration, and includes the following steps:

step S101, a plurality of text information needing classification definition is obtained, and keywords of each text information are extracted.

In this embodiment, the terminal obtains, in response to an information uploading operation by a user, a plurality of pieces of text information requiring classification definition. The pieces of text information differ in text category, where a text category distinguishes text information by the department to which it belongs and by its application function. Then, the terminal extracts the keywords of each text information respectively. The specific extraction process will be described in detail later. A keyword is word information that can represent the semantics of the text information and is important for understanding the text information. One piece of text information contains a plurality of word information.

Step S102, analyzing the distribution probability of the keywords of each text message in the text message according to each text message, and screening the target keywords of the text message based on the distribution probability.

In this embodiment, the terminal analyzes, for each text information, the distribution probability of each of its keywords in the text information, and screens target keywords of the text information based on the distribution probability. Specifically, a distribution probability threshold is preset in the terminal; from the keywords of each text information, the terminal screens the keywords whose distribution probability is greater than the threshold and takes them as the target keywords of the text information.
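The threshold screening in step S102 can be sketched as follows. This is a minimal illustration only: the function name `screen_target_keywords`, the example keywords and the threshold value 0.03 are hypothetical and do not come from the application.

```python
def screen_target_keywords(keyword_probs, threshold):
    """Keep the keywords whose distribution probability is strictly
    greater than the preset threshold (illustrative helper)."""
    return {kw: p for kw, p in keyword_probs.items() if p > threshold}


# Hypothetical distribution probabilities for one piece of text information.
example_probs = {"loan": 0.08, "interest": 0.05, "branch": 0.01}
targets = screen_target_keywords(example_probs, threshold=0.03)
print(targets)
```

A strict comparison is used because the text speaks of probabilities "greater than" the threshold; a keyword exactly at the threshold is therefore dropped in this sketch.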

Step S103, generating feature vectors of the text information based on the target keywords of the text information and the distribution probability of each target keyword for each text information, and calculating the average similarity between each feature vector and other feature vectors respectively.

In this embodiment, the terminal generates a feature vector of the text information based on the target keyword of the text information and the distribution probability of each target keyword for each text information. The formula corresponding to the feature vector of the text information is as follows:

In the above formula, the feature vector is a K-dimensional vector, each dimension of which represents the importance of a target keyword (i.e., the distribution probability of that keyword); taken as a whole, it can be regarded as a feature vector of the input X_i in the hidden space, and i is the index of the keyword.
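As a rough illustration of the K-dimensional feature vector, the sketch below places the distribution probability of each target keyword in one dimension. The shared `keyword_order` list and the use of 0.0 for keywords absent from a text are assumptions made here so that vectors of different texts are comparable; the application does not specify them.

```python
def build_feature_vector(target_probs, keyword_order):
    """Return a K-dimensional vector whose k-th entry is the distribution
    probability of the k-th keyword in keyword_order (0.0 if absent).
    keyword_order is a hypothetical shared ordering of keywords."""
    return [target_probs.get(kw, 0.0) for kw in keyword_order]


keyword_order = ["interest", "loan", "rate"]          # K = 3, illustrative
vector = build_feature_vector({"loan": 0.08, "interest": 0.05}, keyword_order)
print(vector)
```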

Then, the terminal calculates the average similarity between each feature vector and other feature vectors, respectively. The specific average similarity calculation process will be described in detail later.

Step S104, the feature vector corresponding to the maximum average similarity is used as text characterization information of the text information, and the text category corresponding to each text information is identified.

In this embodiment, the terminal uses the feature vector corresponding to the maximum average similarity as text characterization information of the text information, and identifies the text category corresponding to each text information. The text categories are preset in the text information of the terminal, and are obtained by classifying the text information related to the financial field according to the service requirement by staff.

Step S105, for each text category, calculating tag characterization information of the text category based on text characterization information of each text information corresponding to the text category, and calculating distribution probability between each text characterization information and the tag characterization information respectively.

In this embodiment, the terminal calculates, for each text category, tag characterization information of the text category based on text characterization information of each text information corresponding to the text category.

The specific calculation formula for calculating the tag characterization information is as follows:

In the above formula, I_l is the number of all text information of text category l, C_l is the tag characterization information, the vector term corresponds to each text information, i is the index of the target keyword in the text information, and K is the dimension of each target keyword in a single text information.
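One plausible reading of the symbols above is that the tag characterization information C_l aggregates the text characterization vectors of the I_l texts in category l. The sketch below assumes a simple element-wise mean; this averaging is an assumption for illustration, not a formula confirmed by the text, and the function name is hypothetical.

```python
def tag_characterization(text_vectors):
    """Element-wise mean of the text characterization vectors of one
    text category (assumed aggregation, for illustration only)."""
    if not text_vectors:
        raise ValueError("a text category needs at least one text vector")
    count = len(text_vectors)            # plays the role of I_l
    dim = len(text_vectors[0])           # plays the role of K
    return [sum(v[k] for v in text_vectors) / count for k in range(dim)]


c_l = tag_characterization([[0.0, 2.0], [2.0, 4.0]])
print(c_l)
```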

Then, the terminal calculates the distribution probability between each text characterization information and the tag characterization information respectively. The specific calculation process will be described in detail later.

And S106, screening text characterization information with the maximum distribution probability as keyword definition information of the text category.

In this embodiment, the terminal screens the text characterization information with the maximum distribution probability as the keyword definition information of the text category.
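Steps S105 and S106 can be illustrated together with a short sketch. It assumes that the scoring function is a dot product and that the distribution probability algorithm is a softmax over the scores against all tag characterization vectors; both are stand-ins chosen for illustration, since the application's own scoring function is defined separately. All names and example values are hypothetical.

```python
import math


def dot(a, b):
    """Assumed scoring function: plain dot product."""
    return sum(x * y for x, y in zip(a, b))


def distribution_probability(text_vec, label, tag_vectors, score=dot):
    """Softmax probability of `label` given the scores of text_vec
    against every tag characterization vector (assumed algorithm)."""
    scores = {name: score(text_vec, c) for name, c in tag_vectors.items()}
    top = max(scores.values())           # subtract max for numerical stability
    exps = {name: math.exp(s - top) for name, s in scores.items()}
    return exps[label] / sum(exps.values())


def select_keyword_definition(text_vectors, label, tag_vectors):
    """Pick the text characterization vector with the largest
    distribution probability for its own category."""
    return max(text_vectors,
               key=lambda v: distribution_probability(v, label, tag_vectors))


tag_vectors = {"credit": [1.0, 0.0], "deposit": [0.0, 1.0]}
credit_texts = [[0.9, 0.1], [0.6, 0.4]]
best = select_keyword_definition(credit_texts, "credit", tag_vectors)
print(best)
```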

Based on this scheme, target keywords are screened for each text information, and the similarity between the resulting feature vectors is used to calculate the text characterization information of each text information. The text category of each text information is then identified, the tag characterization information corresponding to the text information of the same text category is calculated, and the text characterization information having the maximum distribution probability with respect to the tag characterization information is screened as the keyword definition information of the text category. Low-precision manual definition is thereby replaced by similarity identification between characterization information. Because the keyword definition information of a text category is screened through semantic refinement of the tag characterization information corresponding to the text characterization information of each text information, it summarizes the text features of the text information in the text category more comprehensively, which improves the accuracy of keyword definition for text classification in the professional field.

Optionally, extracting the keyword of each text message includes: dividing the text information according to the speech segments contained in the text information aiming at each text information to obtain a plurality of text segments, and identifying the semantic features of each text segment through a text feature identification network; calculating the similarity between each word information of each text segment and the semantic feature of each text segment, and screening the word information corresponding to the maximum similarity in each text segment as a sub-keyword of the text segment; and taking the sub-keywords of all the text segments as keywords of the text information.

In this embodiment, for each text information, the terminal divides the text information according to the speech segments it contains to obtain a plurality of text segments, and identifies the semantic features of each text segment through the text feature identification network. The text feature identification network is a fully convolutional neural network based on an attention mechanism. Then, the terminal calculates the similarity between the word information of each text segment and the semantic feature of that text segment. The similarity is calculated as follows: the similarity distance between each word information and the semantic feature is calculated through a similarity distance algorithm; the word information is sorted by these similarity distances to obtain a similarity distance sequence; and, for each word information, the terminal determines its position in the similarity distance sequence and divides that position by the number of all word information to obtain the similarity between the word information and the semantic feature. The similarity distance algorithm may be, but is not limited to, the Euclidean distance, the Mahalanobis distance, or the like.

The terminal screens word information corresponding to the maximum similarity in each text segment to serve as sub-keywords of the text segment. And finally, the terminal takes the sub-keywords of all the text segments as keywords of the text information.
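The per-segment selection described above can be sketched as follows. The sketch assumes that the semantic feature of each segment and a vector for each word are already available (the text feature identification network itself is outside its scope), and it uses cosine similarity as a stand-in for the rank-based similarity described above. All names and vectors are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if a norm is zero)."""
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return sum(x * y for x, y in zip(a, b)) / norm if norm else 0.0


def extract_keywords(segments):
    """segments: list of (segment_feature, {word: word_vector}) pairs.
    Returns one sub-keyword per segment: the word most similar to the
    segment's semantic feature."""
    keywords = []
    for segment_feature, word_vectors in segments:
        best = max(word_vectors,
                   key=lambda w: cosine(word_vectors[w], segment_feature))
        keywords.append(best)
    return keywords


segments = [
    ([1.0, 0.0], {"loan": [0.9, 0.1], "the": [0.1, 0.9]}),
    ([0.0, 1.0], {"rate": [0.2, 0.8], "and": [0.7, 0.3]}),
]
keywords = extract_keywords(segments)
print(keywords)
```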

Based on this scheme, the keywords are screened by identifying the semantic features of each text segment, so that the screened keywords represent the meaning of the text information more accurately.

Optionally, analyzing the distribution probability of the keywords of each text message in the text message includes: identifying, for each keyword, the number of keywords in the text information and identifying the number of all word information of the text information; and calculating the ratio of the number of the keywords to the number of all word information to obtain the distribution probability of the keywords in the text information.

In this embodiment, the terminal identifies the number of keywords in the text information for each keyword, and identifies the number of all word information of the text information. And then, the terminal calculates the ratio of the number of the keywords to the number of all word information to obtain the distribution probability of the keywords in the text information.
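The ratio described above can be written directly. The following is a minimal sketch; the function name and the tokenized word list are illustrative assumptions.

```python
from collections import Counter


def distribution_probabilities(words, keywords):
    """For each keyword, divide its number of occurrences in `words`
    by the total number of word information in the text."""
    total = len(words)
    if total == 0:
        return {kw: 0.0 for kw in keywords}
    counts = Counter(words)
    return {kw: counts[kw] / total for kw in keywords}


# Hypothetical tokenized text information.
words = ["loan", "rate", "loan", "bank"]
probs = distribution_probabilities(words, ["loan", "rate"])
print(probs)
```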

Based on the scheme, the distribution probability of the keywords in the text information is calculated by identifying the number of the keywords in the text information, so that the accuracy of the calculated distribution probability is improved.

Optionally, calculating the average similarity between each feature vector and other feature vectors includes: the cosine similarity between every two feature vectors and the loss function between every two feature vectors are calculated respectively, and the similarity between every two feature vectors is calculated based on the cosine similarity between every two feature vectors and the loss function between every two feature vectors; for each feature vector, an average similarity between each feature vector and other feature vectors is calculated based on the similarity between the feature vector and other feature vectors.

In this embodiment, the terminal calculates the cosine similarity between every two feature vectors and the loss function between every two feature vectors, and calculates the similarity between every two feature vectors based on the cosine similarity between every two feature vectors and the loss function between every two feature vectors.

The cosine similarity algorithm has a calculation formula as follows:

In the above formula, the first vector is the i-th feature vector in the text information and the second vector is the j-th feature vector in the text information, where i is not equal to j and both i and j are indices of the feature vectors.

The loss function is defined based on the InfoNCE loss used in contrastive learning, and the calculation formula of the loss function is as follows:

In the above formula, τ is a temperature coefficient that adjusts the shape of the similarity distribution: the larger τ is, the smoother the distribution, and the smaller τ is, the sharper the distribution. The first vector is the i-th feature vector in the text information and the second vector is the j-th feature vector in the text information, where i is not equal to j and both i and j are indices of the feature vectors.
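The effect of the temperature coefficient can be seen in a small numerical sketch. It does not reproduce the loss function of the application; it only shows how dividing similarities by τ before a softmax normalization, as InfoNCE-style losses do, changes the shape of the resulting distribution. The function name and the similarity values are illustrative.

```python
import math


def softmax_with_temperature(similarities, tau):
    """Softmax over similarities / tau; tau must be positive."""
    if tau <= 0:
        raise ValueError("tau must be positive")
    scaled = [s / tau for s in similarities]
    top = max(scaled)                     # subtract max for numerical stability
    exps = [math.exp(s - top) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


sims = [0.9, 0.5, 0.1]
sharp = softmax_with_temperature(sims, tau=0.1)    # small tau: peaked
smooth = softmax_with_temperature(sims, tau=10.0)  # large tau: near uniform
print(sharp, smooth)
```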

Then, for each feature vector, the terminal sums the similarities between that feature vector and the other feature vectors and divides the sum by the number of all feature vectors to obtain the average similarity between that feature vector and the other feature vectors.
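The averaging step can be sketched as follows. For simplicity the sketch uses the cosine similarity alone as the pairwise similarity; how the application combines the cosine similarity with the loss function is not reproduced here. Following the text, the sum over the other vectors is divided by the number of all feature vectors. Function names and example vectors are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Standard cosine similarity (0.0 if either vector has zero norm)."""
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return sum(x * y for x, y in zip(a, b)) / norm if norm else 0.0


def average_similarities(vectors):
    """For each vector, sum its similarity to every other vector and
    divide by the number of all vectors."""
    n = len(vectors)
    averages = []
    for i, vi in enumerate(vectors):
        total = sum(cosine_similarity(vi, vj)
                    for j, vj in enumerate(vectors) if j != i)
        averages.append(total / n)
    return averages


vectors = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
averages = average_similarities(vectors)
# Index of the vector with the maximum average similarity (first on ties).
best_index = max(range(len(vectors)), key=lambda i: averages[i])
print(averages, best_index)
```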

Based on this scheme, calculating both the cosine similarity and the loss function refines the similarity obtained between every two feature vectors, which improves the accuracy of the calculated similarity.

Optionally, identifying the text category corresponding to each text message includes: acquiring sample text categories, and respectively extracting category characteristic information of each sample text category; identifying text content corresponding to each category of characteristic information, and respectively calculating similarity distances between the target keywords of each text information and the text content corresponding to each category of characteristic information; and screening the sample text category of the category characteristic information corresponding to the minimum similarity distance for each text information, and taking the sample text category as the text category of the text information.

In this embodiment, the terminal obtains the sample text category, and extracts the category feature information of each sample text category respectively. The method for extracting the category characteristic information comprises the steps of identifying text information of a sample text category, and then identifying the characteristic information of the text information of the sample text category through a text characteristic identification network to obtain the category characteristic information.

The terminal identifies the text content corresponding to each category characteristic information, and calculates the similarity distance between the target keywords of each text information and the text content corresponding to each category characteristic information. The similarity distance algorithm is the Euclidean distance algorithm, the Mahalanobis distance algorithm, or the like.

For each text information, the terminal screens the sample text category to which the category characteristic information corresponding to the minimum similarity distance belongs, and takes it as the text category of the text information.
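The minimum-distance selection can be sketched as follows, using the Euclidean distance. The sketch assumes that the target keywords of a text information and the text content of each category have already been turned into vectors of equal length; that vectorization, the function names and the example categories are hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def identify_category(text_vector, category_vectors):
    """Return the name of the sample text category whose vector is
    closest to text_vector."""
    return min(category_vectors,
               key=lambda name: euclidean(text_vector, category_vectors[name]))


category_vectors = {"credit": [1.0, 0.0], "deposit": [0.0, 1.0]}
category = identify_category([0.9, 0.1], category_vectors)
print(category)
```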

Based on the scheme, after the category characteristic information of the sample text category is extracted, the similarity distance between each text information and the text category is identified, so that the text category corresponding to the text information is screened, and the accuracy of the text category of the identified text information is improved.
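
The minimum-distance selection described above can be sketched as follows. This is only an illustration under stated assumptions: the target keywords and the category feature information are assumed to be already encoded as numeric vectors of equal length, Euclidean distance is used as one of the distance options named above, and the names `euclidean`, `assign_category` and the sample categories are hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def assign_category(keyword_vector, category_features, distance=euclidean):
    """Return the sample text category with the minimum similarity distance.

    category_features maps a category name to its feature vector
    (an assumed encoding; the source does not specify one).
    """
    return min(
        category_features,
        key=lambda name: distance(keyword_vector, category_features[name]),
    )


# Hypothetical sample categories and feature vectors.
sample_categories = {"loans": [1.0, 0.0], "cards": [0.0, 1.0]}
print(assign_category([0.9, 0.2], sample_categories))
```

Passing a different callable as `distance` would allow another distance measure to be substituted without changing the selection logic.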

Optionally, calculating the distribution probability between each text characterization information and the tag characterization information respectively includes: identifying each text characterization information and its position information in the corresponding text information, and calculating, through a scoring function, a similarity score value between each text characterization information and each tag characterization information based on the position information corresponding to each text characterization information and each tag characterization information; and calculating, through a distribution probability algorithm, the distribution probability between each text characterization information and the tag characterization information of the text category of its corresponding text information, based on the similarity score value between each text characterization information and the tag characterization information of that text category and the similarity score values between each text characterization information and each tag characterization information.

In this embodiment, the terminal identifies each text characterization information and its position information in the corresponding text information. Based on the position information corresponding to each text characterization information and each tag characterization information, the terminal calculates a similarity score value between each text characterization information and each tag characterization information through a scoring function. The calculation formula of the similarity score value obtained by the scoring function S is as follows:

In the above formula, T is the position information of each text characterization information, C_l is the tag characterization information, S is the scoring function, and l ranges over the set of all tag characterization information.

Based on the similarity score value between each text characterization information and the tag characterization information of the text category of its corresponding text information, and on the similarity score values between each text characterization information and each tag characterization information, the terminal calculates, through a distribution probability algorithm, the distribution probability between each text characterization information and the tag characterization information of that text category. The calculation formula of the distribution probability is as follows:

In the above formula, L is the number of all text categories; the first score symbol denotes the similarity score value between each text characterization information and the tag characterization information of the text category of its corresponding text information; and the second score symbol denotes the similarity score value between each text characterization information and each tag characterization information.

Based on the scheme, the distribution probability between each piece of text characterization information and each piece of label characterization information is calculated by calculating the similarity score value, so that the accuracy of the calculated distribution probability between each piece of text characterization information and each piece of label characterization information is improved.
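
As an illustration only, one common way to turn the similarity score values into a distribution probability over the L text categories is a softmax normalization. The sketch below assumes that form; the source text does not reproduce the formula, so the softmax choice and the names `distribution_probability` and `select_keyword_definition` are assumptions.

```python
import math


def distribution_probability(scores, target_index):
    """Softmax probability of the target label (assumed normalization).

    scores[l] is the similarity score value between one text
    characterization information and the l-th tag characterization
    information; target_index is the label of its own text category.
    """
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    return exps[target_index] / sum(exps)


def select_keyword_definition(score_rows, target_index):
    """Index of the text characterization with the maximum distribution probability."""
    probabilities = [distribution_probability(row, target_index) for row in score_rows]
    return max(range(len(probabilities)), key=probabilities.__getitem__)


# Hypothetical scores of two text characterizations against three labels.
rows = [[0.2, 1.5, 0.1], [0.9, 1.0, 0.8]]
print(select_keyword_definition(rows, target_index=1))
```

Under this assumption, the probabilities for one text characterization information sum to 1 across all labels, so a higher score for its own category relative to the other categories yields a higher distribution probability.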

Optionally, after screening the text characterization information with the maximum distribution probability as the keyword definition information of the text category, the method further includes: in response to a keyword definition information uploading operation of a user, acquiring actual keyword definition information of each text category, and identifying, for each text category, deviation information between the keyword definition information of the text category and the actual keyword definition information of the text category; and calculating a proportion value between the deviation information and the keyword definition information, and, when the proportion value is greater than a preset proportion threshold, adjusting a scoring parameter of the scoring function for the text category based on the deviation information to obtain a new scoring function.

In this embodiment, the terminal responds to the keyword definition information uploading operation of the user and acquires the actual keyword definition information of each text category. The actual keyword definition information is produced as follows: staff select target text categories from all the text categories and then manually define each keyword of those text categories. The terminal then receives the actual keyword definition information of each target text category uploaded by the staff.

The terminal identifies, for each text category, deviation information between the keyword definition information of the text category and the actual keyword definition information of the text category. The terminal then calculates a proportion value between the deviation information and the keyword definition information. The proportion value may be, but is not limited to, the ratio of the number of text fields corresponding to the deviation information to the number of text fields corresponding to all the keyword definition information. The terminal presets a proportion threshold and, when the proportion value is greater than the preset proportion threshold, adjusts the scoring parameter of the scoring function for the text category based on the deviation information to obtain a new scoring function. Specifically, the terminal identifies, in the training database of the scoring function, the ratio between the parameter change value of each scoring parameter and the number of deviating text fields; it then determines the scoring parameter change value corresponding to the deviation information based on the number of text fields of the deviation information, and adds that change value to the scoring parameter of the scoring function for the text category to obtain a new scoring parameter.

Based on this scheme, the scoring parameters are adjusted using the actual keyword definition information uploaded by the user, which improves the accuracy of the similarity score value calculated by the scoring function between each text characterization information and each tag characterization information.
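
The threshold check and parameter update described above can be sketched as follows. This is a minimal sketch under assumptions: deviation is taken to be the text fields of the generated definition that are absent from the staff-supplied definition, the per-field change ratio is passed in as `change_per_field`, and the function name, the threshold value and the sample fields are all hypothetical.

```python
def adjust_scoring_parameter(parameter, generated_fields, actual_fields,
                             change_per_field, threshold=0.2):
    """Return (new_parameter, proportion_value).

    generated_fields: text fields of the generated keyword definition.
    actual_fields: text fields of the staff-supplied definition.
    change_per_field: parameter change per deviating text field, assumed
        to have been derived from the scoring function's training database.
    """
    actual = set(actual_fields)
    # Assumed notion of deviation: generated fields missing from the actual definition.
    deviation = [field for field in generated_fields if field not in actual]
    proportion = len(deviation) / len(generated_fields) if generated_fields else 0.0
    if proportion > threshold:
        # Change value scales with the number of deviating text fields.
        return parameter + change_per_field * len(deviation), proportion
    return parameter, proportion


new_parameter, proportion = adjust_scoring_parameter(
    1.0, ["a", "b", "c", "d"], ["a", "b"], change_per_field=0.05)
print(new_parameter, proportion)
```

When the proportion value does not exceed the threshold, the parameter is returned unchanged, which matches the conditional adjustment in the text.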

The application also provides a keyword definition example, as shown in fig. 2, and the specific processing procedure includes the following steps:

Step S201, a plurality of text information that needs to be defined by classification is acquired.

Step S202, dividing the text information according to the language segments contained in the text information to obtain a plurality of text segments, and identifying the semantic features of each text segment through a text feature identification network.

Step S203, the similarity between the word information of each text segment and the semantic feature of each text segment is calculated, and the word information corresponding to the maximum similarity in each text segment is screened and used as the sub-keyword of the text segment.

Step S204, the sub-keywords of all text segments are used as keywords of the text information.

Step S205, for each text information and for each keyword of that text information, identifying the number of occurrences of the keyword in the text information, and identifying the number of all word information in the text information.

Step S206, calculating the ratio of the number of the keywords to the number of all word information to obtain the distribution probability of the keywords in the text information.

Step S207, screening target keywords of the text information based on the distribution probability.

Step S208, for each text information, generating a feature vector of the text information based on the target keywords of the text information and the distribution probability of each target keyword.

Step S209, the cosine similarity between every two feature vectors and the loss function between every two feature vectors are calculated, and the similarity between every two feature vectors is calculated based on the cosine similarity between every two feature vectors and the loss function between every two feature vectors.

Step S210, for each feature vector, calculating an average similarity between each feature vector and other feature vectors based on the similarity between the feature vector and other feature vectors.

Step S211, using the feature vector corresponding to the maximum average similarity as text characterization information of the text information.

Step S212, sample text categories are obtained, and category characteristic information of each sample text category is extracted respectively.

Step S213, identifying the text content corresponding to each category characteristic information, and respectively calculating the similarity distance between the target keyword of each text information and the text content corresponding to each category characteristic information.

Step S214, for each text information, screening the sample text category of the category feature information corresponding to the minimum similarity distance as the text category of the text information.

Step S215, for each text category, calculating tag characterization information of the text category based on the text characterization information of each text information corresponding to the text category.

Step S216, identifying each piece of text characterization information, and calculating a similarity score value between each piece of text characterization information and each piece of tag characterization information through a scoring function based on the position information corresponding to each piece of text characterization information and each piece of tag characterization information.

Step S217, calculating, by a distribution probability algorithm, a distribution probability between each text characterization information and tag characterization information of the text category of the text information corresponding to each text characterization information based on the similarity score value between each text characterization information and tag characterization information of the text category of the text information corresponding to each text characterization information and the similarity score value between each text characterization information and each tag characterization information.

Step S218, screening the text characterization information with the maximum distribution probability as the keyword definition information of the text category.
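
Steps S205 to S211 can be sketched as follows. This is only an illustrative sketch under assumptions: target keywords are kept when their distribution probability reaches a hypothetical threshold, feature vectors are built over a shared keyword vocabulary, and only the cosine similarity is used (the loss-function term of step S209 is omitted because its form is not given here). All function names are hypothetical.

```python
import math
from collections import Counter


def keyword_distribution(words, keywords):
    """S205-S206: occurrences of each keyword divided by the total word count."""
    counts = Counter(words)
    total = len(words) or 1  # avoid division by zero for empty text
    return {k: counts[k] / total for k in keywords}


def select_targets(distribution, min_probability):
    """S207: keep keywords at or above a threshold (the rule is an assumption)."""
    return {k: p for k, p in distribution.items() if p >= min_probability}


def to_vector(targets, vocabulary):
    """S208: one slot per vocabulary keyword, holding its distribution probability."""
    return [targets.get(k, 0.0) for k in vocabulary]


def cosine(u, v):
    """S209 (cosine part only): cosine similarity, 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def most_representative(vectors):
    """S210-S211: index of the vector with the highest average similarity to the rest."""
    def average_similarity(i):
        others = [cosine(vectors[i], v) for j, v in enumerate(vectors) if j != i]
        return sum(others) / len(others) if others else 0.0
    return max(range(len(vectors)), key=average_similarity)


words = ["loan", "rate", "loan", "fee"]
targets = select_targets(keyword_distribution(words, ["loan", "fee"]), 0.3)
print(to_vector(targets, ["loan", "fee"]))
```

The vector returned by `most_representative` plays the role of the text characterization information in step S211; the later steps (category assignment, tag characterization and distribution probability) would operate on it.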

It should be understood that, although the steps in the flowcharts of the embodiments described above are shown sequentially as indicated by the arrows, these steps are not necessarily performed in the order indicated by the arrows. Unless explicitly stated herein, the execution order of the steps is not strictly limited, and the steps may be executed in other orders. Moreover, at least some of the steps in the flowcharts of the above embodiments may include a plurality of sub-steps or stages, which are not necessarily performed at the same time but may be performed at different times, and which are not necessarily performed sequentially but may be performed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.

Based on the same inventive concept, the embodiments of the application also provide a keyword definition device for implementing the keyword definition method described above. The device solves the problem in a manner similar to that described for the method, so for the specific limitations in the one or more keyword definition device embodiments provided below, reference may be made to the limitations of the keyword definition method above, which are not repeated here.

In one embodiment, as shown in fig. 3, there is provided a keyword definition apparatus, including: acquisition module 310, analysis module 320, generation module 330, identification module 340, calculation module 350, and screening module 360, wherein:

an acquisition module 310, configured to acquire a plurality of text information that needs to be defined by classification, and extract keywords of each text information;

an analysis module 320, configured to analyze, for each text information, the distribution probability of each keyword of the text information in the text information, and screen target keywords of the text information based on the distribution probability;

a generation module 330, configured to generate, for each text information, a feature vector of the text information based on the target keywords of the text information and the distribution probability of each target keyword, and calculate an average similarity between each feature vector and the other feature vectors;

an identification module 340, configured to use the feature vector corresponding to the maximum average similarity as text characterization information of the text information, and identify the text category corresponding to each text information;

a calculation module 350, configured to calculate, for each text category, tag characterization information of the text category based on the text characterization information of each text information corresponding to the text category, and calculate the distribution probability between each text characterization information and the tag characterization information;

a screening module 360, configured to screen the text characterization information with the maximum distribution probability as the keyword definition information of the text category.

Optionally, the acquisition module 310 is specifically configured to:

Optionally, the analysis module 320 is specifically configured to:

Optionally, the generation module 330 is specifically configured to:

Optionally, the identification module 340 is specifically configured to:

Optionally, the calculation module 350 is specifically configured to:

Optionally, the apparatus further includes:

The modules in the above keyword definition device may be implemented in whole or in part by software, hardware, or a combination thereof. The modules may be embedded in, or independent of, a processor in the computer device in hardware form, or may be stored in a memory in the computer device in software form, so that the processor can call them and execute the operations corresponding to each module.

In one embodiment, a computer device is provided, which may be a terminal, and the internal structure of which may be as shown in fig. 4. The computer device includes a processor, a memory, a communication interface, a display screen, and an input device connected by a system bus. The processor of the computer device provides computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for running the operating system and the computer program stored in the non-volatile storage medium. The communication interface of the computer device is used for wired or wireless communication with an external terminal; the wireless mode may be implemented through Wi-Fi, a mobile cellular network, NFC (near field communication), or other technologies. The computer program, when executed by the processor, implements a keyword definition method. The display screen of the computer device may be a liquid crystal display screen or an electronic ink display screen. The input device of the computer device may be a touch layer covering the display screen, a key, trackball, or touch pad arranged on the housing of the computer device, or an external keyboard, touch pad, mouse, or the like.

Those skilled in the art will appreciate that the structure shown in fig. 4 is merely a block diagram of the part of the structure related to the present solution and does not limit the computer device to which the present solution is applied; a particular computer device may include more or fewer components than shown, may combine some of the components, or may have a different arrangement of components.

In one embodiment, a computer device is provided comprising a memory having a computer program stored therein and a processor that when executing the computer program

In an embodiment, a computer-readable storage medium is provided, on which a computer program is stored which, when executed by a processor, implements the steps of the method of any of the first aspects.

In an embodiment, a computer program product is provided, comprising a computer program which, when executed by a processor, implements the steps of the method of any of the first aspects.

It should be noted that, user information (including but not limited to user equipment information, user personal information, etc.) and data (including but not limited to data for analysis, stored data, presented data, etc.) referred to in the present application are information and data authorized by the user or sufficiently authorized by each party.

Those skilled in the art will appreciate that all or part of the methods described above may be implemented by a computer program stored on a non-transitory computer-readable storage medium, and that the computer program, when executed, may include the steps of the method embodiments described above. Any reference to memory, database, or other medium used in the various embodiments provided herein may include at least one of non-volatile and volatile memory. The non-volatile memory may include read-only memory (Read-Only Memory, ROM), magnetic tape, floppy disk, flash memory, optical memory, high-density embedded non-volatile memory, resistive random access memory (ReRAM), magnetoresistive random access memory (Magnetoresistive Random Access Memory, MRAM), ferroelectric random access memory (Ferroelectric Random Access Memory, FRAM), phase change memory (Phase Change Memory, PCM), graphene memory, and the like. Volatile memory may include random access memory (Random Access Memory, RAM), external cache memory, and the like. By way of illustration and not limitation, RAM is available in a variety of forms, such as static random access memory (Static Random Access Memory, SRAM) or dynamic random access memory (Dynamic Random Access Memory, DRAM). The databases referred to in the various embodiments provided herein may include at least one of relational databases and non-relational databases. The non-relational database may include, but is not limited to, a blockchain-based distributed database, and the like. The processors referred to in the embodiments provided herein may be, without limitation, general-purpose processors, central processing units, graphics processors, digital signal processors, programmable logic units, quantum-computing-based data processing logic units, and the like.

The technical features of the above embodiments may be combined arbitrarily. For brevity, not all possible combinations of the technical features in the above embodiments are described; however, as long as a combination of these technical features involves no contradiction, it should be considered to fall within the scope of this description.

The above embodiments represent only a few implementations of the present application. They are described specifically and in detail, but are not to be construed as limiting the scope of the present application. It should be noted that those skilled in the art may make various modifications and improvements without departing from the concept of the present application, all of which fall within the scope of protection of the present application. Accordingly, the scope of protection of the present application shall be subject to the appended claims.

Claims

1. A keyword definition method, the method comprising:

2. The method of claim 1, wherein extracting keywords of each text information comprises:

3. The method of claim 1, wherein analyzing the distribution probability of the keywords of each text information in the text information comprises:

4. The method of claim 1, wherein the calculating the average similarity between each feature vector and the other feature vectors comprises:

5. The method of claim 1, wherein identifying the text category corresponding to each text information comprises:

6. The method of claim 1, wherein separately calculating the distribution probability between each text characterization information and the tag characterization information comprises:

7. The method of claim 6, wherein, after screening the text characterization information with the maximum distribution probability as the keyword definition information of the text category, the method further comprises:

8. A keyword definition apparatus, the apparatus comprising:

9. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor implements the steps of the method of any of claims 1 to 7 when the computer program is executed.

10. A computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the steps of the method of any of claims 1 to 7.