CN117493493A - Keyword definition method, keyword definition device, computer equipment and storage medium - Google Patents

Publication number
CN117493493A
Authority
CN
China
Prior art keywords
text
information
category
characterization
similarity
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311615082.6A
Other languages
Chinese (zh)
Inventor
邓维
杨恺
杨念梓
贾唯秦
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Industrial and Commercial Bank of China Ltd ICBC
Original Assignee
Industrial and Commercial Bank of China Ltd ICBC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Industrial and Commercial Bank of China Ltd ICBC
Priority to CN202311615082.6A
Publication of CN117493493A

Abstract

The application relates to a keyword definition method, a keyword definition device, computer equipment and a storage medium, and belongs to the technical field of prompt learning and artificial intelligence. The method comprises the following steps: acquiring a plurality of pieces of text information requiring classification definition, and extracting keywords of each piece of text information; for each piece of text information, analyzing the distribution probability of its keywords and screening target keywords of the text information accordingly; for each piece of text information, generating feature vectors based on the target keywords and their distribution probabilities, and calculating the similarity between the feature vectors to determine text characterization information; for each text category, calculating tag characterization information of the text category; and screening the text characterization information with the maximum distribution probability with respect to the tag characterization information as the keyword definition information of the text category. The method can improve the accuracy of keyword definition for text classification in professional fields.

Description

Keyword definition method, keyword definition device, computer equipment and storage medium
Technical Field
The present application relates to the field of prompt learning and artificial intelligence technologies, and in particular, to a keyword definition method, apparatus, computer device, and storage medium.
Background
With a traditional prompt learning model, defining keywords (the verbalizer) for text classification in the general domain is relatively simple. Text classification of data in a professional field (such as the financial field), however, generally requires more domain knowledge, which makes defining keywords for text classification in such fields very difficult. How to improve the accuracy of keyword definition for text classification in professional fields is therefore a current research focus.
The traditional technical solution is to define the keywords for text classification in a professional field manually. This approach consumes a large amount of manpower, and the manually defined keywords may fail to summarize the characteristics of the data in a text category, so the accuracy of keyword definition for text classification in professional fields is low.
Disclosure of Invention
In view of the foregoing, it is desirable to provide a keyword definition method, apparatus, computer device, computer readable storage medium, and computer program product.
In a first aspect, the present application provides a keyword definition method. The method comprises the following steps:
acquiring a plurality of text information needing classification definition, and extracting keywords of each text information;
for each piece of text information, analyzing the distribution probability of each of its keywords within that text information, and screening target keywords of the text information based on the distribution probability;
for each piece of text information, generating a feature vector of the text information based on a target keyword of the text information and the distribution probability of each target keyword, and respectively calculating the average similarity between each feature vector and other feature vectors;
taking the feature vector corresponding to the maximum average similarity as text characterization information of the text information, and identifying the text category corresponding to each text information;
for each text category, calculating tag characterization information of the text category based on text characterization information of each text information corresponding to the text category, and respectively calculating distribution probability between each text characterization information and the tag characterization information;
and screening text characterization information with the maximum distribution probability as keyword definition information of the text category.
Optionally, the extracting the keyword of each text message includes:
dividing the text information according to the language segments contained in the text information aiming at each text information to obtain a plurality of text segments, and identifying semantic features of each text segment through a text feature identification network;
calculating the similarity between each word information of each text segment and the semantic feature of that text segment, and screening the word information corresponding to the maximum similarity in each text segment as a sub-keyword of the text segment;
and taking the sub-keywords of all the text segments as keywords of the text information.
Optionally, the analyzing the distribution probability of the keywords of each text message in the text message includes:
identifying, for each keyword, the number of the keywords in the text information and identifying the number of all word information of the text information;
and calculating the ratio of the number of the keywords to the number of all word information to obtain the distribution probability of the keywords in the text information.
Optionally, the calculating the average similarity between each feature vector and other feature vectors includes:
calculating, for every two feature vectors, the cosine similarity and the loss function between them, and calculating the similarity between the two feature vectors based on that cosine similarity and that loss function;
for each feature vector, calculating the average similarity between the feature vector and the other feature vectors based on the similarities between the feature vector and the other feature vectors.
Optionally, the identifying the text category corresponding to each text message includes:
acquiring sample text categories, and respectively extracting category characteristic information of each sample text category;
identifying text content corresponding to each category characteristic information, and respectively calculating similarity distances between target keywords of each text information and the text content corresponding to each category characteristic information;
and screening a sample text category to which the category characteristic information corresponding to the minimum similarity distance belongs as the text category of the text information aiming at each text information.
Optionally, the calculating the distribution probability between each text characterization information and the tag characterization information includes:
identifying, for each piece of text characterization information, its positioning information in the corresponding text information, and calculating a similarity score value between each piece of text characterization information and each piece of tag characterization information through a scoring function, based on the positioning information corresponding to each piece of text characterization information and on each piece of tag characterization information;
calculating, through a distribution probability algorithm, the distribution probability between each piece of text characterization information and the tag characterization information of the text category of its corresponding text information, based on the similarity score value between that text characterization information and that tag characterization information and on the similarity score values between that text characterization information and each piece of tag characterization information.
Optionally, after the text characterization information with the maximum distribution probability is used as the keyword definition information of the text category, the method further includes:
acquiring actual keyword definition information of each text category in response to a keyword definition information uploading operation of a user, and identifying keyword definition information of the text category and deviation information between the actual keyword definition information of the text category for each text category;
and calculating a proportion value between the deviation information and the keyword definition information, and adjusting a scoring parameter of the scoring function to the text category based on the deviation information under the condition that the proportion value is larger than a preset proportion threshold value to obtain a new scoring function.
In a second aspect, the present application further provides a keyword definition apparatus. The device comprises:
the acquisition module is used for acquiring a plurality of text information needing classification definition and extracting keywords of each text information;
the analysis module is used for analyzing the distribution probability of the keywords of each text message in the text message aiming at each text message, and screening target keywords of the text message based on the distribution probability;
the generation module is used for generating a feature vector of each text message based on the target keyword of the text message and the distribution probability of each target keyword, and respectively calculating the average similarity between each feature vector and other feature vectors;
the recognition module is used for taking the feature vector corresponding to the maximum average similarity as text characterization information of the text information and recognizing the text category corresponding to each text information;
the computing module is used for computing label characterization information of each text category based on the text characterization information of each text information corresponding to the text category and respectively computing the distribution probability between each text characterization information and the label characterization information;
the screening module is used for screening the text characterization information with the maximum distribution probability as the keyword definition information of the text category.
Optionally, the acquiring module is specifically configured to:
dividing the text information according to the language segments contained in the text information aiming at each text information to obtain a plurality of text segments, and identifying semantic features of each text segment through a text feature identification network;
calculating the similarity between each word information of each text segment and the semantic feature of each text segment, and screening the word information corresponding to the maximum similarity in each text segment as a sub-keyword of the text segment;
and taking the sub-keywords of all the text segments as keywords of the text information.
Optionally, the analysis module is specifically configured to:
identifying, for each keyword, the number of the keywords in the text information and identifying the number of all word information of the text information;
and calculating the ratio of the number of the keywords to the number of all word information to obtain the distribution probability of the keywords in the text information.
Optionally, the generating module is specifically configured to:
calculating, for every two feature vectors, the cosine similarity and the loss function between them, and calculating the similarity between the two feature vectors based on that cosine similarity and that loss function;
for each feature vector, calculating the average similarity between the feature vector and the other feature vectors based on the similarities between the feature vector and the other feature vectors.
Optionally, the identification module is specifically configured to:
acquiring sample text categories, and respectively extracting category characteristic information of each sample text category;
identifying text content corresponding to each category characteristic information, and respectively calculating similarity distances between target keywords of each text information and the text content corresponding to each category characteristic information;
and screening a sample text category to which the category characteristic information corresponding to the minimum similarity distance belongs as the text category of the text information aiming at each text information.
Optionally, the computing module is specifically configured to:
identifying, for each piece of text characterization information, its positioning information in the corresponding text information, and calculating a similarity score value between each piece of text characterization information and each piece of tag characterization information through a scoring function, based on the positioning information corresponding to each piece of text characterization information and on each piece of tag characterization information;
calculating, through a distribution probability algorithm, the distribution probability between each piece of text characterization information and the tag characterization information of the text category of its corresponding text information, based on the similarity score value between that text characterization information and that tag characterization information and on the similarity score values between that text characterization information and each piece of tag characterization information.
Optionally, the apparatus further includes:
the response module is used for responding to the keyword definition information uploading operation of the user, acquiring the actual keyword definition information of each text category, and identifying the keyword definition information of the text category and the deviation information between the actual keyword definition information of the text category for each text category;
and the adjustment module is used for calculating the ratio value between the deviation information and the keyword definition information, and adjusting the scoring parameters of the scoring function on the text category based on the deviation information under the condition that the ratio value is larger than a preset ratio threshold value to obtain a new scoring function.
In a third aspect, the present application provides a computer device. The computer device comprises a memory storing a computer program and a processor, and the processor implements the steps of the method of any implementation of the first aspect when executing the computer program.
In a fourth aspect, the present application provides a computer-readable storage medium. The computer-readable storage medium stores a computer program which, when executed by a processor, implements the steps of the method of any implementation of the first aspect.
In a fifth aspect, the present application provides a computer program product. The computer program product comprises a computer program which, when executed by a processor, implements the steps of the method of any implementation of the first aspect.
The keyword definition method, the keyword definition device, the computer equipment, the storage medium and the computer program product are used for acquiring a plurality of text information needing classification definition and extracting keywords of each text information; analyzing the distribution probability of keywords of each text message in the text message aiming at each text message, and screening target keywords of the text message based on the distribution probability; for each piece of text information, generating a feature vector of the text information based on a target keyword of the text information and the distribution probability of each target keyword, and respectively calculating the average similarity between each feature vector and other feature vectors; taking the feature vector corresponding to the maximum average similarity as text characterization information of the text information, and identifying the text category corresponding to each text information; for each text category, calculating tag characterization information of the text category based on text characterization information of each text information corresponding to the text category, and respectively calculating distribution probability between each text characterization information and the tag characterization information; and screening text characterization information with the maximum distribution probability as keyword definition information of the text category. 
In this scheme, target keywords of each piece of text information are screened, and the similarity between the resulting feature vectors is used to calculate the text characterization information of each piece of text information. The text category of each piece of text information is then identified, and tag characterization information is calculated from the pieces of text information belonging to the same text category. Finally, the text characterization information with the maximum distribution probability with respect to the tag characterization information is screened as the keyword definition information of the text category. Low-precision manual definition is thus replaced by similarity identification between characterization information. Because the keyword definition information is screened by semantically refining the tag characterization information corresponding to the text characterization information of each piece of text information, it summarizes the text features of the text information in the category more comprehensively, which improves the accuracy of keyword definition for text classification in professional fields.
Drawings
FIG. 1 is a flow chart of a keyword definition method in one embodiment;
FIG. 2 is a flow diagram of an example of keyword definition in one embodiment;
FIG. 3 is a block diagram of a keyword definition apparatus in one embodiment;
FIG. 4 is an internal structural diagram of a computer device in one embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application will be further described in detail with reference to the accompanying drawings and examples. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the present application.
The keyword definition method provided by the embodiments of the present application can be applied in a prompt learning application environment for text. The method can be applied to a terminal, to a server, or to a system comprising a terminal and a server, in which case it is implemented through interaction between the terminal and the server. The terminal may be, but is not limited to, a personal computer, notebook computer, smart phone, tablet computer, or the like. The terminal screens the target keywords of each piece of text information and uses the similarity between the resulting feature vectors to calculate the text characterization information of each piece of text information. It then identifies the text category of each piece of text information, calculates tag characterization information from the pieces of text information belonging to the same text category, and screens the text characterization information with the maximum distribution probability with respect to the tag characterization information as the keyword definition information of the text category. Low-precision manual definition is thus replaced by similarity identification between characterization information, and because the keyword definition information is screened by semantically refining the tag characterization information, it summarizes the text features of the text information in the category more comprehensively, which improves the accuracy of keyword definition for text classification in professional fields.
In one embodiment, as shown in fig. 1, a keyword definition method is provided, and the method is applied to a terminal for illustration, and includes the following steps:
Step S101, a plurality of text information needing classification definition is obtained, and keywords of each text information are extracted.
In this embodiment, the terminal obtains a plurality of pieces of text information requiring classification definition in response to an information uploading operation of a user. The pieces of text information differ in text type, where the text type reflects the department to which the text information belongs and the application function of the text information. The terminal then extracts the keywords of each piece of text information; the specific extraction process is described in detail later. A keyword is word information that represents the semantics of the text information and is important for understanding it. One piece of text information contains a plurality of pieces of word information.
Step S102, analyzing the distribution probability of the keywords of each text message in the text message according to each text message, and screening the target keywords of the text message based on the distribution probability.
In this embodiment, the terminal analyzes, for each text message, a distribution probability of keywords of each text message in the text message, and screens target keywords of the text message based on the distribution probability. Specifically, a distribution probability threshold value is preset by the terminal, and then, the terminal screens keywords corresponding to the distribution probability larger than the distribution probability threshold value from the keywords of each text message, and the keywords are used as target keywords of the text message.
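The threshold screening in step S102 can be sketched as follows. This is a minimal illustration, not the applicant's implementation: the function name `select_target_keywords`, the example keywords, and the threshold value 0.03 are all assumptions chosen for the example.

```python
def select_target_keywords(keyword_probs, threshold):
    """Keep the keywords whose distribution probability exceeds the threshold.

    keyword_probs: dict mapping keyword -> distribution probability in one text.
    threshold: preset distribution probability threshold (value is hypothetical).
    """
    return {kw: p for kw, p in keyword_probs.items() if p > threshold}


# Hypothetical keywords and probabilities for a single piece of text information.
probs = {"loan": 0.08, "interest": 0.05, "branch": 0.01}
targets = select_target_keywords(probs, threshold=0.03)
```

Here `targets` retains "loan" and "interest" and drops "branch", whose probability is below the assumed threshold.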
Step S103, generating feature vectors of the text information based on the target keywords of the text information and the distribution probability of each target keyword for each text information, and calculating the average similarity between each feature vector and other feature vectors respectively.
In this embodiment, the terminal generates a feature vector of the text information based on the target keyword of the text information and the distribution probability of each target keyword for each text information. The formula corresponding to the feature vector of the text information is as follows:
In the above formula, the feature vector is a K-dimensional vector in which each dimension represents the importance of a target keyword (i.e., the distribution probability of that keyword); taken as a whole, it serves as a feature vector of the input X_i in the hidden space, where i is the index of the keyword.
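One way such a K-dimensional vector could be assembled is sketched below. It assumes a shared, ordered vocabulary of K candidate keywords (a hypothetical device introduced here so that each dimension refers to the same keyword across texts); the text above only states that each dimension holds the distribution probability of a target keyword.

```python
def build_feature_vector(target_probs, vocabulary):
    """Return a K-dimensional vector, K = len(vocabulary).

    Dimension d holds the distribution probability of vocabulary[d] in the
    text, or 0.0 when that word is not a target keyword of the text.
    The shared vocabulary is an assumption made for this sketch.
    """
    return [target_probs.get(term, 0.0) for term in vocabulary]


vocabulary = ["loan", "interest", "deposit"]  # hypothetical K = 3
vec = build_feature_vector({"loan": 0.08, "interest": 0.05}, vocabulary)
```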
Then, the terminal calculates the average similarity between each feature vector and other feature vectors, respectively. The specific average similarity calculation process will be described in detail later.
Step S104, the feature vector corresponding to the maximum average similarity is used as text characterization information of the text information, and the text category corresponding to each text information is identified.
In this embodiment, the terminal uses the feature vector corresponding to the maximum average similarity as text characterization information of the text information, and identifies the text category corresponding to each text information. The text categories are preset in the text information of the terminal, and are obtained by classifying the text information related to the financial field according to the service requirement by staff.
Step S105, for each text category, calculating tag characterization information of the text category based on text characterization information of each text information corresponding to the text category, and calculating distribution probability between each text characterization information and the tag characterization information respectively.
In this embodiment, the terminal calculates, for each text category, tag characterization information of the text category based on text characterization information of each text information corresponding to the text category.
The specific calculation formula for calculating the tag characterization information is as follows:
In the above formula, I_l is the number of all pieces of text information of text category l, C_l is the tag characterization information, the characterization term is the text characterization information of each piece of text information, i is the index of the target keyword in the text information, and K is the dimension of each target keyword in a single piece of text information.
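As an illustration only, the sketch below computes a tag characterization as the element-wise mean of the text characterization vectors of one category. The averaging rule is an assumption suggested by the role of I_l as a count of texts in the category; the function name `label_prototype` is likewise hypothetical.

```python
def label_prototype(text_vectors):
    """Element-wise mean of the text characterization vectors of one category.

    text_vectors: non-empty list of equal-length vectors, one per text.
    Averaging is an assumed aggregation, not a rule confirmed by the text.
    """
    if not text_vectors:
        raise ValueError("a category needs at least one text vector")
    count = len(text_vectors)          # plays the role of I_l
    dims = len(text_vectors[0])        # plays the role of K
    return [sum(v[d] for v in text_vectors) / count for d in range(dims)]


proto = label_prototype([[0.25, 0.0], [0.75, 0.5]])
```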
Then, the terminal calculates the distribution probability between each text characterization information and the tag characterization information respectively. The specific calculation process will be described in detail later.
Step S106, screening the text characterization information with the maximum distribution probability as the keyword definition information of the text category.
In this embodiment, the terminal screens the text characterization information with the maximum distribution probability as the keyword definition information of the text category.
Based on this scheme, the target keywords of each piece of text information are screened, and the similarity between the resulting feature vectors is used to calculate the text characterization information of each piece of text information. The text category of each piece of text information is then identified, tag characterization information is calculated from the pieces of text information belonging to the same text category, and the text characterization information with the maximum distribution probability with respect to the tag characterization information is screened as the keyword definition information of the text category. Low-precision manual definition is thus replaced by similarity identification between characterization information, and because the keyword definition information is screened by semantically refining the tag characterization information, it summarizes the text features of the text information in the category more comprehensively, which improves the accuracy of keyword definition for text classification in professional fields.
Optionally, extracting the keyword of each text message includes: dividing the text information according to the speech segments contained in the text information aiming at each text information to obtain a plurality of text segments, and identifying the semantic features of each text segment through a text feature identification network; calculating the similarity between each word information of each text segment and the semantic feature of each text segment, and screening the word information corresponding to the maximum similarity in each text segment as a sub-keyword of the text segment; and taking the sub-keywords of all the text segments as keywords of the text information.
In this embodiment, for each piece of text information, the terminal divides the text information into a plurality of text segments according to the passages it contains, and identifies the semantic features of each text segment through a text feature recognition network. The text feature recognition network is a fully convolutional neural network based on an attention mechanism. The terminal then calculates the similarity between each piece of word information in a text segment and the semantic features of that text segment. Specifically, the terminal calculates the similarity distance between each piece of word information and the semantic features through a similarity distance algorithm and sorts the word information by similarity distance to obtain a similarity distance sequence. For each piece of word information, the terminal then determines its position in the sequence and divides that position by the number of all pieces of word information to obtain the similarity between the word information and the semantic features. The similarity distance algorithm may be, but is not limited to, the Euclidean distance, the Mahalanobis distance, or the like.
The terminal screens word information corresponding to the maximum similarity in each text segment to serve as sub-keywords of the text segment. And finally, the terminal takes the sub-keywords of all the text segments as keywords of the text information.
Based on this scheme, the keywords are screened by identifying the semantic features of each text segment, which improves how accurately the screened keywords represent the meaning of the text information.
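The per-segment selection described above can be sketched as follows. The sketch swaps in plain cosine similarity for the rank-normalized similarity distance of the embodiment, and it takes word vectors and segment features as given inputs; in the embodiment these would come from the text feature recognition network. All names and toy vectors are assumptions.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def extract_keywords(segments, word_vectors, segment_features):
    """Pick, for each segment, the word most similar to the segment feature.

    segments: list of word lists, one per text segment.
    word_vectors: dict word -> vector (hypothetical precomputed embeddings).
    segment_features: one semantic feature vector per segment.
    """
    keywords = []
    for words, feature in zip(segments, segment_features):
        best = max(words, key=lambda w: cosine(word_vectors[w], feature))
        keywords.append(best)  # sub-keyword of this segment
    return keywords


word_vectors = {"the": [0.1, 0.1], "loan": [1.0, 0.0],
                "a": [0.1, 0.1], "deposit": [0.0, 1.0]}
segments = [["the", "loan"], ["a", "deposit"]]
features = [[0.9, 0.1], [0.1, 0.9]]
keywords = extract_keywords(segments, word_vectors, features)
```

The sub-keywords of all segments together form the keywords of the text, so `keywords` holds one entry per segment.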
Optionally, analyzing the distribution probability of the keywords of each text message in the text message includes: identifying, for each keyword, the number of keywords in the text information and identifying the number of all word information of the text information; and calculating the ratio of the number of the keywords to the number of all word information to obtain the distribution probability of the keywords in the text information.
In this embodiment, the terminal identifies the number of keywords in the text information for each keyword, and identifies the number of all word information of the text information. And then, the terminal calculates the ratio of the number of the keywords to the number of all word information to obtain the distribution probability of the keywords in the text information.
Based on the scheme, the distribution probability of the keywords in the text information is calculated by identifying the number of the keywords in the text information, so that the accuracy of the calculated distribution probability is improved.
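The ratio described above can be sketched as follows. The whitespace tokenization in the example is an assumption made for brevity; Chinese text would need a word segmenter, which is outside this sketch.

```python
from collections import Counter


def distribution_probabilities(words, keywords):
    """Ratio of each keyword's occurrence count to the total word count.

    words: all word information of one text, as a list of tokens.
    keywords: the keywords extracted from that text.
    """
    total = len(words)
    if total == 0:
        return {kw: 0.0 for kw in keywords}
    counts = Counter(words)
    return {kw: counts[kw] / total for kw in keywords}


# Hypothetical text, tokenized on whitespace for illustration only.
words = "the loan rate and the loan term".split()
probs = distribution_probabilities(words, ["loan", "rate"])
```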
Optionally, calculating the average similarity between each feature vector and other feature vectors includes: the cosine similarity between every two feature vectors and the loss function between every two feature vectors are calculated respectively, and the similarity between every two feature vectors is calculated based on the cosine similarity between every two feature vectors and the loss function between every two feature vectors; for each feature vector, an average similarity between each feature vector and other feature vectors is calculated based on the similarity between the feature vector and other feature vectors.
In this embodiment, the terminal calculates the cosine similarity between every two feature vectors and the loss function between every two feature vectors, and calculates the similarity between every two feature vectors based on the cosine similarity between every two feature vectors and the loss function between every two feature vectors.
The calculation formula of the cosine similarity algorithm is as follows:
$$\mathrm{sim}(v_i, v_j) = \frac{v_i \cdot v_j}{\lVert v_i \rVert \, \lVert v_j \rVert}$$
In the above formula, $v_i$ is the i-th feature vector in the text information and $v_j$ is the j-th feature vector in the text information, where i is not equal to j, and i and j are both sequence numbers of the feature vectors.
The loss function is defined based on the InfoNCE loss of contrastive learning, and the calculation formula of the loss function is as follows:
In the above formula, τ is a temperature coefficient used to adjust the shape of the similarity distribution: the larger τ is, the smoother the distribution, and the smaller τ is, the sharper the distribution. The remaining symbols denote the i-th feature vector and the j-th feature vector in the text information, where i is not equal to j, and i and j are both sequence numbers of the feature vectors.
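The effect of the temperature coefficient can be illustrated with a short sketch. It uses the commonly cited InfoNCE form (negative log of a temperature-scaled softmax probability), which is an assumption here and may differ from the exact formula of this application; `positive_index` and the sample similarities are hypothetical.

```python
import math

def softmax_with_temperature(similarities, tau):
    """Temperature-scaled softmax over a list of similarity values."""
    scaled = [s / tau for s in similarities]
    m = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    z = sum(exps)
    return [e / z for e in exps]

def info_nce(similarities, positive_index, tau):
    """Standard InfoNCE-style loss: -log of the positive pair's softmax probability."""
    probs = softmax_with_temperature(similarities, tau)
    return -math.log(probs[positive_index])

# Similarities of one feature vector to three others (illustrative values).
sims = [0.9, 0.2, -0.1]
sharp = softmax_with_temperature(sims, tau=0.1)   # small tau: peaked distribution
smooth = softmax_with_temperature(sims, tau=5.0)  # large tau: nearly uniform
```

A small τ concentrates almost all probability on the most similar pair, while a large τ spreads it almost evenly, which matches the description of τ above.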
Then, for each feature vector, the terminal sums the similarities between the feature vector and the other feature vectors and divides the sum by the number of all feature vectors to obtain the average similarity between the feature vector and the other feature vectors.
Based on the scheme, the similarity between every two feature vectors is obtained from both the cosine similarity and the loss function, so that the accuracy of the calculated similarity is improved.
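The averaging step and the selection of the feature vector with the maximum average similarity can be sketched as below. As a simplifying assumption, the pairwise similarity is plain cosine similarity; how the embodiment combines it with the loss function is not spelled out in the text, so that combination is left out. Function names are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def average_similarities(vectors):
    """Sum each vector's similarity to the others, divided by the number of all vectors."""
    n = len(vectors)
    averages = []
    for i, vi in enumerate(vectors):
        total = sum(cosine(vi, vj) for j, vj in enumerate(vectors) if j != i)
        averages.append(total / n)
    return averages

def representative_index(vectors):
    """Index of the feature vector with the maximum average similarity."""
    averages = average_similarities(vectors)
    return max(range(len(vectors)), key=averages.__getitem__)

vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
best = representative_index(vectors)
```

In this toy case the second vector is chosen, because it is close to the first vector and still has a small positive similarity to the third.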
Optionally, identifying the text category corresponding to each text message includes: acquiring sample text categories, and respectively extracting category characteristic information of each sample text category; identifying text content corresponding to each category of characteristic information, and respectively calculating similarity distances between the target keywords of each text information and the text content corresponding to each category of characteristic information; and screening the sample text category of the category characteristic information corresponding to the minimum similarity distance for each text information, and taking the sample text category as the text category of the text information.
In this embodiment, the terminal obtains the sample text category, and extracts the category feature information of each sample text category respectively. The method for extracting the category characteristic information comprises the steps of identifying text information of a sample text category, and then identifying the characteristic information of the text information of the sample text category through a text characteristic identification network to obtain the category characteristic information.
The terminal identifies the text content corresponding to each category characteristic information, and calculates the similarity distance between the target keywords of each text information and the text content corresponding to each category characteristic information. The similarity distance algorithm may be the Euclidean distance algorithm, the Mahalanobis distance algorithm, or the like.
And the terminal screens sample text categories of category characteristic information corresponding to the minimum similarity distance according to each text information, and the sample text categories are used as text categories of the text information.
Based on the scheme, after the category characteristic information of the sample text category is extracted, the similarity distance between each text information and the text category is identified, so that the text category corresponding to the text information is screened, and the accuracy of the text category of the identified text information is improved.
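The minimum-distance assignment can be sketched as follows, assuming that the target keywords of a text information and the text content of each category have already been turned into numeric vectors (that encoding step is not shown and the values are made up). The Euclidean distance is used; `assign_category` is a hypothetical name.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def assign_category(text_vector, category_vectors):
    """Return the sample text category with the minimum similarity distance."""
    return min(
        category_vectors,
        key=lambda name: euclidean(text_vector, category_vectors[name]),
    )

# Hypothetical category characteristic vectors and one text information vector.
categories = {"credit": [0.9, 0.1], "deposit": [0.1, 0.9]}
category = assign_category([0.8, 0.3], categories)
```

The text vector lies much closer to the "credit" vector, so that category is assigned.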
Optionally, calculating the distribution probability between each text characterization information and the tag characterization information respectively includes: identifying each piece of text characterization information, positioning information in the text information corresponding to each piece of text characterization information, and calculating a similarity grading value between each piece of text characterization information and each piece of tag characterization information through a grading function based on the positioning information corresponding to each piece of text characterization information and each piece of tag characterization information; calculating the distribution probability among the tag characterization information of each text characterization information and the text category of the text information corresponding to each text characterization information through a distribution probability algorithm based on the similarity score value among the tag characterization information of each text characterization information and the text category of the text information corresponding to each text characterization information and the similarity score value among the tag characterization information of each text characterization information and the tag characterization information.
In this embodiment, the terminal identifies each text characterization information, and location information in the text information corresponding to each text characterization information. The terminal calculates a similarity score value between each piece of text characterization information and each piece of label characterization information through a scoring function based on the position information corresponding to each piece of text characterization information and each piece of label characterization information. The calculation formula of the similarity score value between each piece of text characterization information and each piece of label characterization information obtained by the scoring function S is as follows:
In the above formula, T is the position information of each text characterization information, C_l is the tag characterization information, S is the scoring function, and l ranges over the set of all tag characterization information.
The terminal then calculates, through a distribution probability algorithm, the distribution probability between each text characterization information and the tag characterization information of the text category of its corresponding text information. The calculation is based on two similarity score values: the one between each text characterization information and the tag characterization information of the text category of its corresponding text information, and the one between each text characterization information and each tag characterization information. The calculation formula of the distribution probability is as follows:
In the above formula, L is the number of all text categories; the remaining two terms are, respectively, the similarity score value between each text characterization information and the tag characterization information of the text category of its corresponding text information, and the similarity score value between each text characterization information and each tag characterization information.
Based on the scheme, the distribution probability between each piece of text characterization information and each piece of label characterization information is calculated by calculating the similarity score value, so that the accuracy of the calculated distribution probability between each piece of text characterization information and each piece of label characterization information is improved.
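One plausible reading of this step is a softmax normalization: the score against the text's own category label is compared with the scores against all L labels. The sketch below assumes that form, which the text does not confirm, and takes the similarity score values as given numbers rather than computing them with the scoring function S. All names and values are illustrative.

```python
import math

def distribution_probability(scores, own_category):
    """Softmax share of the own-category score among the scores for all categories."""
    m = max(scores.values())  # subtract the maximum for numerical stability
    exps = {name: math.exp(s - m) for name, s in scores.items()}
    return exps[own_category] / sum(exps.values())

def select_definition(candidates):
    """Return the id of the text characterization with the maximum distribution probability."""
    best = max(candidates, key=lambda c: distribution_probability(c[1], c[2]))
    return best[0]

# (text characterization id, score per category label, own category) - toy values.
candidates = [
    ("t1", {"credit": 2.0, "deposit": 1.0}, "credit"),
    ("t2", {"credit": 3.0, "deposit": 0.0}, "credit"),
]
chosen = select_definition(candidates)
```

The candidate whose score separates its own category most clearly from the others receives the larger probability and is selected as the keyword definition information.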
Optionally, the text characterization information with the maximum distribution probability is screened and used as the keyword definition information of the text category, and then the method further comprises the following steps: responding to the keyword definition information uploading operation of a user, acquiring actual keyword definition information of each text category, and identifying the keyword definition information of the text category and deviation information between the actual keyword definition information of the text category aiming at each text category; calculating a proportion value between the deviation information and the keyword definition information, and adjusting a scoring parameter of the scoring function to the text category based on the deviation information under the condition that the proportion value is larger than a preset proportion threshold value to obtain a new scoring function.
In this embodiment, the terminal responds to the keyword definition information uploading operation of the user, and obtains the actual keyword definition information of each text category. The actual keyword definition information is obtained as follows: staff screen target text categories from all text categories and manually define each keyword of each target text category, and the terminal then receives the actual keyword definition information of each target text category uploaded by the staff.
The terminal identifies, for each text category, deviation information between the keyword definition information of the text category and the actual keyword definition information of the text category. Then, the terminal calculates a proportion value between the deviation information and the keyword definition information. The proportion value may be, but is not limited to, the ratio between the number of text fields corresponding to the deviation information and the number of text fields corresponding to all the keyword definition information. The terminal presets a proportion threshold, and, in the case that the proportion value is larger than the preset proportion threshold, adjusts the scoring parameter of the scoring function for the text category based on the deviation information to obtain a new scoring function. Specifically, the terminal identifies, in the training database of the scoring function, the ratio between the parameter change value corresponding to each scoring parameter of the scoring function and the number of deviating text fields, determines the scoring parameter change value corresponding to the deviation information based on the number of text fields of the deviation information, and adds the scoring parameter change value to the scoring parameter of the scoring function for the text category to obtain a new scoring parameter.
Based on the scheme, the scoring parameters are adjusted through the actual keyword definition information uploaded by the user, so that the accuracy of the similarity score value calculated by the scoring function between each text characterization information and each tag characterization information is improved.
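The feedback adjustment can be sketched under simple assumptions: keyword definition information is represented as a list of text fields, a field counts as deviating when it does not appear in the actual definition, and the per-field parameter change (`change_per_field`) has already been derived from the training database of the scoring function. These names and the sample values are hypothetical.

```python
def adjust_scoring_parameter(param, predicted_fields, actual_fields,
                             change_per_field, threshold):
    """Return (possibly adjusted scoring parameter, proportion value).

    proportion value = deviating fields / all predicted fields; the parameter
    is adjusted only when this value exceeds the preset threshold.
    """
    actual = set(actual_fields)
    deviating = [f for f in predicted_fields if f not in actual]
    ratio = len(deviating) / len(predicted_fields) if predicted_fields else 0.0
    if ratio > threshold:
        # Parameter change scales with the number of deviating text fields.
        return param + change_per_field * len(deviating), ratio
    return param, ratio

predicted = ["a", "b", "c", "d"]   # keyword definition produced by the method
actual = ["a", "b", "x", "y"]      # definition uploaded by staff
new_param, ratio = adjust_scoring_parameter(
    1.0, predicted, actual, change_per_field=0.05, threshold=0.3
)
```

Two of the four predicted fields deviate, so the proportion value is 0.5, the threshold of 0.3 is exceeded, and the parameter moves from 1.0 to about 1.1.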
The application also provides a keyword definition example, as shown in fig. 2, and the specific processing procedure includes the following steps:
Step S201, a plurality of text information that needs to be defined by classification is acquired.
Step S202, dividing the text information according to the language segments contained in the text information to obtain a plurality of text segments, and identifying the semantic features of each text segment through a text feature identification network.
Step S203, the similarity between the word information of each text segment and the semantic feature of each text segment is calculated, and the word information corresponding to the maximum similarity in each text segment is screened and used as the sub-keyword of the text segment.
Step S204, the sub-keywords of all text segments are used as keywords of the text information.
Step S205, for each text information, identifies the number of keywords in the text information for each keyword, and identifies the number of all word information of the text information.
Step S206, calculating the ratio of the number of the keywords to the number of all word information to obtain the distribution probability of the keywords in the text information.
Step S207, screening target keywords of the text information based on the distribution probability.
Step S208, for each text message, generates a feature vector of the text message based on the target keyword of the text message and the distribution probability of each target keyword.
Step S209, the cosine similarity between every two feature vectors and the loss function between every two feature vectors are calculated, and the similarity between every two feature vectors is calculated based on the cosine similarity between every two feature vectors and the loss function between every two feature vectors.
Step S210, for each feature vector, calculating an average similarity between each feature vector and other feature vectors based on the similarity between the feature vector and other feature vectors.
Step S211, using the feature vector corresponding to the maximum average similarity as text characterization information of the text information.
Step S212, sample text categories are obtained, and category characteristic information of each sample text category is extracted respectively.
Step S213, identifying the text content corresponding to each category characteristic information, and respectively calculating the similarity distance between the target keyword of each text information and the text content corresponding to each category characteristic information.
Step S214, for each text information, screening the sample text category of the category characteristic information corresponding to the minimum similarity distance as the text category of the text information.
Step S215, for each text category, calculating tag characterization information of the text category based on the text characterization information of each text information corresponding to the text category.
Step S216, identifying each piece of text characterization information, and calculating a similarity score value between each piece of text characterization information and each piece of tag characterization information through a scoring function based on the position information corresponding to each piece of text characterization information and each piece of tag characterization information.
Step S217, calculating, by a distribution probability algorithm, a distribution probability between each text characterization information and tag characterization information of the text category of the text information corresponding to each text characterization information based on the similarity score value between each text characterization information and tag characterization information of the text category of the text information corresponding to each text characterization information and the similarity score value between each text characterization information and each tag characterization information.
Step S218, screening the text characterization information with the maximum distribution probability as the keyword definition information of the text category.
It should be understood that, although the steps in the flowcharts related to the embodiments described above are shown sequentially as indicated by the arrows, these steps are not necessarily performed in the order indicated by the arrows. Unless explicitly stated herein, the execution of the steps is not strictly limited to that order, and the steps may be executed in other orders. Moreover, at least some of the steps in the flowcharts described in the above embodiments may include a plurality of sub-steps or stages, which are not necessarily performed at the same time but may be performed at different times; these sub-steps or stages are not necessarily performed sequentially, and may be performed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.
Based on the same inventive concept, the embodiment of the application also provides a keyword definition device for realizing the above related keyword definition method. The implementation of the solution provided by the device is similar to the implementation described in the above method, so the specific limitation in the embodiments of the one or more keyword definition devices provided below may refer to the limitation of the keyword definition method described above, and will not be repeated here.
In one embodiment, as shown in fig. 3, there is provided a keyword definition apparatus, including: acquisition module 310, analysis module 320, generation module 330, identification module 340, calculation module 350, and screening module 360, wherein:
an obtaining module 310, configured to obtain a plurality of text information that needs to be defined by classification, and extract keywords of each text information;
an analysis module 320, configured to analyze, for each text message, a distribution probability of a keyword of each text message in the text message, and screen a target keyword of the text message based on the distribution probability;
a generating module 330, configured to generate, for each piece of text information, a feature vector of the text information based on a target keyword of the text information and a distribution probability of each target keyword, and calculate an average similarity between each feature vector and other feature vectors;
the identifying module 340 is configured to use the feature vector corresponding to the maximum average similarity as text characterization information of the text information, and identify a text category corresponding to each text information;
a calculating module 350, configured to calculate, for each text category, tag characterization information of the text category based on text characterization information of each text information corresponding to the text category, and calculate a distribution probability between each text characterization information and the tag characterization information;
a screening module 360, configured to screen text characterization information with the maximum distribution probability as keyword definition information of the text category.
Optionally, the acquiring module 310 is specifically configured to:
dividing the text information according to the language segments contained in the text information aiming at each text information to obtain a plurality of text segments, and identifying semantic features of each text segment through a text feature identification network;
calculating the similarity between each word information of each text segment and the semantic feature of each text segment, and screening the word information corresponding to the maximum similarity in each text segment as a sub-keyword of the text segment;
and taking the sub-keywords of all the text segments as keywords of the text information.
Optionally, the analysis module 320 is specifically configured to:
identifying, for each keyword, the number of the keywords in the text information and identifying the number of all word information of the text information;
and calculating the ratio of the number of the keywords to the number of all word information to obtain the distribution probability of the keywords in the text information.
Optionally, the generating module 330 is specifically configured to:
The cosine similarity between every two feature vectors and the loss function between every two feature vectors are calculated respectively, and the similarity between every two feature vectors is calculated based on the cosine similarity between every two feature vectors and the loss function between every two feature vectors;
for each feature vector, an average similarity between each feature vector and other feature vectors is calculated based on the similarity between the feature vector and other feature vectors.
Optionally, the identifying module 340 is specifically configured to:
acquiring sample text categories, and respectively extracting category characteristic information of each sample text category;
identifying text content corresponding to each category characteristic information, and respectively calculating similarity distances between target keywords of each text information and the text content corresponding to each category characteristic information;
and screening a sample text category to which the category characteristic information corresponding to the minimum similarity distance belongs as the text category of the text information aiming at each text information.
Optionally, the computing module 350 is specifically configured to:
identifying each piece of text characterization information, positioning information in the text information corresponding to each piece of text characterization information, and calculating a similarity grading value between each piece of text characterization information and each piece of tag characterization information through a grading function based on the positioning information corresponding to each piece of text characterization information and each piece of tag characterization information;
Calculating distribution probability among the tag characterization information of each text characterization information and the text category of the text information corresponding to each text characterization information through a distribution probability algorithm based on the similarity score value among the tag characterization information of each text characterization information and the text category of the text information corresponding to each text characterization information and the similarity score value among each text characterization information and each tag characterization information.
Optionally, the apparatus further includes:
the response module is used for responding to the keyword definition information uploading operation of the user, acquiring the actual keyword definition information of each text category, and identifying the keyword definition information of the text category and the deviation information between the actual keyword definition information of the text category for each text category;
and the adjustment module is used for calculating the ratio value between the deviation information and the keyword definition information, and adjusting the scoring parameters of the scoring function on the text category based on the deviation information under the condition that the ratio value is larger than a preset ratio threshold value to obtain a new scoring function.
The respective modules in the above-described keyword definition device may be implemented in whole or in part by software, hardware, or a combination thereof. The above modules may be embedded in, or independent of, a processor in the computer device in hardware form, or may be stored in a memory of the computer device in software form, so that the processor can call and execute the operations corresponding to the above modules.
In one embodiment, a computer device is provided, which may be a terminal, and the internal structure of which may be as shown in fig. 4. The computer device includes a processor, a memory, a communication interface, a display screen, and an input device connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for the operation of the operating system and computer programs in the non-volatile storage media. The communication interface of the computer device is used for carrying out wired or wireless communication with an external terminal, and the wireless mode can be realized through WIFI, a mobile cellular network, NFC (near field communication) or other technologies. The computer program is executed by a processor to implement a keyword definition method. The display screen of the computer equipment can be a liquid crystal display screen or an electronic ink display screen, and the input device of the computer equipment can be a touch layer covered on the display screen, can also be keys, a track ball or a touch pad arranged on the shell of the computer equipment, and can also be an external keyboard, a touch pad or a mouse and the like.
Those skilled in the art will appreciate that the structure shown in FIG. 4 is only a block diagram of the part of the structure related to the present solution and does not constitute a limitation on the computer device to which the present solution is applied; a particular computer device may include more or fewer components than those shown, may combine some of the components, or may have a different arrangement of components.
In one embodiment, a computer device is provided, comprising a memory and a processor, wherein the memory stores a computer program and the processor, when executing the computer program, implements the steps of the method of any of the first aspects.
In an embodiment, a computer-readable storage medium is provided, on which a computer program is stored which, when executed by a processor, implements the steps of the method of any of the first aspects.
In an embodiment, a computer program product is provided, comprising a computer program which, when executed by a processor, implements the steps of the method of any of the first aspects.
It should be noted that, user information (including but not limited to user equipment information, user personal information, etc.) and data (including but not limited to data for analysis, stored data, presented data, etc.) referred to in the present application are information and data authorized by the user or sufficiently authorized by each party.
Those skilled in the art will appreciate that implementing all or part of the above described methods may be accomplished by way of a computer program stored on a non-transitory computer readable storage medium, which when executed, may comprise the steps of the embodiments of the methods described above. Any reference to memory, database, or other medium used in the various embodiments provided herein may include at least one of non-volatile and volatile memory. The nonvolatile Memory may include Read-Only Memory (ROM), magnetic tape, floppy disk, flash Memory, optical Memory, high density embedded nonvolatile Memory, resistive random access Memory (ReRAM), magnetic random access Memory (Magnetoresistive Random Access Memory, MRAM), ferroelectric Memory (Ferroelectric Random Access Memory, FRAM), phase change Memory (Phase Change Memory, PCM), graphene Memory, and the like. Volatile memory can include random access memory (Random Access Memory, RAM) or external cache memory, and the like. By way of illustration, and not limitation, RAM can be in the form of a variety of forms, such as static random access memory (Static Random Access Memory, SRAM) or dynamic random access memory (Dynamic Random Access Memory, DRAM), and the like. The databases referred to in the various embodiments provided herein may include at least one of relational databases and non-relational databases. The non-relational database may include, but is not limited to, a blockchain-based distributed database, and the like. The processors referred to in the embodiments provided herein may be general purpose processors, central processing units, graphics processors, digital signal processors, programmable logic units, quantum computing-based data processing logic units, etc., without being limited thereto.
The technical features of the above embodiments may be combined arbitrarily. For brevity of description, not all possible combinations of the technical features in the above embodiments are described; however, as long as there is no contradiction in a combination of these technical features, it should be considered to be within the scope of this description.
The above examples represent only a few embodiments of the present application. They are described in relative detail, but are not to be construed as limiting the scope of the present application. It should be noted that those skilled in the art could make various modifications and improvements without departing from the spirit of the present application, all of which fall within the scope of protection of the present application. Accordingly, the scope of protection of the present application shall be subject to the appended claims.

Claims (10)

1. A keyword definition method, the method comprising:
acquiring a plurality of text information needing classification definition, and extracting keywords of each text information;
analyzing the distribution probability of keywords of each text message in the text message aiming at each text message, and screening target keywords of the text message based on the distribution probability;
For each piece of text information, generating a feature vector of the text information based on a target keyword of the text information and the distribution probability of each target keyword, and respectively calculating the average similarity between each feature vector and other feature vectors;
taking the feature vector corresponding to the maximum average similarity as text characterization information of the text information, and identifying the text category corresponding to each text information;
for each text category, calculating tag characterization information of the text category based on text characterization information of each text information corresponding to the text category, and respectively calculating distribution probability between each text characterization information and the tag characterization information;
and screening text characterization information with the maximum distribution probability as keyword definition information of the text category.
2. The method of claim 1, wherein extracting keywords for each text message comprises:
dividing the text information according to the language segments contained in the text information aiming at each text information to obtain a plurality of text segments, and identifying semantic features of each text segment through a text feature identification network;
Calculating the similarity between each word information of each text segment and the semantic feature of each text segment, and screening the word information corresponding to the maximum similarity in each text segment as a sub-keyword of the text segment;
and taking the sub-keywords of all the text segments as keywords of the text information.
3. The method of claim 1, wherein analyzing the probability of distribution of keywords for each text message in the text message comprises:
identifying, for each keyword, the number of the keywords in the text information and identifying the number of all word information of the text information;
and calculating the ratio of the number of the keywords to the number of all word information to obtain the distribution probability of the keywords in the text information.
4. The method of claim 1, wherein calculating the average similarity between each feature vector and the other feature vectors comprises:
calculating the cosine similarity between every two feature vectors and the loss function between every two feature vectors, and calculating the similarity between every two feature vectors based on that cosine similarity and that loss function;
and for each feature vector, calculating the average similarity between the feature vector and the other feature vectors based on the similarity between the feature vector and each of the other feature vectors.
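A minimal sketch of the pairwise and average similarity in claim 4 follows. The claim names neither the loss function nor how it is combined with the cosine similarity, so two assumptions are made here: the loss is the mean squared error, and the combined similarity is the cosine similarity minus `alpha` times the loss. The names `mse_loss`, `pair_similarity`, `average_similarities` and the parameter `alpha` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def mse_loss(u, v):
    """Assumed loss function: mean squared error between the two vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v)) / len(u)


def pair_similarity(u, v, alpha=0.5):
    """Assumed combination: cosine similarity penalised by alpha * loss."""
    return cosine(u, v) - alpha * mse_loss(u, v)


def average_similarities(vectors, alpha=0.5):
    """Average similarity of each vector to all the other vectors."""
    n = len(vectors)
    if n < 2:
        return [0.0] * n
    return [
        sum(pair_similarity(vectors[i], vectors[j], alpha)
            for j in range(n) if j != i) / (n - 1)
        for i in range(n)
    ]


vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
averages = average_similarities(vectors)
# Index of the vector with the maximum average similarity.
best_index = max(range(len(vectors)), key=averages.__getitem__)
```

In the method of claim 1, the vector at `best_index` would be the one taken as text characterization information.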
5. The method of claim 1, wherein identifying the text category corresponding to each text information comprises:
acquiring sample text categories, and respectively extracting category characteristic information of each sample text category;
identifying text content corresponding to each category characteristic information, and respectively calculating similarity distances between the target keywords of each text information and the text content corresponding to each category characteristic information;
and, for each text information, selecting the sample text category to which the category characteristic information corresponding to the minimum similarity distance belongs as the text category of the text information.
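The minimum-distance selection in claim 5 can be sketched as below. The claim does not define the similarity distance, so the Jaccard distance between word sets is used purely as an assumption; `CATEGORY_CONTENTS`, `jaccard_distance` and `identify_category` are hypothetical names and the category contents are invented sample data.

```python
def jaccard_distance(a, b):
    """Assumed similarity distance: 1 - |A ∩ B| / |A ∪ B| over word sets."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def identify_category(target_keywords, category_contents):
    """Return the sample category whose text content is closest to the keywords."""
    return min(
        category_contents,
        key=lambda name: jaccard_distance(target_keywords, category_contents[name]),
    )


# Invented text content for two sample text categories.
CATEGORY_CONTENTS = {
    "credit": ["loan", "rate", "repayment"],
    "card": ["card", "fee", "limit"],
}

category = identify_category(["loan", "rate", "fee"], CATEGORY_CONTENTS)
```

The keywords share two of four distinct words with "credit" (distance 0.5) and one of five with "card" (distance 0.8), so "credit" is selected.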
6. The method of claim 1, wherein separately calculating the distribution probability between each text characterization information and the tag characterization information comprises:
identifying positioning information of each text characterization information in its corresponding text information, and calculating, through a scoring function, a similarity score value between each text characterization information and each tag characterization information based on the positioning information corresponding to each text characterization information and on each tag characterization information;
and calculating, through a distribution probability algorithm, the distribution probability between each text characterization information and the tag characterization information of the text category of its corresponding text information, based on the similarity score value between that text characterization information and that tag characterization information and on the similarity score values between each text characterization information and each tag characterization information.
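One possible reading of claim 6 is a softmax over similarity scores: the score of a text characterization against the tag characterization of its own category is normalised by its scores against all tag characterizations. The sketch below follows that reading as an assumption only; the claim specifies neither the scoring function nor the distribution probability algorithm. The dot-product score, the scalar `position_weight` standing in for the positioning information, and all names are hypothetical.

```python
import math


def score(text_vec, label_vec, position_weight=1.0):
    """Assumed scoring function: position-weighted dot product."""
    return position_weight * sum(a * b for a, b in zip(text_vec, label_vec))


def distribution_probability(text_vec, own_label, label_vecs, position_weight=1.0):
    """Softmax of the own-category score over the scores against every label."""
    scores = {
        name: score(text_vec, vec, position_weight)
        for name, vec in label_vecs.items()
    }
    top = max(scores.values())  # subtract the maximum for numerical stability
    exps = {name: math.exp(s - top) for name, s in scores.items()}
    return exps[own_label] / sum(exps.values())


# Invented tag characterization vectors for two text categories.
LABEL_VECS = {"credit": [1.0, 0.0], "card": [0.0, 1.0]}

p_strong = distribution_probability([2.0, 0.0], "credit", LABEL_VECS)
p_neutral = distribution_probability([1.0, 1.0], "credit", LABEL_VECS)
```

Under this reading, the text characterization with the largest probability within a category would be the one kept as keyword definition information.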
7. The method of claim 6, wherein, after screening the text characterization information with the maximum distribution probability as the keyword definition information of the text category, the method further comprises:
acquiring actual keyword definition information of each text category in response to a keyword definition information uploading operation of a user, and, for each text category, identifying deviation information between the keyword definition information of the text category and the actual keyword definition information of the text category;
and calculating a proportion value between the deviation information and the keyword definition information, and, when the proportion value is greater than a preset proportion threshold, adjusting a scoring parameter of the scoring function for the text category based on the deviation information to obtain a new scoring function.
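The feedback step of claim 7 can be sketched as a threshold check. Everything concrete below is an assumption, since the claim leaves the deviation measure and the adjustment rule open: deviation is taken as the predicted definition words absent from the user-supplied definition, and the scoring parameter is scaled by one minus the proportion value. The names `deviation_ratio`, `adjust_scoring_parameter` and the default threshold of 0.3 are hypothetical.

```python
def deviation_ratio(predicted, actual):
    """Share of predicted definition words that are missing from the actual ones."""
    predicted = list(predicted)
    if not predicted:
        return 0.0
    actual_set = set(actual)
    deviating = [word for word in predicted if word not in actual_set]
    return len(deviating) / len(predicted)


def adjust_scoring_parameter(weight, predicted, actual, threshold=0.3):
    """Assumed rule: shrink the weight only when the ratio exceeds the threshold."""
    ratio = deviation_ratio(predicted, actual)
    if ratio > threshold:
        return weight * (1.0 - ratio)
    return weight


predicted = ["loan", "rate", "fee", "term"]
actual = ["loan", "rate", "limit"]

new_weight = adjust_scoring_parameter(1.0, predicted, actual)
unchanged = adjust_scoring_parameter(1.0, predicted, actual, threshold=0.6)
```

Two of the four predicted words deviate, giving a proportion value of 0.5; the parameter is adjusted under the 0.3 threshold and left alone under the 0.6 threshold.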
8. A keyword definition apparatus, the apparatus comprising:
the acquisition module is used for acquiring a plurality of text information needing classification definition and extracting keywords of each text information;
the analysis module is used for analyzing, for each text information, the distribution probability of the keywords of the text information in the text information, and screening target keywords of the text information based on the distribution probability;
the generation module is used for generating a feature vector of each text information based on the target keywords of the text information and the distribution probability of each target keyword, and respectively calculating the average similarity between each feature vector and the other feature vectors;
the recognition module is used for taking the feature vector corresponding to the maximum average similarity as text characterization information of the text information and recognizing the text category corresponding to each text information;
the computing module is used for computing label characterization information of each text category based on the text characterization information of each text information corresponding to the text category and respectively computing the distribution probability between each text characterization information and the label characterization information;
and the screening module is used for screening the text characterization information with the maximum distribution probability as the keyword definition information of the text category.
9. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor implements the steps of the method of any of claims 1 to 7 when the computer program is executed.
10. A computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the steps of the method of any of claims 1 to 7.
CN202311615082.6A 2023-11-29 2023-11-29 Keyword definition method, keyword definition device, computer equipment and storage medium Pending CN117493493A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311615082.6A CN117493493A (en) 2023-11-29 2023-11-29 Keyword definition method, keyword definition device, computer equipment and storage medium

Publications (1)

Publication Number Publication Date
CN117493493A (en) 2024-02-02

Family

ID=89680074

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311615082.6A Pending CN117493493A (en) 2023-11-29 2023-11-29 Keyword definition method, keyword definition device, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN117493493A (en)

Similar Documents

Publication Publication Date Title
CN110347835B (en) Text clustering method, electronic device and storage medium
CN108334574B (en) Cross-modal retrieval method based on collaborative matrix decomposition
Dey Sarkar et al. A novel feature selection technique for text classification using Naive Bayes
CN112528025A (en) Text clustering method, device and equipment based on density and storage medium
CN112819023B (en) Sample set acquisition method, device, computer equipment and storage medium
CN113722438B (en) Sentence vector generation method and device based on sentence vector model and computer equipment
CN108959305A (en) A kind of event extraction method and system based on internet big data
CN110880006A (en) User classification method and device, computer equipment and storage medium
CN114841161A (en) Event element extraction method, device, equipment, storage medium and program product
CN112131261A (en) Community query method and device based on community network and computer equipment
CN116108836B (en) Text emotion recognition method and device, computer equipment and readable storage medium
CN117435685A (en) Document retrieval method, document retrieval device, computer equipment, storage medium and product
CN115409111A (en) Training method of named entity recognition model and named entity recognition method
Liu et al. Margin-based two-stage supervised hashing for image retrieval
Chu et al. Social-guided representation learning for images via deep heterogeneous hypergraph embedding
CN115129890A (en) Feedback data map generation method and generation device, question answering device and refrigerator
CN117493493A (en) Keyword definition method, keyword definition device, computer equipment and storage medium
CN113779248A (en) Data classification model training method, data processing method and storage medium
CN114328894A (en) Document processing method, document processing device, electronic equipment and medium
Mishra et al. Histogram of oriented gradients-based digit classification using naive Bayesian classifier
CN112434136B (en) Sex classification method, apparatus, electronic device and computer storage medium
CN116304755A (en) Streaming news clustering method and device and computer equipment
CN117612181A (en) Image recognition method, device, computer equipment and storage medium
CN114548242A (en) User tag identification method, device, electronic equipment and computer readable storage medium
CN116860972A (en) Interactive information classification method, device, apparatus, storage medium and program product

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination