CN113971407B - Semantic feature extraction method and computer-readable storage medium - Google Patents

Semantic feature extraction method and computer-readable storage medium

Info

Publication number
CN113971407B
Authority
CN
China
Prior art keywords
word
words
feature
text data
category
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202111589256.7A
Other languages
Chinese (zh)
Other versions
CN113971407A (en
Inventor
刘国清
杨广
王启程
郑伟
杜佩佩
杨国武
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Youjia Innovation Technology Co ltd
Original Assignee
Shenzhen Minieye Innovation Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Minieye Innovation Technology Co Ltd filed Critical Shenzhen Minieye Innovation Technology Co Ltd
Priority to CN202111589256.7A priority Critical patent/CN113971407B/en
Publication of CN113971407A publication Critical patent/CN113971407A/en
Application granted granted Critical
Publication of CN113971407B publication Critical patent/CN113971407B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/216 Parsing using statistical methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/237 Lexical tools

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Probability & Statistics with Applications (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a semantic feature extraction method, which comprises the following steps: acquiring text data of a plurality of categories, wherein the text data of each category comprises a plurality of words; calculating the relevance of each word and each category in all text data; selecting partial words from the text data as candidate words according to the relevance; calculating mutual information between preset characteristic words and candidate words in a preset characteristic word bank, wherein the preset characteristic words in the preset characteristic word bank are used for describing the categories of the text data; selecting partial candidate words according to the mutual information and adding the partial candidate words into a preset characteristic word bank to form a category semantic word bank; performing mask processing on the text data according to the category semantic word bank to obtain a mask text; training a BERT model according to the mask text to obtain a semantic feature extraction model; and inputting the text data of each category into a semantic feature extraction model to obtain a corresponding semantic feature vector. The technical scheme of the invention is used for extracting semantic feature vectors of various categories.

Description

Semantic feature extraction method and computer-readable storage medium
Technical Field
The present invention relates to the field of computer technologies, and in particular, to a semantic feature extraction method and a computer-readable storage medium.
Background
In zero-shot learning (zero sample learning), a model may encounter images of classes that it never saw during training, i.e., images of unseen classes. Zero-shot learning uses the semantic description information of the known classes seen in training and of the unseen classes to be predicted as a bridge for recognizing unseen classes, so that the model, even though it has never seen an image of an unseen class, can recognize such an image by combining the semantic description information of the unseen class with the image features learned from the known classes.
Therefore, how to obtain semantic description information of a category is an urgent problem to be solved.
Disclosure of Invention
The invention provides a semantic feature extraction method and a computer-readable storage medium, which are used for extracting semantic feature vectors of various categories.
In a first aspect, an embodiment of the present invention provides a semantic feature extraction method, where the semantic feature extraction method includes:
acquiring text data of a plurality of categories, wherein the text data of each category comprises a plurality of words;
calculating the relevance of each word in all the text data and each category;
selecting partial words from the text data as candidate words according to the relevance;
calculating mutual information between preset characteristic words in a preset characteristic word bank and the candidate words, wherein the preset characteristic words in the preset characteristic word bank are used for describing the types of the text data;
selecting partial candidate words according to the mutual information and adding the partial candidate words into the preset feature word library to form a category semantic word library;
performing mask processing on the text data according to the category semantic word bank to obtain a mask text;
training a BERT model according to the mask text to obtain a semantic feature extraction model; and
inputting the text data of each category into the semantic feature extraction model to obtain a corresponding semantic feature vector.
In a second aspect, embodiments of the present invention provide a computer-readable storage medium for storing program instructions executable by a processor to implement a semantic feature extraction method as described above.
According to the semantic feature extraction method and the computer-readable storage medium, some words are selected from the text data as candidate words according to the relevance between each word and each category in the text data of the plurality of categories, and some of the candidate words are then selected, according to the mutual information between the candidate words and the preset feature words in the preset feature word bank, and added to the preset feature word bank. This expands the preset feature word bank into the category semantic word bank, so that the category semantic word bank is rich in variety and wide in coverage. Mask processing is performed on the text data according to the category semantic word bank to obtain a mask text, and a BERT model is trained on the mask text, so that the BERT model attends to the words in the category semantic word bank when encoding the semantic representation of a text, and the encoded representation focuses on the feature description of the categories rather than on the semantics of ordinary text. The semantic feature vectors obtained by inputting the text data into the semantic feature extraction model can replace manual labeling, which greatly reduces the manual labeling workload and allows zero-shot learning (zero sample learning) to be trained on large unlabeled data sets. Meanwhile, because the feature description learned by the BERT model is implicit and draws on the rich semantic knowledge contained in the text data of the plurality of categories, the BERT model adapts well to different categories, and the domain shift problem caused by manual labeling is avoided.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the structures shown in the drawings without creative efforts.
Fig. 1 is a flowchart of a semantic feature extraction method according to an embodiment of the present invention.
Fig. 2 is a first sub-flowchart of the semantic feature extraction method according to the embodiment of the present invention.
Fig. 3 is a second sub-flowchart of the semantic feature extraction method according to the embodiment of the present invention.
Fig. 4 is a third sub-flowchart of the semantic feature extraction method according to the embodiment of the present invention.
Fig. 5 is a fourth sub-flowchart of the semantic feature extraction method according to the embodiment of the present invention.
Fig. 6 is a schematic internal structure diagram of a terminal according to an embodiment of the present invention.
The implementation, functional features and advantages of the objects of the present invention will be further explained with reference to the accompanying drawings.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The terms "first," "second," "third," "fourth," and the like in the description and in the claims of the present application and in the above-described drawings (if any) are used for distinguishing between similar items and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances; in other words, the embodiments described can be practiced in sequences other than those illustrated or described herein. Moreover, the terms "comprises," "comprising," and any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to only those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, system, article, or apparatus.
It should be noted that the description relating to "first", "second", etc. in the present invention is for descriptive purposes only and is not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include one or more of that feature. In addition, technical solutions between various embodiments may be combined with each other, but must be realized by a person skilled in the art, and when the technical solutions are contradictory or cannot be realized, such a combination should not be considered to exist, and is not within the protection scope of the present invention.
Please refer to fig. 1, which is a flowchart illustrating a semantic feature extraction method according to an embodiment of the present invention. The semantic feature extraction method is used for extracting semantic feature vectors of text data of all classes. The semantic feature extraction method specifically comprises the following steps.
Step S102, acquiring text data of a plurality of categories, wherein the text data of each category comprises a plurality of words. In the present embodiment, the category names of an image data set are used as search terms, and a description text of each category is crawled from an online encyclopedia according to the search terms. The image data set includes, but is not limited to, ImageNet, PASCAL VOC, COCO, and the like. Online encyclopedias include, but are not limited to, Wikipedia, encyclopedia, and the like. Preferably, the category names in ImageNet are used as search terms, and the description texts of the corresponding categories are crawled from Wikipedia. The description texts of all the categories are preprocessed to obtain the text data of each category. Specifically, a word segmenter is used to segment the description text of each category into words, and stop words and special characters in the description text are deleted to obtain the text data. The word segmenter includes, but is not limited to, the jieba word segmenter, the ansj word segmenter, the HanLP word segmenter, and the like. It will be appreciated that the plurality of categories includes all categories in the image data set, and that the text data of each category can be used to describe the respective category. For example, with the category "zebra" as a search term, the description text crawled from Wikipedia includes: "Zebras area grass-applying analytes in parts of Africa, best facial for the same connecting strips, important floor boards, floor board." The text data obtained by preprocessing the description text comprises: "Zebras grass-cutting animals in parts of Africa, facial cutting strips, fibrous blocks, muscle speed strips, use power full jaws, facial cutting strips, facial strips."
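The preprocessing in step S102 can be sketched as follows. This is a minimal illustration, not the patented implementation: it assumes English text, replaces the word segmenter with a simple regular expression, and uses a tiny hypothetical stop-word list (`STOP_WORDS`) and a made-up sample sentence. For Chinese text, a segmenter such as jieba would take the place of the regular expression.

```python
import re

# Hypothetical stop-word list for illustration only; a real pipeline would load a fuller list.
STOP_WORDS = {"a", "an", "the", "are", "is", "in", "of", "for", "their", "and"}


def preprocess(description: str) -> list[str]:
    """Lowercase the text, keep word tokens only, and drop stop words."""
    # Letters, optionally joined by hyphens; digits and special characters are discarded.
    tokens = re.findall(r"[a-z]+(?:-[a-z]+)*", description.lower())
    return [t for t in tokens if t not in STOP_WORDS]


# Made-up sample sentence, not taken from the crawled description text.
words = preprocess("Zebras are grass-eating animals in parts of Africa!")
print(words)
```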
Step S104, calculating the relevance of each word and each category in all text data. In this embodiment, each word in the text data of all categories is numbered, and a total number mapping table of all the words and their corresponding numbers is obtained. The same word has the same number in all categories, and each category has a corresponding sub-number mapping table. The number of times each word appears in the text data of the categories is counted, and the corresponding times are recorded to form a total times mapping table of numbers and times. Each category has a corresponding sub-times mapping table. For example, suppose there are three categories: category A, category B, and category C. The text data of category A comprises word a, word c and word d; the text data of category B comprises word a, word b and word e; and the text data of category C comprises word b, word c and word c. Numbering word a, word b, word c, word d and word e gives the total number mapping table 1-word a, 2-word b, 3-word c, 4-word d and 5-word e. Correspondingly, the sub-number mapping table of category A is 1-word a, 3-word c, 4-word d; the sub-number mapping table of category B is 1-word a, 2-word b, 5-word e; and the sub-number mapping table of category C is 2-word b, 3-word c. Counting the times each word appears in all text data, word a appears in category A and category B, word b appears in category B and category C, word c appears in category A and category C, word d appears in category A, and word e appears in category B, so the total times mapping table is 1-2, 2-2, 3-2, 4-1 and 5-1. Correspondingly, the sub-times mapping table of category A is 1-2, 3-2 and 4-1; the sub-times mapping table of category B is 1-2, 2-2 and 5-1; and the sub-times mapping table of category C is 2-2 and 3-2.
It is understood that if a word appears multiple times in the same category, it is counted only once for that category. Because numeric data is easier for a computer to process than text data, the computer can operate on the number of each word, which reduces the amount of calculation.
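The mapping tables of the three-category example can be reproduced with a short sketch. The variable names (`text_data`, `number_of`, `total_times`, `sub_numbers`, `sub_times`) are illustrative assumptions, and numbering the words in sorted order is one possible choice that happens to match the example.

```python
from collections import Counter

# Text data per category from the example (word c occurs twice in category C).
text_data = {
    "A": ["a", "c", "d"],
    "B": ["a", "b", "e"],
    "C": ["b", "c", "c"],
}

# Total number mapping table: each distinct word gets one number shared by all categories.
vocabulary = sorted({w for words in text_data.values() for w in words})
number_of = {word: i for i, word in enumerate(vocabulary, start=1)}

# Total times mapping table: number -> how many categories contain the word.
# set(words) ensures a word repeated inside one category is counted once.
total_times = Counter(
    number_of[w] for words in text_data.values() for w in set(words)
)

# Sub-tables: restrict both mappings to the words present in each category.
sub_numbers = {c: {number_of[w]: w for w in set(ws)} for c, ws in text_data.items()}
sub_times = {c: {n: total_times[n] for n in nums} for c, nums in sub_numbers.items()}

print(dict(total_times))
print(sub_times["C"])
```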
Calculating the word frequency of each word in each text data, calculating the reverse text frequency of each word according to the text data of all categories, and calculating the TF-IDF value between each word and each category in all the text data according to the word frequency and the reverse text frequency. Specifically, the number of times each word appears in the corresponding text data is counted, the number of all words in the text data is counted, and the ratio of the number of times each word appears to the number of words is calculated as the word frequency of each word in the text data. Counting the number of all categories, and calculating the logarithm of the ratio of the number of categories to the number of times each word appears as the reverse text frequency of the word. The TF-IDF value is calculated using a first formula. Specifically, the first formula is $\mathrm{TFIDF}_{i,j} = \mathrm{TF}_{i,j} \times \mathrm{IDF}_{i}$, where $\mathrm{TFIDF}_{i,j}$ represents the TF-IDF value between word $w_i$ and category $c_j$, $\mathrm{TF}_{i,j}$ represents the word frequency of word $w_i$ in the text data of category $c_j$, and $\mathrm{IDF}_{i}$ represents the reverse text frequency of word $w_i$. The larger the TF-IDF value, the higher the relevance of the word to the category, i.e., the greater the likelihood that the word can be used to describe the category; the smaller the TF-IDF value, the lower the relevance of the word to the category, i.e., the less likely that the word can be used to describe the category.
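The TF-IDF computation described above can be sketched as follows. This is a sketch under stated assumptions: the function name `tf_idf` is hypothetical, the natural logarithm is used because the text does not name a base, no smoothing is applied, and the number of times a word appears is taken as the number of categories containing it (counted once per category).

```python
import math
from collections import Counter


def tf_idf(text_data: dict[str, list[str]]) -> dict[str, dict[str, float]]:
    """Return {category: {word: TF-IDF value}} for the given per-category word lists."""
    n_categories = len(text_data)
    # Number of categories in which each word appears (once per category).
    category_count = Counter(w for words in text_data.values() for w in set(words))
    scores: dict[str, dict[str, float]] = {}
    for category, words in text_data.items():
        counts = Counter(words)
        scores[category] = {
            # Word frequency times reverse text frequency (natural log is an assumption).
            w: (n / len(words)) * math.log(n_categories / category_count[w])
            for w, n in counts.items()
        }
    return scores


example = {"A": ["a", "c", "d"], "B": ["a", "b", "e"], "C": ["b", "c", "c"]}
scores = tf_idf(example)
print(scores["A"])
```

In this toy data, word d occurs only in category A, so it scores higher for category A than word a, which is shared with category B.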
Step S106, selecting some words from the text data as candidate words according to the relevance. In the present embodiment, the TF-IDF values corresponding to each category are sorted in descending order. That is, the TF-IDF values between all words in the text data of each category and that category are sorted in descending order. Starting from the word with the largest TF-IDF value and proceeding in descending order, a preset number of words are selected as candidate words. The preset number can be set according to actual conditions, and its size determines the quality of the candidate words. If the preset number is too small, few words are available for distinguishing the categories; if the preset number is too large, the candidate words may include words that cannot distinguish between categories.
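The selection in step S106 amounts to a top-k cut over the sorted TF-IDF values of one category. A minimal sketch, with a hypothetical function name and made-up scores:

```python
def select_candidates(scores: dict[str, float], preset_number: int) -> list[str]:
    """Return the preset_number words with the largest TF-IDF values, largest first."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:preset_number]


# Made-up TF-IDF values for one category, for illustration only.
zebra_scores = {"striped": 0.9, "animal": 0.2, "hoofed": 0.5}
print(select_candidates(zebra_scores, 2))
```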
Step S108, calculating mutual information between the preset feature words in the preset feature word bank and the candidate words. The preset feature words in the preset feature word bank are used for describing the categories of the text data. In this embodiment, the preset feature word bank is constructed from human expert knowledge. The preset feature words in the preset feature word bank include, but are not limited to, words for describing color, size, material, and the like. For example, the preset feature words for describing animals include, but are not limited to, tailed, hooded, herbivorous, speckled, terrestrial, aquatic, etc., and the preset feature words for describing plants include, but are not limited to, ferns, bryophytes, fruity, floriated, etc. It is understood that the preset feature words in the preset feature word bank alone cannot accurately and completely describe all the categories of the text data. All the preset feature words in the preset feature word bank are divided by category, so that each category comprises a plurality of preset feature words. The same preset feature word may be assigned to more than one category.
A first probability of each preset feature word and a second probability of each candidate word occurring in all text data are calculated, a third probability of the preset feature word and the candidate word occurring together in a sentence of all text data is calculated, and, according to the first probability, the second probability and the third probability, the mean value of the point-by-point mutual information between each candidate word and all preset feature words of each category is calculated as the mutual information. Specifically, the ratio of the number of times a preset feature word occurs in all text data to the number of all words in all text data is calculated as the first probability of the preset feature word; the ratio of the number of times a candidate word occurs in all text data to the number of all words in all text data is calculated as the second probability of the candidate word; and the ratio of the number of times the preset feature word and the candidate word occur together in a sentence of all text data to the number of all sentences in all text data is calculated as the corresponding third probability. The point-by-point mutual information between a candidate word and a preset feature word is calculated using a second formula. Specifically, the second formula is $\mathrm{PMI}(w, f) = \log \frac{p(w, f)}{p(f)\,p(w)}$, where $\mathrm{PMI}(w, f)$ is the point-by-point mutual information between candidate word $w$ and preset feature word $f$, $p(f)$ is the first probability of preset feature word $f$, $p(w)$ is the second probability of candidate word $w$, and $p(w, f)$ is the third probability between candidate word $w$ and preset feature word $f$. The average value of the point-by-point mutual information between each candidate word and all preset feature words in the same category is calculated as the mutual information between the candidate word and the category. The point-by-point mutual information represents the correlation between the candidate word and the preset feature word: the larger the point-by-point mutual information, the stronger the correlation between the candidate word and the preset feature word; the smaller the point-by-point mutual information, the weaker the correlation. The mutual information represents the correlation between the candidate word and the category: the larger the mutual information, the stronger the correlation between the candidate word and the category; the smaller the mutual information, the weaker the correlation.
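The three probabilities and the averaged point-by-point mutual information can be sketched as follows. The sketch assumes the usual definition of pointwise mutual information (logarithm of the joint probability divided by the product of the two individual probabilities), uses the natural logarithm, and returns negative infinity when a probability is zero. The zero handling, the function names, and the toy sentences are assumptions that the text does not specify.

```python
import math


def pmi(candidate: str, feature: str, sentences: list[list[str]]) -> float:
    """Point-by-point mutual information between a candidate word and a preset feature word."""
    total_words = sum(len(s) for s in sentences)
    # First probability: occurrences of the preset feature word over all words.
    p_feature = sum(s.count(feature) for s in sentences) / total_words
    # Second probability: occurrences of the candidate word over all words.
    p_candidate = sum(s.count(candidate) for s in sentences) / total_words
    # Third probability: sentences containing both words over all sentences.
    p_joint = sum(1 for s in sentences if feature in s and candidate in s) / len(sentences)
    if p_feature == 0 or p_candidate == 0 or p_joint == 0:
        return float("-inf")  # assumption: no co-occurrence means no measurable correlation
    return math.log(p_joint / (p_feature * p_candidate))


def mutual_information(candidate: str, category_features: list[str],
                       sentences: list[list[str]]) -> float:
    """Mean PMI between the candidate word and all preset feature words of one category."""
    values = [pmi(candidate, f, sentences) for f in category_features]
    return sum(values) / len(values)


# Toy corpus: each inner list is one sentence of already segmented words.
corpus = [["zebra", "striped", "herbivorous"], ["zebra", "runs"], ["lion", "hunts"]]
print(pmi("striped", "herbivorous", corpus))
print(mutual_information("striped", ["herbivorous"], corpus))
```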
Step S110, selecting some of the candidate words according to the mutual information and adding them to the preset feature word bank to form a category semantic word bank. In this embodiment, it is determined whether the mutual information is greater than a preset threshold. When the mutual information is greater than the preset threshold, the corresponding candidate word is added to the preset feature word bank. It is understood that mutual information greater than the preset threshold indicates that the correlation between the candidate word and the category is strong, so the candidate word can be used to describe the category. The candidate words added in this way supplement the preset feature word bank to form the category semantic word bank. That is, the category semantic word bank includes the preset feature words and the candidate words that satisfy the condition. The preset threshold can be set according to actual conditions.
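The threshold test in step S110 can be sketched as a filter over precomputed mutual information values. The data layout (one set of words per category), the function name, and the numbers below are hypothetical.

```python
def expand_word_bank(preset_words: dict[str, set[str]],
                     mutual_info: dict[str, dict[str, float]],
                     threshold: float) -> dict[str, set[str]]:
    """Add candidates whose mutual information exceeds the threshold to each category."""
    # Copy the preset feature words so the input is left unchanged.
    word_bank = {category: set(words) for category, words in preset_words.items()}
    for category, scores in mutual_info.items():
        for candidate, value in scores.items():
            if value > threshold:
                word_bank.setdefault(category, set()).add(candidate)
    return word_bank


# Made-up values for illustration only.
preset = {"animal": {"herbivorous"}}
mi = {"animal": {"striped": 2.1, "africa": 0.3}}
print(expand_word_bank(preset, mi, threshold=1.0))
```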
Step S112, performing mask processing on the text data according to the category semantic word bank to obtain a mask text. In this embodiment, the preset feature words and the candidate words satisfying the condition are collectively referred to as category feature words. Mask processing is performed on words in the text data according to the category feature words in the category semantic word bank to obtain the mask text. The specific procedure of the mask processing will be described in detail below.
Step S114, training a BERT model according to the mask text to obtain a semantic feature extraction model. The BERT model is trained on the mask text so that it can predict a predicted word vector corresponding to each masked word and a predicted [CLS] vector corresponding to the mask text, where the predicted [CLS] vector is a semantic representation of the corresponding mask text. The specific process of training the BERT model on the mask text to obtain the semantic feature extraction model will be described in detail below.
Step S116, inputting the text data of each category into the semantic feature extraction model to obtain a corresponding semantic feature vector. It is understood that the semantic feature extraction model can obtain a text [CLS] vector corresponding to the text data of each category, and the text [CLS] vector is a semantic representation of the text data of the corresponding category, i.e., a semantic feature vector. The specific process of inputting the text data of each category into the semantic feature extraction model to obtain the corresponding semantic feature vector will be described in detail below.
In some possible embodiments, the text data may be in any type of language. The word segmentation processing modes adopted by the description texts of different types of languages in the preprocessing stage can be different, but the subsequent processes of forming a category semantic word bank and training a BERT model are the same.
In the above embodiment, some words are selected from the text data as candidate words according to the relevance between each word and each category in the text data of the plurality of categories, and some of the candidate words are then selected, according to the mutual information between the candidate words and the preset feature words in the preset feature word bank, and added to the preset feature word bank. This expands the preset feature word bank into the category semantic word bank, so that the category semantic word bank is rich in variety and wide in coverage. Mask processing is performed on the text data according to the category semantic word bank to obtain a mask text, and a BERT model is trained on the mask text, so that the BERT model attends to the words in the category semantic word bank when encoding the semantic representation of a text, and the encoded representation focuses on the feature description of the categories rather than on the semantics of ordinary text. The semantic feature vectors obtained by inputting the text data into the semantic feature extraction model can replace manual labeling, which greatly reduces the manual labeling workload and allows zero-shot learning (zero sample learning) to be trained on large unlabeled data sets. Meanwhile, because the feature description learned by the BERT model is implicit and draws on the rich semantic knowledge contained in the text data of the plurality of categories, the BERT model adapts well to different categories, and the domain shift problem caused by manual labeling is avoided.
The semantic feature extraction model is also applicable to data sets annotated with semantic attributes and to other large data sets. When text data of any category is input into the semantic feature extraction model, a semantic feature vector that focuses on the category feature description of the text data is obtained. The semantic feature vector can be used as the semantic representation of a category in zero-shot learning (zero sample learning). This greatly reduces the manual labeling workload and strikes a balance between two problems: the heavy workload of manually labeling semantic information, and the reduced accuracy that results from using raw text word vectors.
Referring to fig. 2 and fig. 5 in combination, fig. 2 is a first sub-flowchart of a semantic feature extraction method according to an embodiment of the present invention, and fig. 5 is a fourth sub-flowchart of the semantic feature extraction method according to the embodiment of the present invention. Step S112 specifically includes the following steps.
Step S202, the words in the text data that are the same as category feature words in the category semantic word bank are used as feature descriptors. For example, if the text data is: "Zebras grass-eating animals in parts of Africa, facial disturbed stressed substrates, human bodies, and mental speed programs, use power full jaws help videos and eating animals, and un-eating animals of positive animals good sense of health", and the category feature words in the category semantic word bank include grass-eating and striped, then grass-eating and striped in the text data are feature descriptors.
Step S204, each feature descriptor and each word in the text data whose distance from the feature descriptor does not exceed a preset distance are used as a feature word pair. Preferably, each feature descriptor and each noun in the text data whose distance from the feature descriptor does not exceed the preset distance are together constructed as a feature word pair. The preset distance is 3 word distances. For example, the words whose distance from the feature descriptor grass-eating does not exceed 3 word distances include zebras and animals, and the words whose distance from the feature descriptor striped does not exceed 3 word distances include dates and books. The feature word pairs then include zebras grass-eating, grass-eating animals, striped dates, and striped books. It is understood that a noun whose distance from a feature descriptor does not exceed the preset distance can be considered strongly related to that feature descriptor. Therefore, the noun and the corresponding feature descriptor are used as a feature word pair.
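Steps S202 and S204 can be sketched as a window search around each feature descriptor. Several assumptions are made here: word distance is read as the difference between token positions, the noun test is replaced by membership in a hypothetical `nouns` set (a real pipeline would use a part-of-speech tagger), and the token list is made up.

```python
def feature_word_pairs(tokens: list[str], feature_words: set[str], nouns: set[str],
                       max_distance: int = 3) -> list[tuple[str, str]]:
    """Pair each feature descriptor with the nouns at most max_distance positions away."""
    pairs = []
    for i, token in enumerate(tokens):
        if token not in feature_words:
            continue  # S202: only words found in the category semantic word bank qualify
        start = max(0, i - max_distance)
        end = min(len(tokens), i + max_distance + 1)
        for j in range(start, end):
            if j != i and tokens[j] in nouns:
                pairs.append((token, tokens[j]))  # S204: descriptor plus nearby noun
    return pairs


# Made-up token list and noun set for illustration only.
tokens = ["zebras", "grass-eating", "animals", "parts", "africa"]
nouns = {"zebras", "animals", "africa"}
print(feature_word_pairs(tokens, {"grass-eating"}, nouns))
```

With the default distance of 3, the word "africa" (three positions after the descriptor) is still paired; a smaller `max_distance` narrows the window.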
Step S206, mask processing is performed on the feature descriptors, or on the feature descriptors and the feature word pairs, in the text data to obtain a mask text. The mask processing specifically includes the following steps.
In step S2061, the length of each sentence in the text data is calculated. Taking the periods as separating symbols, the number of words included in each sentence is calculated as the corresponding length.
In step S2062, the feature descriptors in each sentence are randomly masked. And randomly selecting feature descriptors in the sentence for masking.
In step S2063, it is determined whether the ratio between the number of the masked feature descriptors in each sentence and the corresponding length is equal to a preset ratio. In the process of masking, the ratio of the number of the feature descriptors which are masked in the sentence to the length of the corresponding sentence is calculated, and whether the ratio is equal to a preset ratio or not is judged. Wherein the preset ratio is 10-15%. Preferably, the preset ratio is 10%. When the ratio between the number of the masked feature descriptors in each sentence and the corresponding length is smaller than the preset ratio, performing step S2064; when the ratio between the number of the masked feature descriptors in each sentence and the corresponding length is equal to the preset ratio, step S2066 is performed. It is understood that when the ratio between the number of the feature descriptors masked in the sentence and the corresponding length reaches a preset ratio, step S2066 is performed; when all the feature descriptors in the sentence are masked, but the ratio between the number of the masked feature descriptors in the sentence and the corresponding length is still smaller than the preset ratio, step S2064 is performed.
In step S2064, the feature word pairs in each sentence are randomly masked. And randomly selecting feature word pairs in the sentences to carry out mask. It can be understood that the feature descriptors in the feature word pairs are already masked, and when the ratio between the number of the masked feature descriptors in each sentence and the corresponding length is smaller than a preset ratio, words corresponding to non-feature descriptors in the feature word pairs in the sentence are randomly selected for masking.
Step S2065, determine whether the ratio between the number of the feature descriptors and feature word pairs masked in each sentence and the corresponding length is equal to a preset ratio. In the process of masking, the ratio of the number of the feature descriptors and feature word pairs masked in the sentence to the length of the corresponding sentence is calculated, and whether the ratio is equal to a preset ratio or not is judged. It is understood that the number of feature descriptors and feature word pairs includes the number of feature descriptors and the number of non-feature descriptors in the feature word pairs. When the ratio between the number of the feature descriptors and feature word pairs masked in each sentence and the corresponding length is equal to a preset ratio, step S2066 is performed.
In step S2066, a mask text is obtained. It can be understood that when the ratio of the number of words to be masked in the sentence to the corresponding sentence length is equal to the preset ratio, the other words in the sentence are not masked again, and the mask text is obtained.
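Steps S2061 to S2066 can be sketched for a single sentence as follows. This is an illustration under stated assumptions rather than the embodiment's implementation. The names `split_sentences`, `mask_sentence`, and `partner_positions` are hypothetical. The mask budget is taken to be the preset ratio times the sentence length, rounded and never below one word, which is an assumption because the embodiment does not say how a non-integer target is handled.

```python
import random

MASK = "[MASK]"

def split_sentences(text):
    """S2061: split on periods; each sentence becomes a list of words."""
    return [s.split() for s in text.split(".") if s.strip()]

def mask_sentence(tokens, descriptor_positions, partner_positions, ratio=0.10, rng=None):
    """S2062-S2066: mask descriptors first, then pair partners, up to the budget."""
    rng = rng or random.Random()
    # Assumed rounding rule: at least one word, otherwise round(ratio * length).
    budget = max(1, round(len(tokens) * ratio))
    descriptors = list(descriptor_positions)
    rng.shuffle(descriptors)
    chosen = descriptors[:budget]          # S2062/S2063
    if len(chosen) < budget:               # all descriptors masked, still short
        partners = [p for p in partner_positions if p not in chosen]
        rng.shuffle(partners)
        chosen += partners[: budget - len(chosen)]   # S2064/S2065
    masked = [MASK if i in chosen else tok for i, tok in enumerate(tokens)]
    return masked, sorted(chosen)          # S2066

# Hypothetical 20-word sentence: descriptor at position 2, pair partners at 0 and 3.
text = ("zebras are grass-eating animals that live in herds on the open "
        "plains of eastern and southern africa all year round.")
tokens = split_sentences(text)[0]
masked_tokens, positions = mask_sentence(tokens, [2], [0, 3], ratio=0.10,
                                         rng=random.Random(0))
print(" ".join(masked_tokens))
```

With a 20-word sentence and a 10% ratio the budget is two words, so the single feature descriptor is masked first and one of its pair partners is then masked to reach the ratio.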
In the above embodiment, the feature descriptors and the feature word pairs in the text data are obtained according to the category feature words in the category semantic word library, and the feature descriptors, or the feature descriptors together with the feature word pairs, are masked to obtain the mask text. As a result, when the BERT model is trained with the mask text, the model pays more attention to the masked feature descriptors and feature word pairs, so that its encoding of the text semantic representation focuses more on the feature description of the category. Because the ratio of the number of masked words to the sentence length is the preset ratio, the BERT model can effectively learn the relationship between the masked words and their context without the task becoming too difficult, and it retains enough capacity to predict the predicted word vectors of all the masked words in the sentence.
Please refer to fig. 3, which is a second sub-flowchart of the semantic feature extraction method according to the embodiment of the present invention. Step S114 specifically includes the following steps.
Step S302, the mask text is input into the BERT model to obtain predicted word vectors and a predicted [CLS] vector. The predicted word vector corresponds to the feature descriptor, and the predicted [CLS] vector corresponds to the mask text. In this embodiment, each feature descriptor corresponds to one predicted word vector, which is the word vector that the BERT model predicts for the feature descriptor from its context; each mask text corresponds to one predicted [CLS] vector, which is the vector obtained when the BERT model encodes the semantic information of the mask text.
Step S304, a first loss value is constructed according to the feature descriptors and the predicted word vectors. In the present embodiment, the first loss value is constructed using a third formula. [Third formula and its symbols: formula images 013 to 022 and 003 of the original publication, not reproduced here.] The symbols of the third formula respectively represent: the first loss value; a word; the number of words in the sentence; the vector of the masked feature descriptor; the predicted word vector expressed as a probability vector over the category feature words in the category semantic word library; the predicted word vector of the masked word; the first parameter; the second parameter; and an indicator whose value is 1 or 0. When the word is masked, the indicator is 1; when the word is not masked, the indicator is 0. The vector of the masked feature descriptor is a one-hot vector whose values correspond one-to-one to the category feature words in the category semantic word library; the value at the category feature word that matches the feature descriptor is 1, and the remaining values are 0. The first parameter and the second parameter are parameters of the output layer of the BERT model that outputs the predicted word vector.
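Because the third formula survives only as an image, its exact form cannot be confirmed from this text. One reading that is consistent with the listed definitions is a masked cross-entropy, sketched below. Every symbol name here is an assumption introduced for illustration: $L_1$ for the first loss value, $w$ for a word, $N$ for the number of words in the sentence, $m_w$ for the 1/0 mask indicator, $y_w$ for the one-hot vector of the masked feature descriptor, $h_w$ for the predicted word vector, and $W_1$, $b_1$ for the first and second parameters. The use of softmax in the output layer is likewise an assumption.

```latex
\hat{y}_w = \operatorname{softmax}\!\left(W_1 h_w + b_1\right), \qquad
L_1 = -\sum_{w=1}^{N} m_w \, y_w^{\top} \log \hat{y}_w
```

Under this reading, only masked words contribute to the loss, because $m_w = 0$ removes every unmasked word from the sum.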
Step S306, a second loss value is constructed according to the feature word pairs and the predicted [CLS] vector. In the present embodiment, the second loss value is constructed using a fourth formula. [Fourth formula and its symbols: formula images 023 to 033 of the original publication, not reproduced here.] The symbols of the fourth formula respectively represent: the second loss value; a feature word pair; the number of masked feature word pairs in the sentence; the vector representing all the masked feature word pairs in the sentence; a normalization function; the probability vector of the predicted [CLS] vector after normalization by that function; the predicted [CLS] vector; the third parameter; and the fourth parameter. The vector representing all the masked feature word pairs is a one-hot vector whose values correspond one-to-one to the category feature words in the category semantic word library; the values at the category feature words that match the words of all the masked feature word pairs in one sentence are 1, and the remaining values are 0. The third parameter and the fourth parameter are parameters of the output layer of the BERT model that outputs the predicted [CLS] vector.
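The fourth formula is likewise available only as an image. A form consistent with the definitions above is a cross-entropy between the target vector of the masked feature word pairs and the normalized predicted [CLS] vector, sketched below. All symbol names are assumptions introduced for illustration: $L_2$ for the second loss value, $N_p$ for the number of masked feature word pairs, $y_p$ for the target vector, $h_{\mathrm{cls}}$ for the predicted [CLS] vector, and $W_2$, $b_2$ for the third and fourth parameters. The choice of softmax as the normalization function and the division by $N_p$ are also guesses and may differ from the original formula.

```latex
\hat{y}_{\mathrm{cls}} = \operatorname{softmax}\!\left(W_2 h_{\mathrm{cls}} + b_2\right), \qquad
L_2 = -\frac{1}{N_p} \, y_p^{\top} \log \hat{y}_{\mathrm{cls}}
```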
Step S308, the BERT model is trained according to the first loss value, or according to the first loss value and the second loss value, to obtain a semantic feature extraction model. It can be understood that when only the feature descriptors in the mask text are masked, the BERT model is trained according to the first loss value; when the feature word pairs are also masked, the BERT model is trained according to the first loss value and the second loss value. When the BERT model is trained based on the first loss value and the second loss value, a total loss value is calculated from the first loss value and the second loss value using a fifth formula. [Fifth formula and its symbols: formula images 034, 035, 015 and 025 of the original publication, not reproduced here.] The symbols of the fifth formula respectively represent the total loss value, the first loss value, and the second loss value. The first loss value or the total loss value is minimized so that the parameters are updated, and a BERT model with stable first, second, third, and fourth parameters is obtained as the semantic feature extraction model.
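The choice between the first loss value and the total loss value in step S308 can be illustrated numerically. The sketch below rests on assumptions that the text does not confirm: both losses are cross-entropies over softmax outputs, the total loss is the plain sum of the two, and the inputs are the raw output-layer scores (logits). All function and variable names are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of scores."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(target, probs):
    """Cross-entropy of a 0/1 target vector against a probability vector."""
    return -sum(math.log(p) for t, p in zip(target, probs) if t)

def first_loss(word_logits, word_targets, is_masked):
    """Assumed first loss: cross-entropy summed over masked words only."""
    return sum(m * cross_entropy(y, softmax(z))
               for z, y, m in zip(word_logits, word_targets, is_masked))

def second_loss(cls_logits, pair_target, n_pairs):
    """Assumed second loss: cross-entropy on the [CLS] output, averaged over pairs."""
    return cross_entropy(pair_target, softmax(cls_logits)) / n_pairs

def training_loss(l1, l2, pairs_masked):
    """S308: first loss alone, or (assumed) sum of both when pairs are masked."""
    return l1 + l2 if pairs_masked else l1

# Hypothetical toy values: 3 words, a library of 4 category feature words.
word_logits = [[0.0] * 4 for _ in range(3)]
word_targets = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0]]
l1 = first_loss(word_logits, word_targets, is_masked=[1, 0, 1])
l2 = second_loss([0.0] * 4, pair_target=[1, 1, 0, 0], n_pairs=1)
print(training_loss(l1, l2, pairs_masked=False), training_loss(l1, l2, pairs_masked=True))
```

In an actual training loop the selected value would be passed to an optimizer to update the model parameters; that part is omitted here.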
In the above embodiment, a first loss value is constructed according to the predicted word vectors output by the BERT model and the feature descriptors, and a second loss value is constructed according to the predicted [CLS] vector output by the BERT model and the feature word pairs. The parameters of the BERT model are updated according to the first loss value, or according to the first loss value and the second loss value, so that a BERT model with stable parameters is obtained as the semantic feature extraction model, and the semantic feature extraction model has the capability of extracting category semantic features.
Please refer to fig. 4, which is a third sub-flowchart of the semantic feature extraction method according to the embodiment of the present invention. Step S116 specifically includes the following steps.
Step S402, the text data of each category is input into the semantic feature extraction model to obtain a corresponding text [CLS] vector. The text [CLS] vector is obtained when the semantic feature extraction model encodes the semantic information of the text data.
Step S404, the text [CLS] vector is used as a semantic feature vector. The semantic feature vector can be used in place of manually labeled class attribute information as input data for zero-sample learning.
In this embodiment, text data of a category that was not used for training the BERT model can also be input to the semantic feature extraction model to obtain a corresponding text [CLS] vector as a semantic feature vector.
In the above embodiment, the text data is input into the semantic feature extraction model to obtain the text [CLS] vector. Because the text [CLS] vector can accurately describe the category semantics of the text data, it is used as the semantic feature vector of the corresponding category. The semantic feature vector can replace manual labeling, so that the manual labeling cost of zero-sample learning is greatly reduced.
Please refer to fig. 6, which is a schematic diagram of an internal structure of a terminal according to an embodiment of the present invention. The terminal 10 includes a computer-readable storage medium 11, a processor 12, and a bus 13. The computer-readable storage medium 11 includes at least one type of readable storage medium, which includes a flash memory, a hard disk, a multimedia card, a card-type memory (e.g., SD or DX memory, etc.), a magnetic memory, a magnetic disk, an optical disk, and the like. The computer readable storage medium 11 may in some embodiments be an internal storage unit of the terminal 10, such as a hard disk of the terminal 10. The computer readable storage medium 11 may also be, in other embodiments, an external storage device of the terminal 10, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), etc. provided on the terminal 10. Further, the computer-readable storage medium 11 may also include both an internal storage unit and an external storage device of the terminal 10. The computer-readable storage medium 11 may be used not only to store application software and various types of data installed in the terminal 10 but also to temporarily store data that has been output or will be output.
The bus 13 may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, only one thick line is shown in FIG. 6, but this is not intended to represent only one bus or type of bus.
Further, the terminal 10 may also include a display assembly 14. The display component 14 may be a Light Emitting Diode (LED) display, a liquid crystal display, a touch-sensitive liquid crystal display, an Organic Light-Emitting Diode (OLED) touch panel, or the like. The display component 14 may also be referred to as a display device or display unit, as appropriate, for displaying information processed in the terminal 10 and for displaying a visual user interface, among other things.
Further, the terminal 10 may also include a communication component 15. The communication component 15 may optionally include a wired communication component and/or a wireless communication component, such as a WI-FI communication component, a bluetooth communication component, etc., typically used to establish a communication connection between the terminal 10 and other intelligent control devices.
The processor 12 may be, in some embodiments, a Central Processing Unit (CPU), controller, microcontroller, microprocessor or other data Processing chip for executing program codes stored in the computer-readable storage medium 11 or Processing data. Specifically, the processor 12 executes a processing program to control the terminal 10 to implement the semantic feature extraction method.
Fig. 6 only shows the terminal 10 with components 11-15 for implementing the semantic feature extraction method, and it will be understood by those skilled in the art that the structure shown in fig. 6 does not constitute a limitation of the terminal 10, and that the terminal 10 may comprise fewer or more components than shown, or combine some components, or a different arrangement of components.
It will be apparent to those skilled in the art that various changes and modifications may be made in the present invention without departing from the spirit and scope of the invention. Thus, insofar as these modifications and variations of the invention fall within the scope of the claims of the invention and their equivalents, the invention is intended to include these modifications and variations.
The above-mentioned embodiments are only preferred examples of the present invention and should not be construed as limiting the scope of the present invention; therefore, equivalent changes made in accordance with the claims of the present invention still fall within the scope covered by the present invention.

Claims (9)

1. A semantic feature extraction method is characterized by comprising the following steps:
acquiring text data of a plurality of categories, wherein the text data of each category comprises a plurality of words;
calculating the relevance of each word in all the text data and each category;
selecting partial words from the text data as candidate words according to the relevance;
calculating mutual information between preset feature words in a preset feature word library and the candidate words, wherein the preset feature words in the preset feature word library are used for describing the categories of the text data;
selecting partial candidate words according to the mutual information and adding the partial candidate words into the preset feature word library to form a category semantic word library;
performing mask processing on the text data according to the category semantic word library to obtain a mask text;
training a BERT model according to the mask text to obtain a semantic feature extraction model, wherein the training of the BERT model according to the mask text to obtain the semantic feature extraction model specifically comprises:
taking words in the text data, which are the same as the category feature words in the category semantic word library, as feature descriptors;
taking the feature descriptors and words in the text data, the distance between which and the feature descriptors does not exceed a preset distance, as feature word pairs;
inputting the mask text into the BERT model to obtain a predicted word vector and a predicted [CLS] vector, wherein the predicted word vector corresponds to the feature descriptors and the predicted [CLS] vector corresponds to the mask text;
constructing a first loss value according to the feature descriptors and the predicted word vector;
constructing a second loss value according to the feature word pairs and the predicted [CLS] vector; and
training the BERT model according to the first loss value, or according to the first loss value and the second loss value, to obtain the semantic feature extraction model; and
inputting the text data of each category into the semantic feature extraction model to obtain a corresponding semantic feature vector.
2. The semantic feature extraction method according to claim 1, wherein performing mask processing on the text data according to the category semantic word library to obtain a mask text specifically comprises:
performing mask processing on the feature descriptors, or on the feature descriptors and the feature word pairs, in the text data to obtain the mask text.
3. The semantic feature extraction method according to claim 1, wherein inputting the text data of each of the categories into the semantic feature extraction model to obtain the corresponding semantic feature vector specifically comprises:
inputting the text data of each category into the semantic feature extraction model to obtain a corresponding text [CLS] vector; and
taking the text [CLS] vector as the semantic feature vector.
4. The semantic feature extraction method according to claim 2, wherein performing mask processing on the feature descriptors, or on the feature descriptors and the feature word pairs, in the text data to obtain the mask text specifically comprises:
calculating the length of each sentence in the text data;
randomly masking the feature descriptors in each sentence;
judging whether the ratio of the number of the masked feature descriptors in each sentence to the corresponding length is equal to a preset ratio or not;
when the ratio of the number of the masked feature descriptors in each sentence to the corresponding length is equal to the preset ratio, obtaining the mask text;
when the ratio of the number of the masked feature descriptors in each sentence to the corresponding length is smaller than the preset ratio, randomly masking the feature word pairs in each sentence;
judging whether the ratio of the number of the masked feature descriptors and feature word pairs in each sentence to the corresponding length is equal to the preset ratio or not; and
when the ratio of the number of the masked feature descriptors and feature word pairs in each sentence to the corresponding length is equal to the preset ratio, obtaining the mask text.
5. The semantic feature extraction method according to claim 1, wherein calculating the relevance of each word to each category in all the text data specifically comprises:
calculating the word frequency of each word in each text data;
calculating the inverse document frequency of each word according to the text data of all the categories; and
calculating TF-IDF values between each word and each category in all the text data according to the word frequency and the inverse document frequency.
6. The semantic feature extraction method according to claim 5, wherein the selecting partial words from the text data as candidate words according to the relevance specifically comprises:
sorting the TF-IDF values corresponding to each of the categories in descending order; and
selecting a preset number of words as the candidate words in descending order, starting from the word corresponding to the maximum TF-IDF value.
7. The semantic feature extraction method according to claim 1, wherein the calculating of the mutual information between the preset feature words in the preset feature word library and the candidate words specifically comprises:
respectively calculating a first probability and a second probability of the preset feature words and the candidate words appearing in all the text data;
calculating a third probability that the preset feature words and the candidate words appear in sentences of all the text data at the same time; and
calculating, according to the first probability, the second probability and the third probability, the mean value of the pointwise mutual information between each candidate word and all the preset feature words in different categories as the mutual information.
8. The semantic feature extraction method according to claim 1, wherein selecting partial candidate words according to the mutual information and adding the partial candidate words into the preset feature word library to form a category semantic word library specifically comprises:
judging whether the mutual information is larger than a preset threshold value or not; and
when the mutual information is larger than the preset threshold value, adding the corresponding candidate word into the preset feature word library.
9. A computer-readable storage medium for storing program instructions executable by a processor to implement a semantic feature extraction method according to any one of claims 1 to 8.
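The word library construction recited in claims 5 to 8 can be sketched as follows. This is an illustration under stated assumptions, not the claimed implementation. All function names and sample data are hypothetical. Each category's text is treated as one document for the inverse document frequency. All three probabilities are estimated as fractions of sentences. Preset feature words that never co-occur with the candidate are skipped when averaging.

```python
import math
from collections import Counter

def tfidf_candidates(category_docs, top_k):
    """Claims 5-6: rank each category's words by TF-IDF and keep the top_k."""
    n_docs = len(category_docs)
    doc_freq = Counter()
    for tokens in category_docs.values():
        doc_freq.update(set(tokens))
    result = {}
    for category, tokens in category_docs.items():
        counts = Counter(tokens)
        scores = {w: (c / len(tokens)) * math.log(n_docs / doc_freq[w])
                  for w, c in counts.items()}
        # Descending score; ties broken alphabetically so the output is stable.
        result[category] = sorted(scores, key=lambda w: (-scores[w], w))[:top_k]
    return result

def mean_pmi(candidate, preset_words, sentences):
    """Claim 7: mean pointwise mutual information over the preset feature words."""
    n = len(sentences)
    sets = [set(s) for s in sentences]
    p_c = sum(candidate in s for s in sets) / n
    values = []
    for word in preset_words:
        p_w = sum(word in s for s in sets) / n
        p_cw = sum(candidate in s and word in s for s in sets) / n
        if p_cw > 0:  # assumed handling: skip pairs that never co-occur
            values.append(math.log(p_cw / (p_c * p_w)))
    return sum(values) / len(values) if values else float("-inf")

def extend_library(preset_words, candidates, sentences, threshold):
    """Claim 8: add candidates whose mutual information exceeds the threshold."""
    library = set(preset_words)
    for cand in candidates:
        if mean_pmi(cand, preset_words, sentences) > threshold:
            library.add(cand)
    return library

# Hypothetical toy corpus.
category_docs = {
    "zebra": "zebras have striped coats and striped legs".split(),
    "lion": "lions have golden manes and golden tails".split(),
}
candidates = tfidf_candidates(category_docs, top_k=2)
sentences = [["zebras", "striped", "grass-eating"], ["zebras", "striped"],
             ["lions", "meat-eating"], ["lions", "golden"]]
library = extend_library(["grass-eating"], ["striped", "golden"], sentences, 0.5)
print(candidates, library)
```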
CN202111589256.7A 2021-12-23 2021-12-23 Semantic feature extraction method and computer-readable storage medium Active CN113971407B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111589256.7A CN113971407B (en) 2021-12-23 2021-12-23 Semantic feature extraction method and computer-readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111589256.7A CN113971407B (en) 2021-12-23 2021-12-23 Semantic feature extraction method and computer-readable storage medium

Publications (2)

Publication Number Publication Date
CN113971407A CN113971407A (en) 2022-01-25
CN113971407B true CN113971407B (en) 2022-03-18

Family

ID=79590769

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111589256.7A Active CN113971407B (en) 2021-12-23 2021-12-23 Semantic feature extraction method and computer-readable storage medium

Country Status (1)

Country Link
CN (1) CN113971407B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111813939A (en) * 2020-07-13 2020-10-23 南京睿晖数据技术有限公司 Text classification method based on representation enhancement and fusion
CN111950269A (en) * 2020-08-21 2020-11-17 清华大学 Text statement processing method and device, computer equipment and storage medium
WO2021000362A1 (en) * 2019-07-04 2021-01-07 浙江大学 Deep neural network model-based address information feature extraction method

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111026937B (en) * 2019-11-13 2021-02-19 百度在线网络技术(北京)有限公司 Method, device and equipment for extracting POI name and computer storage medium
US11120585B2 (en) * 2019-11-28 2021-09-14 Shanghai United Imaging Intelligence Co., Ltd. Systems and methods for image reconstruction
CN111428490B (en) * 2020-01-17 2021-05-18 北京理工大学 Reference resolution weak supervised learning method using language model
US11574128B2 (en) * 2020-06-09 2023-02-07 Optum Services (Ireland) Limited Method, apparatus and computer program product for generating multi-paradigm feature representations
CN111428514A (en) * 2020-06-12 2020-07-17 北京百度网讯科技有限公司 Semantic matching method, device, equipment and storage medium
CN111984793A (en) * 2020-09-03 2020-11-24 平安国际智慧城市科技股份有限公司 Text emotion classification model training method and device, computer equipment and medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
"Fast Extraction of Semantic Features from a Latent Semantic Indexed Text Corpus";A. Kabán 等;《Neural Processing Letters》;20020215;第15卷(第1期);第31-34页 *
"基于句法语义特征的中文实体关系抽取";甘丽新 等;《计算机研究与发展》;20160215;第53卷(第2期);第284-302页 *

Also Published As

Publication number Publication date
CN113971407A (en) 2022-01-25

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CP03 Change of name, title or address

Address after: 518049 Floor 25, Block A, Zhongzhou Binhai Commercial Center Phase II, No. 9285, Binhe Boulevard, Shangsha Community, Shatou Street, Futian District, Shenzhen, Guangdong

Patentee after: Shenzhen Youjia Innovation Technology Co.,Ltd.

Address before: 518049 401, building 1, Shenzhen new generation industrial park, No. 136, Zhongkang Road, Meidu community, Meilin street, Futian District, Shenzhen, Guangdong Province

Patentee before: SHENZHEN MINIEYE INNOVATION TECHNOLOGY Co.,Ltd.
