CN117216250A - Concept word screening method, device, computer equipment and storage medium

Info

Publication number: CN117216250A
Application number: CN202210616004.7A
Authority: CN (China)
Legal status: Pending
Prior art keywords: candidate, concept, concept words, positive class, words
Inventors: 吴钟强, 王佩璐, 赵启, 郭奇
Applicant and current assignee: Tencent Technology Shenzhen Co Ltd
Classification landscape: Machine Translation (AREA)
Abstract

The present application relates to a concept word screening method, apparatus, computer device, computer-readable storage medium, and computer program product. The method includes the following steps: acquiring a candidate text; parsing the grammar of the candidate text, and determining candidate concept words in the candidate text that meet grammar conditions; performing positive class probability prediction on the candidate concept words based on a target neural network model to obtain the positive class probabilities of the candidate concept words, the target neural network model being trained on positive class samples composed of concept words and negative class samples composed of non-concept words that satisfy concept word culling conditions; and screening, from the candidate concept words, the concept words that meet a positive class probability condition. A concept word characterizes an entity set, and the entity set includes at least two entity objects having a common characteristic. The method is not limited by the specific type of the candidate text, which helps expand the application scenarios of concept word screening.

Description

Concept word screening method, device, computer equipment and storage medium
Technical Field
The present application relates to the field of computer application technology, and in particular to a concept word screening method, apparatus, computer device, computer-readable storage medium, and computer program product.
Background
A concept word characterizes an entity set, where the entity set includes at least two entity objects having a common characteristic. Concept words are an important means by which people understand the world and things, and they are widely applied in numerous scenarios such as question answering, search, and reading comprehension.
The traditional concept word screening method predefines a seed template containing a seed text, matches the seed template against candidate texts, and removes the seed text from each matching candidate text to obtain a concept word. For example, if the seed template is "XXX ranking list" and the candidate text is "new energy automobile ranking list", the obtained concept word is "new energy automobile". Because this method relies on the seed template and must remove the seed text contained in the seed template from the candidate text, it cannot handle the case in which the candidate text itself is already a concept word. The traditional concept word screening method is therefore limited in its application scenarios.
Disclosure of Invention
In view of the foregoing, it is desirable to provide a concept word screening method, apparatus, computer device, computer-readable storage medium, and computer program product that can expand the application scenarios of concept word screening.
In a first aspect, the present application provides a concept word screening method. The method includes the following steps:
acquiring a candidate text;
parsing the grammar of the candidate text, and determining candidate concept words in the candidate text that meet grammar conditions;
performing positive class probability prediction on the candidate concept words based on a target neural network model to obtain the positive class probabilities of the candidate concept words, the target neural network model being trained on positive class samples composed of concept words and negative class samples composed of non-concept words that satisfy concept word culling conditions; and
screening, from the candidate concept words, the concept words that meet a positive class probability condition, where a concept word characterizes an entity set, and the entity set includes at least two entity objects having a common characteristic.
In a second aspect, the present application further provides a concept word screening apparatus. The apparatus includes:
an acquisition module, configured to acquire a candidate text;
a candidate concept word determining module, configured to parse the grammar of the candidate text and determine candidate concept words in the candidate text that meet grammar conditions;
a positive class probability prediction module, configured to perform positive class probability prediction on the candidate concept words based on a target neural network model to obtain the positive class probabilities of the candidate concept words, the target neural network model being trained on positive class samples composed of concept words and negative class samples composed of non-concept words that satisfy concept word culling conditions; and
a concept word determining module, configured to screen, from the candidate concept words, the concept words that meet a positive class probability condition, where a concept word characterizes an entity set, and the entity set includes at least two entity objects having a common characteristic.
In a third aspect, the present application further provides a computer device. The computer device includes a memory storing a computer program and a processor that, when executing the computer program, performs the following steps:
acquiring a candidate text;
parsing the grammar of the candidate text, and determining candidate concept words in the candidate text that meet grammar conditions;
performing positive class probability prediction on the candidate concept words based on a target neural network model to obtain the positive class probabilities of the candidate concept words, the target neural network model being trained on positive class samples composed of concept words and negative class samples composed of non-concept words that satisfy concept word culling conditions; and
screening, from the candidate concept words, the concept words that meet a positive class probability condition, where a concept word characterizes an entity set, and the entity set includes at least two entity objects having a common characteristic.
In a fourth aspect, the present application further provides a computer-readable storage medium having stored thereon a computer program that, when executed by a processor, performs the following steps:
acquiring a candidate text;
parsing the grammar of the candidate text, and determining candidate concept words in the candidate text that meet grammar conditions;
performing positive class probability prediction on the candidate concept words based on a target neural network model to obtain the positive class probabilities of the candidate concept words, the target neural network model being trained on positive class samples composed of concept words and negative class samples composed of non-concept words that satisfy concept word culling conditions; and
screening, from the candidate concept words, the concept words that meet a positive class probability condition, where a concept word characterizes an entity set, and the entity set includes at least two entity objects having a common characteristic.
In a fifth aspect, the present application further provides a computer program product. The computer program product includes a computer program that, when executed by a processor, implements the following steps:
acquiring a candidate text;
parsing the grammar of the candidate text, and determining candidate concept words in the candidate text that meet grammar conditions;
performing positive class probability prediction on the candidate concept words based on a target neural network model to obtain the positive class probabilities of the candidate concept words, the target neural network model being trained on positive class samples composed of concept words and negative class samples composed of non-concept words that satisfy concept word culling conditions; and
screening, from the candidate concept words, the concept words that meet a positive class probability condition, where a concept word characterizes an entity set, and the entity set includes at least two entity objects having a common characteristic.
With the above concept word screening method, apparatus, computer device, computer-readable storage medium, and computer program product, a candidate text is acquired, its grammar is parsed, and the candidate concept words meeting the grammar conditions are screened from the candidate text. Positive class probability prediction is then performed on the candidate concept words based on the target neural network model, and the concept words meeting the positive class probability condition, which characterize entity sets, are screened from the candidate concept words. Because the candidate concept words in the candidate text are first determined through grammar parsing, and the concept words are then screened through positive class probability prediction based on the target neural network model, the method is not limited by the specific type of the candidate text, which helps expand the application scenarios of concept word screening.
Drawings
FIG. 1 is a diagram of an application environment for a concept word screening method in one embodiment;
FIG. 2 is a flow diagram of a concept word screening method in one embodiment;
FIG. 3 is a flow diagram of obtaining positive class probabilities of candidate concept words based on a CNN model in one embodiment;
FIG. 4 is a flow diagram of a concept word screening method in another embodiment;
FIG. 5 is a flow diagram of determining alternative texts from candidate texts in one embodiment;
FIG. 6 is a flow diagram of determining concept words from candidate concept words based on cascaded first and second sub-models in one embodiment;
FIG. 7 is a flow diagram of a concept word screening method in yet another embodiment;
FIG. 8 is a flow diagram of concept word screening based on a cascaded three-level discriminator in one embodiment;
FIG. 9 is a block diagram of a concept word screening apparatus in one embodiment;
FIG. 10 is an internal structure diagram of a computer device in one embodiment.
Detailed Description
The present application will be described in further detail below with reference to the drawings and embodiments, in order to make the objects, technical solutions, and advantages of the present application more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the application.
In one embodiment, the concept word screening method provided by the present application can be applied to the application environment shown in FIG. 1, in which the terminal 102 communicates with the server 104 via a network. A data storage system may store the data that the server 104 needs to process; it may be integrated on the server 104, or located on the cloud or on other servers. The source of the candidate text is not limited: it may be uploaded to the server 104 by the user via the terminal 102, or obtained by the server 104 from the data storage system. Specifically, in the process of concept word screening, the server 104 obtains a candidate text, parses its grammar, and determines the candidate concept words in the candidate text that meet the grammar conditions; it then performs positive class probability prediction on the candidate concept words based on a target neural network model trained on positive class samples composed of concept words and negative class samples composed of non-concept words satisfying the concept word culling conditions, obtains the positive class probabilities of the candidate concept words, and screens out the concept words meeting the positive class probability condition, which characterize entity sets.
In one embodiment, where the computing and processing capability of the terminal 102 meets the requirements, the concept word screening method provided by the present application may involve only the terminal 102. Specifically, the terminal 102 acquires the candidate texts and screens the concept words from the candidate texts.
The terminal 102 includes, but is not limited to, desktop computers, notebook computers, smartphones, tablet computers, Internet of Things devices, and portable wearable devices. The Internet of Things devices may be smart speakers, smart televisions, smart air conditioners, smart vehicle-mounted devices, and the like. The portable wearable devices may be smart watches, smart bracelets, headsets, and the like. The server 104 may be an independent physical server, a server cluster or distributed system composed of multiple physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDNs, and big data and artificial intelligence platforms. The terminal 102 and the server 104 may be connected directly or indirectly through wired or wireless communication, which is not limited in the present application.
In one embodiment, as shown in FIG. 2, a concept word screening method is provided. This embodiment is described using the method applied to the server 104 as an example; it is understood that the method may also be applied to the terminal 102, or to a system including the terminal 102 and the server 104 and implemented through their interaction. In this embodiment, the method includes the following steps:
Step S201, a candidate text is acquired.
The candidate text refers to text that serves as a candidate in the concept word screening process; that is, the purpose of concept word screening is to screen concept words out of candidate texts. The language, textual composition, and professional field of the candidate text are not unique. The language of the candidate text may be Chinese, English, Japanese, and so on. The candidate text may consist of a single character, a word composed of multiple characters, or a phrase or sentence composed of multiple words. Taking Chinese as an example, the candidate text may be a single character such as "goose", "cloud", or "rain"; a word composed of multiple characters such as "continent" or "mobile phone"; or a phrase or sentence composed of multiple words such as "mammal", "smartphone", "Four Heavenly Kings", or "how is a smartphone designed". The professional field of the candidate text may be the economic field, the manufacturing field, and so on. In short, the application is not limited to a particular type of candidate text.
Specifically, the candidate text may come from the terminal where the user is located, or from the server's own data storage system. For example, the candidate text may be uploaded to the server by the user through the terminal, and the server obtains the candidate text and screens the concept words from it. Further, the server may obtain the candidate text either actively or passively.
Step S203, parsing the grammar of the candidate text, and determining candidate concept words in the candidate text that meet the grammar conditions.
Grammar is the branch of linguistics that studies the composition and inflection of words and the structural rules and types of phrases and sentences. Parsing the candidate text is the process of analyzing it to obtain its grammar feature information. The grammar feature information may specifically include at least one of character feature information, lexical feature information, and syntactic feature information. The character feature information may include at least one of the characters constituting the candidate text and the position of each character in the candidate text; the lexical feature information may include at least one of the candidate words constituting the candidate text, the part of speech of each candidate word, the dependency relationships between candidate words, and the position of each candidate word in the candidate text; the syntactic feature information may include at least one of the sentence structure and the sentence use of the candidate text. The sentence structure may include simple sentences, compound sentences, complex sentences, and the like, and the sentence use may include declarative sentences, interrogative sentences, imperative sentences, exclamatory sentences, and the like.
Further, the grammar conditions may include at least one of two categories: grammar matching conditions and grammar culling conditions. A candidate text meeting the grammar conditions means that the candidate text satisfies the grammar matching conditions and/or does not satisfy the grammar culling conditions. For example, where the grammar conditions include the grammar culling condition "the candidate text is an interrogative sentence", the candidate text "why do seasons occur" does not meet the grammar conditions; where the grammar conditions include the grammar matching condition "the candidate text contains an adverb", the candidate text "Four Heavenly Kings" meets the grammar conditions. The grammar conditions may comprise a single grammar feature condition or a combination of multiple grammar feature conditions, and the specific types of these grammar feature conditions may be the same or different. A grammar feature condition is a grammar condition set based on grammar feature information, and its specific type may be a character composition condition, a part-of-speech sequence condition, a syntactic condition, and so on. In a specific embodiment, the character composition condition may include at least one of "the candidate text contains the target character" and "the position of the target character in the candidate text is the target position"; the part-of-speech sequence condition may include at least one of "the candidate text contains a candidate word of the target part of speech" and "the position of the candidate word of the target part of speech in the candidate text is the target position"; the syntactic condition may include at least one of "the candidate text is a sentence of the target use" and "the candidate text is a sentence of the target structure"; and so on.
It can be understood that, when the candidate text is a phrase or sentence, the process of parsing the candidate text includes: performing word segmentation on the candidate text to obtain a word segmentation result, and obtaining the grammar feature information of the candidate text based on the word segmentation result. Word segmentation is the process of splitting the candidate text into multiple candidate words; correspondingly, the word segmentation result comprises the set of candidate words constituting the candidate text. For example, the candidate text "why do seasons occur" may be segmented into "why", "do", "occur", and "seasons". Further, different types of candidate text may employ different word segmentation methods. For example, English can be segmented by the spaces between words; Chinese can be segmented using natural language processing (NLP) algorithms; and for a specific professional field, a word segmentation algorithm can be designed according to the characteristics of candidate texts in that field. Correspondingly, the server can select the appropriate word segmentation method according to the specific type of the candidate text, and segment the candidate text based on that method to obtain the word segmentation result.
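To make the word segmentation step concrete, the following is a minimal sketch, assuming the open-source jieba segmenter for Chinese and whitespace splitting for English; the patent does not name a specific segmentation algorithm, so the library choice and the sample outputs are illustrative assumptions only.

import jieba  # assumed segmenter; any NLP word segmentation algorithm works here

def segment(candidate_text: str, language: str = "zh") -> list:
    """Split a candidate text into its candidate words."""
    if language == "en":
        # English: split on the spaces between words.
        return candidate_text.split()
    # Chinese: use an NLP segmentation algorithm (jieba, as an assumption).
    return jieba.lcut(candidate_text)

# e.g. segment("为什么会产生季节") -> ['为什么', '会', '产生', '季节'] (output may vary by version)
# e.g. segment("why seasons occur", "en") -> ['why', 'seasons', 'occur']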
Specifically, the server parses the candidate text to obtain its grammar feature information, compares the grammar feature information with the grammar conditions to determine the feature comparison result of the candidate text, and thereby determines the candidate concept words in the candidate text that meet the grammar conditions. Further, the server may, according to the specific types of grammar feature conditions included in the grammar conditions, extract from the parsing result only the grammar feature information corresponding to those conditions and compare it with the corresponding conditions, so as to improve efficiency. For example, where the grammar conditions include a character composition condition and a syntactic condition, the server obtains the character feature information and syntactic feature information of the candidate text from the parsing result, compares the character feature information with the character composition condition, compares the syntactic feature information with the syntactic condition, and determines the feature comparison result of the candidate text.
It can be appreciated that, when the candidate text is a sentence, the grammar conditions may also include grammar extraction conditions, which may include at least one of a character composition condition and a part-of-speech sequence condition. The server may parse the candidate texts, first determine the candidate texts that satisfy the grammar matching conditions and do not satisfy the grammar culling conditions, and then, on that basis, extract the candidate concept words that satisfy the grammar extraction conditions. For example, if the grammar matching condition is "the candidate text contains an adjective and a noun" and the grammar extraction condition is "a phrase composed of an adjective and a noun", the server may determine the candidate text "how is a smartphone designed" as a candidate text satisfying the grammar matching condition, and further extract from it the candidate concept word "smartphone" satisfying the grammar extraction condition. For ease of understanding, the following description treats the candidate text as a word or phrase.
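As an illustration of applying a grammar matching condition followed by a grammar extraction condition, here is a small sketch over part-of-speech tagged words; the tag set (a = adjective, n = noun) and the example tagging are assumptions.

def extract_candidate_concepts(tagged):
    """tagged: list of (word, pos) pairs for one candidate text."""
    tags = [pos for _, pos in tagged]
    # Grammar matching condition: the text contains an adjective and a noun.
    if "a" not in tags or "n" not in tags:
        return []
    # Grammar extraction condition: a phrase made of an adjective
    # followed by one or more nouns.
    phrases, i = [], 0
    while i < len(tagged):
        if tagged[i][1] == "a":
            j = i + 1
            while j < len(tagged) and tagged[j][1] == "n":
                j += 1
            if j > i + 1:
                phrases.append("".join(word for word, _ in tagged[i:j]))
            i = j
        else:
            i += 1
    return phrases

# "how is a smartphone designed" (智能手机是如何设计的), tagged as assumed
# below, yields the candidate concept word 智能手机 ("smartphone").
print(extract_candidate_concepts(
    [("智能", "a"), ("手机", "n"), ("是", "v"), ("如何", "r"), ("设计", "v"), ("的", "u")]))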
Step S205, performing positive class probability prediction on the candidate concept words based on the target neural network model to obtain the positive class probabilities of the candidate concept words.
The target neural network model is a neural network model trained on positive class samples composed of concept words and negative class samples composed of non-concept words that satisfy the concept word culling conditions. It can be understood that both the positive and negative class samples carry sample labels of the corresponding type: the positive class samples consist of concept words and carry positive class labels, while the negative class samples consist of non-concept words and carry negative class labels. The specific type of concept word culling condition is not unique; it may include, for example, at least one of a grammar culling condition and a semantic culling condition. A grammar culling condition is a culling condition determined based on grammar feature information; for example, it may be "the candidate concept word ends with a verb or an adjective" or "the position of the target character in the candidate text is the target position".
Further, the grammar culling condition used for the negative class samples may be the same as or different from the grammar culling condition used to determine the candidate concept words in step S203. A semantic culling condition is a culling condition set based on semantic features; for example, it may be "the candidate concept word carries a question-answering intention". In addition, the specific network structure of the neural network model is not unique; it may include, for example, at least one of a convolutional neural network (CNN), a recurrent neural network (RNN), a Long Short-Term Memory (LSTM) network, and a Transformer network.
Further, the target neural network model may comprise only one neural network model, or may comprise a plurality of different sub-models. The network structures of the sub-models may be the same or different, and their training parameters may be the same or different. It can be appreciated that, for two sub-models with the same network structure to differ, their training parameters, including at least one of the training samples and the loss function, must differ. In one embodiment, the target neural network model includes a plurality of cascaded sub-models: the input of the first-level sub-model is the candidate concept words, and the input of each subsequent sub-model is the alternative concept words that satisfy the sub positive class probability condition corresponding to the previous-level sub-model.
Specifically, the server performs positive class probability prediction on the candidate concept words based on the target neural network model to obtain the positive class probability of each candidate concept word. The positive class probability of a candidate concept word, predicted by the target neural network model, is the probability that the candidate concept word is a concept word. It can be appreciated that if a candidate concept word is included in the positive class samples used to train the target neural network model, its positive class probability is 1; if it is included in the negative class samples, its positive class probability is 0; in other cases, its positive class probability is between 0 and 1.
In one embodiment, step S205 includes: extracting features of the candidate concept words; and performing feature matching analysis on the features of the candidate concept words to obtain their positive class probabilities.
The features of a candidate concept word are information that can characterize its characteristics, and may include at least one of text features, which characterize character composition, and semantic features, which characterize semantics. It can be understood that training the target neural network model on positive class samples composed of concept words and negative class samples composed of non-concept words is equivalent to learning the reference features of concept words and non-concept words.
Specifically, the server extracts the features of the candidate concept words, performs feature matching analysis between these features and the learned reference features to obtain the positive class probabilities of the candidate concept words, and then screens out the concept words whose positive class probability meets the positive class probability condition. The feature matching analysis that yields the positive class probability may be a process of computing the feature similarity between the features of a candidate concept word and the reference features and determining the positive class probability based on that similarity, or a process of mapping the features into a feature space characterized by the reference features and determining the positive class probability based on the mapping result. In FIG. 3, the embedding layer converts the candidate concept words into word vectors, which are then fed into the convolutional neural network model and a softmax function for feature matching analysis, yielding the positive class probabilities of the candidate concept words. In this embodiment, extracting the features of the candidate concept words first removes redundant information and achieves dimensionality reduction; performing feature matching analysis on that basis improves efficiency while ensuring accuracy.
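A minimal PyTorch sketch of the FIG. 3 pipeline (embedding layer, convolutional neural network, softmax) follows; the vocabulary size, embedding dimension, filter count, and kernel size are assumptions, not the patent's actual architecture.

import torch
import torch.nn as nn

class ConceptCNN(nn.Module):
    def __init__(self, vocab_size=10000, embed_dim=128, num_filters=64, kernel_size=3):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.conv = nn.Conv1d(embed_dim, num_filters, kernel_size, padding=1)
        self.fc = nn.Linear(num_filters, 2)  # two classes: non-concept / concept

    def forward(self, token_ids):                     # (batch, seq_len)
        x = self.embedding(token_ids)                 # (batch, seq_len, embed_dim)
        x = torch.relu(self.conv(x.transpose(1, 2)))  # (batch, filters, seq_len)
        x = x.max(dim=2).values                       # global max pooling
        return self.fc(x)                             # class logits

model = ConceptCNN()
logits = model(torch.randint(1, 10000, (4, 8)))            # 4 candidate words, 8 tokens each
positive_class_prob = torch.softmax(logits, dim=1)[:, 1]   # probability of being a concept word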
Step S207, screening, from the candidate concept words, the concept words that meet the positive class probability condition.
A concept word characterizes an entity set comprising at least two entity objects having a common characteristic. An entity object may be a person, an animal, a plant, an industrial product, and so on. For example, "the seven continents" characterizes the entity set composed of entity objects that are continents of the earth, such as "Asia", "Africa", and "North America"; "goose" characterizes the entity set composed of animal entity objects in the goose subfamily of the duck family, such as the swan goose, the greylag goose, and the bean goose; "smartphone" characterizes the entity set of mobile phones that have an independent operating system and independent running space, allow users to install software, and provide wireless network access. The concept words may include at least one of entity descriptions that are themselves concept words, concept words constructed from entity attributes of a knowledge graph, and manually annotated concept words. For example, "the seven continents" and "Chinese actors" are entity descriptions that are concept words, while "the works of composer A" and "the reference books of course B" are concept words constructed from entity attributes of a knowledge graph.
Specifically, after the server obtains the positive class probabilities of the candidate concept words, it compares them with the positive class probability condition and screens out the concept words that meet the condition. Further, the positive class probability condition may be that the positive class probability is greater than a probability threshold, or greater than or equal to the probability threshold. In addition, when the target neural network model includes a plurality of sub-models, each sub-model performs its own positive class probability prediction to obtain a corresponding sub positive class probability. On this basis, a candidate concept word meeting the positive class probability condition means either that it satisfies the sub positive class probability conditions corresponding to all sub-models, or that the probability obtained by a mathematical operation over all the sub positive class probabilities satisfies the positive class probability condition. The mathematical operation may include one or a combination of averaging, taking the maximum, and weighted summation.
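The positive class probability condition and the aggregation of sub positive class probabilities can be sketched as follows; the threshold value and the aggregation modes' defaults are assumptions for illustration.

def meets_condition(prob, threshold=0.5):
    # "greater than or equal to the threshold"; the strict ">" variant is also described
    return prob >= threshold

def aggregate(sub_probs, mode="mean", weights=None):
    if mode == "mean":
        return sum(sub_probs) / len(sub_probs)
    if mode == "max":
        return max(sub_probs)
    if mode == "weighted":
        return sum(w * p for w, p in zip(weights, sub_probs))
    raise ValueError(mode)

# A candidate passes either when every sub positive class probability meets its
# own sub-condition, or when the aggregated probability meets the overall condition.
sub_probs = [0.9, 0.7]
print(all(meets_condition(p) for p in sub_probs))     # True
print(meets_condition(aggregate(sub_probs, "mean")))  # 0.8 -> True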
With the above concept word screening method, the server first acquires the candidate text, parses its grammar, and screens out the candidate concept words that meet the grammar conditions. It then performs positive class probability prediction on the candidate concept words based on the target neural network model and screens out the concept words that meet the positive class probability condition, which characterize entity sets. Because the candidate concept words are first determined through grammar parsing and the concept words are then screened through positive class probability prediction, the method is not limited by the specific type of the candidate text, which helps expand the application scenarios of concept word screening. Further, this cascaded approach, which determines the candidate concept words through grammar parsing and then screens the concept words from them with the target neural network model, reduces both the complexity of the target neural network model and the number of candidate concept words fed into it, improving data processing efficiency while ensuring screening accuracy.
As described above, the grammar conditions may be a combination of at least two grammar feature conditions. In that case, the server may compare the grammar feature information of the same candidate text against each grammar feature condition one by one and determine the candidate concept words that meet all of them simultaneously. Alternatively, the server may compare sequentially: compare the grammar feature information of the candidate texts with the first grammar feature condition, discard the candidate texts that do not meet it, then compare the grammar feature information of the remaining candidate texts with the second grammar feature condition, and so on, until the last grammar feature condition has been checked.
In one embodiment, the grammar conditions include a character composition condition and a part-of-speech sequence condition. In this case, as shown in FIG. 4, step S203 includes:
Step S402, screening, from the candidate texts, the alternative texts that meet the character composition condition.
The character composition condition is a grammar condition set based on the character composition of the candidate text, and may include at least one of "the candidate text contains the target character" and "the position of the target character in the candidate text is the target position". Specifically, based on the parsing result of the candidate text, the server can determine character feature information such as the characters constituting the candidate text and the position of each character in the candidate text, compare the character feature information with the character composition condition to determine the character comparison result of the candidate text, and thereby screen out, from the candidate texts, the alternative texts that meet the character composition condition.
The number of character composition conditions is not limited: a single character composition condition may be set, or a plurality of different ones. In one embodiment, the character composition conditions include a first character composition condition and a second character composition condition. In this case, step S402 includes: screening, from the candidate texts, the first candidate texts that meet the first character composition condition; screening, from the first candidate texts, the second candidate texts that meet the second character composition condition; and determining the second candidate texts as the alternative texts.
The first character composition condition and the second character composition condition may be of the same kind or of different kinds. For example, when the first character composition condition is "the candidate text contains the first target character", the second character composition condition may be "the candidate text contains the second target character", or "the position of the second target character in the candidate text is the target position".
Specifically, the server may obtain the character feature information of the candidate texts from the parsing result, compare it with the first character composition condition to screen out the first candidate texts, then, based on the character feature information of the first candidate texts, screen out the second candidate texts that meet the second character composition condition, and determine the second candidate texts as the alternative texts.
It can be appreciated that, in other embodiments, further character composition conditions may be set and the screening performed sequentially in a cascade until the alternative texts are obtained. As shown in FIG. 5, three character composition conditions are cascaded in order, and each level's character composition condition is a grammar culling condition: the first-level condition is "the candidate text contains the first target character", the second-level condition is "the candidate text ends with the second target character", and the third-level condition is "the initial character of the candidate text is the third target character". All candidate texts pass through the first level, where the server checks each candidate text against the first-level condition; if a text meets it, the text is discarded, otherwise it enters the second level. In the second level, texts meeting the second-level condition are discarded and the rest enter the third level. In the third level, the server discards the texts meeting the third-level condition and keeps those that do not as the alternative texts.
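A sketch of the FIG. 5 three-level cascade of character composition culling conditions follows; the concrete target characters are hypothetical placeholders.

FIRST_TARGET, SECOND_TARGET, THIRD_TARGET = "吗", "了", "请"  # assumed example characters

def passes_cascade(text):
    if FIRST_TARGET in text:           # level 1: discard if it contains the first target
        return False
    if text.endswith(SECOND_TARGET):   # level 2: discard if it ends with the second target
        return False
    if text.startswith(THIRD_TARGET):  # level 3: discard if it starts with the third target
        return False
    return True

candidates = ["新能源汽车", "你吃饭了吗", "请坐下"]
alternative_texts = [t for t in candidates if passes_cascade(t)]
print(alternative_texts)  # ['新能源汽车']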
In the above embodiment, the multiple character composition conditions are checked sequentially in a cascade, which reduces the workload of checking every condition after the first and helps improve efficiency.
Step S403, performing part-of-speech tagging on the alternative texts to obtain the part-of-speech sequence of each alternative text.
Step S404, screening, from the alternative texts, the candidate concept words whose part-of-speech sequences meet the part-of-speech sequence condition.
Part-of-speech tagging, also called grammatical tagging or part-of-speech disambiguation, is a text processing technique that labels each candidate word constituting a text with its part of speech according to its meaning and context. The specific tagging method may be rule-based, based on statistical models, or based on a combination of statistics and rules; in short, this embodiment does not limit the tagging method. Further, a part-of-speech sequence is a sequence of part-of-speech tags that represents the part of speech of each candidate word in the text and the position of each candidate word of that part of speech. For example, tagging the text "which countries are in each of the seven continents" yields a tagging result along the lines of "seven continents/n respectively/d have/v which/r countries/n", and the corresponding part-of-speech sequence is "/n/d/v/r/n", where n denotes a noun, d an adverb, v a verb, and r a pronoun. The part-of-speech sequence condition may include at least one of "the candidate text contains a candidate word of the target part of speech" and "the position of the candidate word of the target part of speech in the candidate text is the target position".
Specifically, the server tags the parts of speech of the alternative texts to obtain their part-of-speech sequences, compares the sequences with the part-of-speech sequence condition, screens out the alternative texts whose sequences meet the condition, and determines them as candidate concept words. It can be appreciated that, when an alternative text meeting the part-of-speech sequence condition is a sentence, the candidate concept words meeting the grammar extraction conditions still need to be extracted from it.
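The part-of-speech tagging and sequence check can be sketched as follows, assuming jieba's posseg tagger, whose tag set (n = noun, d = adverb, v = verb, r = pronoun) matches the labels in the example above; the concrete sequence condition is an assumption.

import jieba.posseg as pseg  # assumed tagger; any part-of-speech tagging method works

def pos_tags(text):
    """Return the list of part-of-speech tags for a text."""
    return [flag for _, flag in pseg.lcut(text)]

def meets_pos_condition(tags):
    # Assumed condition: no verbs, and the last candidate word is a noun.
    return bool(tags) and "v" not in tags and tags[-1] == "n"

alternative_texts = ["新能源汽车", "七大洲分别有哪些国家"]
candidate_concepts = [t for t in alternative_texts if meets_pos_condition(pos_tags(t))]
print(candidate_concepts)  # expected: ['新能源汽车'] (tagging may vary by jieba version)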
In other embodiments, the order may be reversed: part-of-speech tagging is first performed on the candidate texts to obtain their part-of-speech sequences, the candidate texts whose sequences meet the part-of-speech sequence condition are screened out, and the candidate concept words meeting the character composition condition are then screened from those texts.
In the above embodiment, the candidate concept words are obtained by considering both the character composition condition and the part-of-speech sequence condition, which amounts to considering multiple types of grammar feature information during screening and improves the accuracy of the candidate concept word screening result; checking the grammar conditions sequentially in a cascade also helps improve efficiency.
As noted above, the target neural network model may comprise a single neural network model or a plurality of different sub-models. When it comprises a plurality of different sub-models, each with its own sub positive class probability condition, the server may perform positive class probability prediction on the same candidate concept word with every sub-model and determine the concept words that satisfy all the sub positive class probability conditions simultaneously. Alternatively, the server may use the current sub-model to cull the candidate concept words that do not satisfy its sub positive class probability condition, and then perform positive class probability prediction on the remaining candidates with the next sub-model.
In one embodiment, the target neural network model includes a plurality of cascaded sub-models. In this case, referring again to FIG. 4, step S205 includes step S405: performing sub positive class probability prediction on the candidate concept words through the plurality of cascaded sub-models to obtain the sub positive class probabilities of the candidate concept words. Step S207 includes step S407: screening, from the candidate concept words, the concept words that simultaneously satisfy the sub positive class probability conditions corresponding to all the sub-models.
Each sub-model is a neural network model trained on positive class samples composed of concept words and negative class samples that satisfy the corresponding sub concept word culling condition. For the specific definitions of the sub concept word culling condition and the sub positive class probability condition, refer to the definitions of the concept word culling condition and the positive class probability condition above; details are not repeated here. The input of the first-level sub-model is the candidate concept words, and the input of each subsequent sub-model is the alternative concept words that satisfy the sub positive class probability condition of the previous level. Consequently, the number of sub positive class probabilities obtained differs between candidate concept words. For example, a candidate concept word that does not satisfy the sub positive class probability condition of the first-level sub-model is not passed on as an alternative concept word to the next level, so it has only one sub positive class probability; a candidate concept word that satisfies the conditions of all sub-models is scored by every level, so it has as many sub positive class probabilities as there are sub-models.
Specifically, the server performs positive class probability prediction on the candidate concept words with the first-level sub-model to obtain the alternative concept words that satisfy the first sub positive class probability condition, then performs positive class probability prediction on those alternative concept words with the second-level sub-model, and so on, until the last-level sub-model completes its check, yielding the concept words that satisfy the sub positive class probability condition of the last level. This reduces, on the one hand, the complexity of each sub-model and, on the other hand, the amount of alternative concept words each sub-model must process, so screening efficiency improves while screening accuracy is ensured.
In one embodiment, the target neural network model includes a first-level sub-model and a second-level sub-model. In this case, step S205 includes: performing first positive class probability prediction on the candidate concept words based on the first-level sub-model, and determining the alternative concept words among the candidate concept words that satisfy the first sub positive class probability condition; and performing second positive class probability prediction on the alternative concept words based on the second-level sub-model to determine the second positive class probabilities of the alternative concept words.
Here, the negative class samples used to train the second-level sub-model cover more sample categories than those used to train the first-level sub-model; that is, the concept word culling conditions corresponding to the second-level sub-model cover more categories than those corresponding to the first-level sub-model. Specifically, the server first performs first positive class probability prediction on a candidate concept word based on the first-level sub-model; if the candidate concept word satisfies the first sub positive class probability condition, it is determined to be an alternative concept word, otherwise it is discarded. The server then performs second positive class probability prediction on the alternative concept word based on the second-level sub-model to obtain its second positive class probability. Further, after obtaining the second positive class probability, the server may determine the alternative concept word to be a concept word if it satisfies the second sub positive class probability condition, and discard it otherwise. It can be appreciated that, because the negative class samples for the second-level sub-model cover more sample categories than those for the first-level sub-model, the complexity and accuracy of the second-level sub-model are higher than those of the first-level sub-model, but its processing speed is slower. With this scheme, the relatively fast first-level sub-model handles more data than the second-level sub-model, balancing processing speed and accuracy and helping make the concept word screening method more sound.
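A sketch of the two-level cascade follows: a fast first-level sub-model filters the candidates, and the slower, more accurate second-level sub-model scores only the survivors. The probability callables and thresholds below are placeholders standing in for trained models.

def cascade_screen(candidates, level1_prob, level2_prob, t1=0.5, t2=0.5):
    """level1_prob / level2_prob map a word to a positive class probability."""
    # First level: fast text-feature model over all candidate concept words.
    alternatives = [w for w in candidates if level1_prob(w) >= t1]
    # Second level: accurate semantic model over the alternative concept words only.
    return [w for w in alternatives if level2_prob(w) >= t2]

concepts = cascade_screen(
    ["新能源汽车", "为什么"],
    level1_prob=lambda w: 0.9 if len(w) > 3 else 0.1,  # stand-in for the text prediction model
    level2_prob=lambda w: 0.8,                         # stand-in for the semantic model
)
print(concepts)  # ['新能源汽车']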
It should be noted that, before positive class probability prediction is performed on the candidate concept words based on the target neural network model to determine the concept words satisfying the positive class probability condition, the target neural network model needs to be trained. In one embodiment, the target neural network model includes a text prediction model, and the process of training the text prediction model includes: acquiring first positive class samples and first negative class samples; and performing model training with the first positive class samples and the first negative class samples to obtain a text prediction model for text feature matching analysis.
The text prediction model may be any one of a CNN model, an RNN model, and an LSTM model. The first positive class samples include at least one of entity descriptions that are concept words, concept words constructed from entity attributes of a knowledge graph, and manually annotated concept words; the first negative class samples include non-concept words that satisfy the grammar culling conditions. Specifically, the server may obtain the first positive class samples and first negative class samples sent by a developer through the terminal, perform model training with them, and obtain the text prediction model for text feature matching analysis when the model loss function is minimized. The model loss function may be any one or a combination of a relative entropy loss, a cross-entropy loss, and a weighted cross-entropy loss.
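A minimal training sketch for the text prediction model follows, reusing the ConceptCNN class from the earlier sketch with a cross-entropy loss and an Adam optimizer; the toy batch, label assignment, and step count are assumptions, and tokenization, batching, and early stopping are omitted.

import torch
import torch.nn as nn

model = ConceptCNN()  # the class defined in the earlier sketch
criterion = nn.CrossEntropyLoss()  # cross-entropy loss, one of the options named above
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

token_ids = torch.randint(1, 10000, (32, 8))  # a toy batch of tokenized samples
labels = torch.cat([torch.ones(16), torch.zeros(16)]).long()  # 1 = positive, 0 = negative class

for step in range(100):  # train until the loss is (approximately) minimized
    optimizer.zero_grad()
    loss = criterion(model(token_ids), labels)
    loss.backward()
    optimizer.step()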
In one embodiment, the target neural network model includes a semantic characterization model, and the process of training the semantic characterization model includes: acquiring second positive class samples and second negative class samples; and performing model training with the second positive class samples and the second negative class samples to obtain a semantic characterization model for semantic feature matching analysis.
The semantic characterization model may be any one of a BERT (Bidirectional Encoder Representations from Transformers) model, an XLNet model, and an ALBERT model. The second positive class samples include at least one of entity descriptions that are concept words, concept words constructed from entity attributes of a knowledge graph, and manually annotated concept words; the second negative class samples include at least one of non-concept words satisfying the grammar culling conditions, non-concept words obtained by changing the character composition of the concept words in the second positive class samples, and entity names. Further, a specific way to change the character composition of a concept word in the second positive class samples may be to remove the character at a target position, where the target position may be the first or the last position, or to exchange the positions of characters within the concept word. For example, second negative class samples derived from the second positive class sample "Chinese actor" include variants of it with a character dropped or two characters swapped. In addition, because the entity set characterized by a concept word includes at least two entity objects having a common characteristic, an entity name designating a single specific entity object must not be regarded as a concept word.
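The following sketch builds negative samples by perturbing the character composition of a positive-class concept word as described above: dropping the first or last character, or exchanging the positions of adjacent characters. Restricting the swaps to adjacent characters is an assumption.

def perturb(concept):
    """Generate negative-sample variants of a concept word."""
    variants = {concept[1:], concept[:-1]}  # drop the first / last character
    for i in range(len(concept) - 1):       # swap adjacent characters
        chars = list(concept)
        chars[i], chars[i + 1] = chars[i + 1], chars[i]
        variants.add("".join(chars))
    variants.discard(concept)               # keep only genuine perturbations
    return variants

# e.g. perturb("中国演员") ("Chinese actor") ->
# {'国演员', '中国演', '国中演员', '中演国员', '中国员演'}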
Specifically, the server may obtain the second positive class samples and second negative class samples sent by a developer through the terminal, perform model training with them, and obtain the semantic characterization model for semantic feature matching analysis when the model loss function is minimized. The model loss function may be any one or a combination of a relative entropy loss, a cross-entropy loss, and a weighted cross-entropy loss.
The above embodiments provide training methods for several categories of the target neural network model, allowing concept words to be screened from different dimensions and improving the accuracy of the concept word screening result.
In one embodiment, as shown in FIG. 7, the concept word screening method includes:
Step S701, acquiring candidate texts;
Step S702, performing word segmentation on the candidate texts to obtain the word segmentation results of the candidate texts;
Step S703, screening, from the candidate texts and based on the word segmentation results, the first candidate texts that meet the first character composition condition;
Step S704, screening, from the first candidate texts and based on their word segmentation results, the second candidate texts that meet the second character composition condition;
Step S705, determining the second candidate texts as the alternative texts;
Step S706, performing part-of-speech tagging on the alternative texts to obtain the part-of-speech sequences of the alternative texts;
Step S707, screening, from the alternative texts, the candidate concept words whose part-of-speech sequences meet the part-of-speech sequence condition;
Step S708, performing first positive class probability prediction on the candidate concept words based on the first-level sub-model, and determining the alternative concept words among the candidate concept words that satisfy the first sub positive class probability condition;
Step S709, performing second positive class probability prediction on the alternative concept words based on the second-level sub-model, and determining the concept words among the alternative concept words that satisfy the second sub positive class probability condition.
The candidate text refers to text serving as a candidate in the concept word screening process; for the definition of its specific types, reference is made to the description above, which is not repeated here. The negative class samples used to train the second-level sub-model cover more sample categories than the negative class samples used to train the first-level sub-model. The first-level sub-model may be a text prediction model for text feature matching analysis; the second-level sub-model may be a semantic characterization model for semantic feature matching analysis. The first positive class sample used to train the first-level sub-model and the second positive class sample used to train the second-level sub-model may each comprise at least one of entity descriptions belonging to concept words, concept words constructed based on entity attributes of a knowledge graph, and manually annotated concept words. The first negative class sample used to train the first-level sub-model and the second negative class sample used to train the second-level sub-model may comprise at least one of non-concept words satisfying the grammar rejection condition, non-concept words obtained by changing the character composition of concept words in the second positive class sample, and entity names.
Specifically, the server first extracts text features of the candidate concept words, performs text feature matching analysis on these features based on the first-level sub-model to obtain the first positive class probabilities of the candidate concept words, and screens from the candidate concept words those whose first positive class probability satisfies the first sub positive class probability condition. Then, the server extracts semantic features of the remaining candidate concept words, performs semantic feature matching analysis on them based on the second-level sub-model to obtain their second positive class probabilities, and screens out the concept words whose second positive class probability satisfies the second sub positive class probability condition.
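For intuition, the two-level cascade just described can be sketched as follows; the scorer callables and the threshold values are placeholders standing in for the two trained sub-models and their positive class probability conditions, not an implementation from the application:

```python
from typing import Callable

def cascade_screen(candidates: list[str],
                   level1_score: Callable[[str], float],   # fast text prediction model
                   level2_score: Callable[[str], float],   # heavier semantic model
                   alpha1: float = 0.8,
                   alpha2: float = 0.8) -> list[str]:
    """A minimal sketch of two-level cascade screening: only candidates
    passing the cheap first level are scored by the expensive second level."""
    survivors = [c for c in candidates if level1_score(c) >= alpha1]
    return [c for c in survivors if level2_score(c) >= alpha2]
```

The point of the ordering is that the expensive model only ever sees the much smaller set of first-level survivors, which is where the efficiency gain reported later in this section comes from.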
According to the above concept word screening method, candidate concept words in the candidate text can be determined by performing grammar analysis on the candidate text; positive class probability prediction is then performed on the candidate concept words based on the cascaded neural network model, and the concept words satisfying the positive class probability condition are screened out. The method is not limited by the specific type of the candidate text, which expands the application scenarios of the concept word screening method. Moreover, by determining concept words in a cascaded manner with discrimination of gradually increasing complexity, the capacity of each level of the neural network model can be matched to its screening task, ensuring the accuracy of concept word screening while improving data processing efficiency.
The concept word mining method provided by the application can be applied to application scenarios such as query word understanding, recommendation, search question answering, and encyclopedia collection products. In one embodiment, the application further provides an application scenario of hot-spot concept word mining that applies the above concept word screening method. In this scenario, the server acquires candidate hot-spot texts, which may be hot query texts within a target time interval obtained from search engine statistics, or popular texts within a target time interval obtained from content interaction platform statistics. The target time interval may be, for example, the past 24 hours, the most recent calendar month, or the most recent year. The server then performs word segmentation on the candidate hot-spot text to obtain a word segmentation result, performs grammar analysis on the candidate hot-spot text based on that result to determine the candidate concept words satisfying the grammar condition, and finally performs positive class probability prediction on the candidate concept words using the target neural network model, trained on a positive class sample composed of concept words and a negative class sample composed of non-concept words satisfying the concept word rejection condition, so as to determine the hot-spot concept words that satisfy the positive class probability condition and characterize an entity set. The entity set includes at least two entity objects having common characteristics. Further, encyclopedia product pages can be created for the obtained hot-spot concept words, landing them in encyclopedia products and realizing a complete production flow of concept mining, hot-spot concept discovery, and landing pages.
In one embodiment, the application further provides an application scenario for query concept word mining that applies the above concept word screening method. In this scenario, as shown in fig. 8, the server discriminates the query words obtained from the user terminal through three cascaded discriminators and screens out the query concept words.
The first-level discriminator performs grammar analysis on the query words and determines the first-level positive class concepts conforming to the grammar conditions. As shown in fig. 8, the grammar conditions specifically include a "blacklisted character rule", a "blacklisted suffix rule", a "blacklisted prefix rule", and a "lexical rule". The "blacklisted character rule" means that if a query word contains a blacklisted character, it is a non-concept word; blacklisted characters may include question-and-answer intent characters (e.g., "why", "what") and comparison-class characters (e.g., "contrast"). The "blacklisted suffix rule" means that if a query word ends with a blacklisted suffix character, it is a non-concept word; blacklisted suffix characters may include information-type suffixes (e.g., "introduction") and question-answer-type suffixes (e.g., "height"). The "blacklisted prefix rule" means that if a query word starts with a blacklisted prefix character, it is a non-concept word; blacklisted prefix characters may include "get" and the like. The "lexical rule" means that only a query word ending with a noun is a candidate concept word; for example, "XXX appointment" (whose final word is not tagged as a noun) is a non-concept word. It is to be understood that the "blacklisted character rule" is a specific example of the character composition condition "the candidate text contains a target character" described above, where the target character is a blacklisted character. The "blacklisted suffix rule" and the "blacklisted prefix rule" are specific examples of the character composition condition "the position of the target character in the candidate text is the target position": for the suffix rule, the target character is a blacklisted suffix character and the target position is the ending position of the candidate text; for the prefix rule, the target character is a blacklisted prefix character and the target position is the starting position. Similarly, the "lexical rule" is a specific example of the part-of-speech condition "the position of a candidate word of the target part of speech in the candidate text is the target position", where the target part of speech is a noun and the target position is the ending position.
Specifically, the server applies the four rules to the query word in sequence. In the first stage, it applies the "blacklisted character rule": if the query word contains a blacklisted character, it is discarded directly; otherwise it enters the second stage. In the second stage, it applies the "blacklisted suffix rule" to the samples retained by the first stage: if a sample ends with a blacklisted suffix character, it is discarded directly; otherwise it enters the third stage. In the third stage, it applies the "blacklisted prefix rule" to the samples retained by the second stage: if a query word starts with a blacklisted prefix character, it is discarded directly; otherwise it enters the fourth stage. In the fourth stage, it applies the "lexical rule" to the samples retained by the third stage: if the query word ends with a noun, it is determined to be a first-level positive class concept and enters the second-level discriminator; otherwise it is discarded directly.
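A minimal sketch of this four-stage rule pipeline is given below; the blacklist entries are illustrative stand-ins, and the use of the jieba part-of-speech tagger for the lexical rule is an assumption, since the application does not name a tagger:

```python
import jieba.posseg as pseg

BLACKLISTED_CHARS = ("why", "what", "contrast")       # illustrative blacklist entries
BLACKLISTED_SUFFIXES = ("introduction", "height")
BLACKLISTED_PREFIXES = ("get",)

def first_level_pass(query: str) -> bool:
    """Apply the four first-level rules in sequence; True means the query
    survives as a first-level positive class concept."""
    if any(c in query for c in BLACKLISTED_CHARS):     # blacklisted character rule
        return False
    if query.endswith(BLACKLISTED_SUFFIXES):           # blacklisted suffix rule
        return False
    if query.startswith(BLACKLISTED_PREFIXES):         # blacklisted prefix rule
        return False
    tokens = list(pseg.cut(query))                     # lexical rule: must end with a noun
    return bool(tokens) and tokens[-1].flag.startswith("n")
```

Because each rule is a constant-time string check, this first level can discard the bulk of query words at negligible cost before any neural model is invoked.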
Further, with continued reference to fig. 8, the second-level discriminator first converts the first-level positive class concept into a word vector using a word embedding layer, and then performs positive class probability prediction through a pre-trained CNN model and a softmax function, obtaining the first positive class probability $p_i$ that the first-level positive class concept belongs to the positive class; the first-level positive class concepts whose first positive class probability satisfies the first positive class probability condition are determined as second-level positive class concepts. That is, the second-level prediction category $\hat{y}_i$ is:

$$\hat{y}_i = \begin{cases} 1, & p_i \geq \alpha \\ 0, & p_i < \alpha \end{cases}$$

where α is greater than 0.5 and less than 1; for example, α may be set to 0.8. After the second-level discriminator, the samples whose prediction category $\hat{y}_i$ equals 1 are retained as second-level positive class concepts.
After the second-level positive class concepts are obtained, the server inputs them into the third-level discriminator, which performs positive class probability prediction based on a BERT model and a softmax function to obtain the second positive class probability that each second-level positive class concept belongs to the positive class; the second-level positive class concepts whose second positive class probability satisfies the second positive class probability condition are then determined as third-level positive class concepts. The second positive class probability condition may be that the second positive class probability is greater than or equal to a second probability threshold; likewise, the second probability threshold is greater than 0.5 and less than 1.
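By way of illustration, the third-level scoring step might look as follows with the Hugging Face transformers library; the fine-tuned checkpoint path is hypothetical, and a binary classification head is assumed:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese")
model = AutoModelForSequenceClassification.from_pretrained(
    "path/to/finetuned-concept-discriminator",  # hypothetical fine-tuned checkpoint
    num_labels=2)
model.eval()

def second_positive_class_probability(candidate: str) -> float:
    """Score one second-level positive class concept with BERT + softmax."""
    inputs = tokenizer(candidate, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    return torch.softmax(logits, dim=-1)[0, 1].item()  # P(positive class)
```

A candidate would then be kept as a third-level positive class concept when this probability reaches the second probability threshold.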
In addition, before the second-level and third-level discriminators are used, training samples need to be constructed, and the corresponding target neural network models are trained on them. The first positive class sample and the first negative class sample of the second-level discriminator are constructed manually. The first positive class sample may include: entity descriptions belonging to concept words, such as "Chinese actor"; concept words constructed from knowledge graph entity attributes, such as "book collection of composer A"; and manually annotated concept words. The first negative class sample may include non-concept words statistically derived from the "blacklisted suffix rule". Specifically, the server acquires the first positive class sample and the first negative class sample, performs model training based on them, and obtains the target neural network model corresponding to the second-level discriminator by minimizing a loss function.
Wherein the loss function is the cross entropy loss:

$$L = -\frac{1}{N}\sum_{i=1}^{N}\left[y_i \log p_i + (1 - y_i)\log(1 - p_i)\right]$$

where $N$ denotes the number of training samples, $y_i$ denotes the class of training sample $i$ ($y_i$ equals 1 for a positive class sample, otherwise $y_i$ equals 0), and $p_i$ denotes the probability that sample $i$ is predicted to be a positive class sample.
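As a sanity check, the loss above is the standard binary cross entropy and can be computed directly from the predicted positive class probabilities; the sketch below is for illustration only:

```python
import math

def cross_entropy_loss(y_true: list[int], p_pred: list[float],
                       eps: float = 1e-12) -> float:
    """Binary cross entropy over N samples; eps guards against log(0)."""
    n = len(y_true)
    return -sum(y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps)
                for y, p in zip(y_true, p_pred)) / n

# e.g. cross_entropy_loss([1, 0], [0.9, 0.2]) ≈ 0.164
```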
Likewise, the second positive class sample and the second negative class sample of the third-level discriminator are constructed manually. The second positive class sample may include: entity descriptions belonging to concept words, such as "Chinese actor"; concept words constructed from knowledge graph entity attributes, such as "book collection of composer A"; and manually annotated concept words. The second negative class sample may include: non-concept words rejected by the first-level discriminator, such as "XXX height"; non-concept words constructed by removing the first character of a second positive class sample, e.g., the positive class sample "Chinese actor" generates a negative class sample by dropping its first character; negative class samples obtained from custom rules, for example a negative sample construction rule of "entity + type", such as "C actor"; and entity names, such as "cell phone".
Specifically, the server acquires the second positive class sample and the second negative class sample, performs model training based on them, and obtains the target neural network model corresponding to the third-level discriminator by minimizing a loss function. The loss function may likewise be a cross entropy loss function.
The above query concept word screening method can directly screen query words without being limited by their specific types, greatly increasing the number of concept words obtained by mining and improving the yield. As shown in the following table, with the concept word screening method provided by the application, the number of concept words produced per day is 5.2 times higher than with the conventional method, and the number of newly added concept words per day is 12.3 times higher.
Method | Concept words produced per day | Newly added concept words per day
Conventional method | 135,000 | 24,000
Method of the application | 842,000 | 320,000
Furthermore, with the approach of the application, the input is the query words and the output is the third-level positive class concepts obtained after screening by the three cascaded discriminators; the complexity of each level of discriminator increases in turn, and so does its discrimination accuracy. If only a simple first-level discriminator were used, its speed would meet the requirements of massive data processing, but the final accuracy could not be ensured; if a complex discriminator were used directly, accuracy could be ensured, but its large number of model parameters would make it too slow to meet the processing requirements of massive query words. By using the cascaded discriminators, the application effectively solves the speed problem caused by the massive query words of a search engine and improves data processing efficiency. As shown in the following table, the cascaded discriminators provided by the application are 5.5 times faster than the single-level discriminator of the conventional method.
Method | Processing time
Single-level discriminator (conventional method) | 26 hours
Cascaded discriminators (method of the application) | 4 hours
In addition, using the cascaded discriminators reduces the difficulty of model learning and improves the accuracy of the classification results. As shown in the table below, the cascaded discriminators provided by the application improve accuracy by 7 percentage points over the single-level discriminator of the conventional method.
Method | Accuracy
Single-level discriminator (conventional method) | 85%
Cascaded discriminators (method of the application) | 92%
Therefore, the concept word screening method provided by the application is not limited by the specific type of the candidate text and balances accuracy with processing efficiency.
It should be understood that, although the steps in the flowcharts of the embodiments described above are shown sequentially as indicated by the arrows, they are not necessarily performed in that order. Unless explicitly stated herein, the order of execution of these steps is not strictly limited, and they may be performed in other orders. Moreover, at least some of the steps in the above flowcharts may include multiple sub-steps or stages, which are not necessarily performed at the same time but may be performed at different times; the order of their execution is not necessarily sequential, and they may be performed in turn or alternately with at least some of the other steps or sub-steps.
Based on the same inventive concept, an embodiment of the application further provides a concept word screening device for implementing the concept word screening method described above. The implementation of the solution provided by the device is similar to that described in the method above, so for the specific limitations in the device embodiments below, reference may be made to the limitations of the concept word screening method above, which are not repeated here.
In one embodiment, as shown in fig. 9, there is provided a concept word screening apparatus 900, including: an acquisition module 901, a candidate concept word determination module 902, a positive class probability prediction module 903, and a concept word determination module 904, wherein:
an obtaining module 901, configured to obtain a candidate text;
a candidate concept word determining module 902, configured to parse the candidate text to determine candidate concept words in the candidate text that conform to the grammar condition;
the positive class probability prediction module 903 is configured to perform positive class probability prediction on the candidate concept word based on the target neural network model, so as to obtain a positive class probability of the candidate concept word; the target neural network model is obtained by training based on a positive class sample composed of concept words and a negative class sample composed of non-concept words meeting the concept word eliminating condition;
a concept word determination module 904, configured to screen concept words that satisfy the positive class probability condition from the candidate concept words; the concept words are used for characterizing an entity set; the entity set includes at least two entity objects having common characteristics.
In one embodiment, the grammar conditions include a character composition condition and a part-of-speech sequence condition. In the case of this embodiment, candidate concept word determination module 902 includes: the candidate text determining unit is used for screening candidate texts which meet the character composition conditions from the candidate texts; the part-of-speech tagging unit is used for tagging the part of speech of the candidate text to obtain a part-of-speech sequence of the candidate text; and the candidate concept word determining unit is used for screening candidate concept words with part-of-speech sequences meeting part-of-speech sequence conditions from the candidate texts.
In one embodiment, the character composition condition includes a first character composition condition and a second character composition condition. In this embodiment, the alternative text determination unit is specifically configured to: screen the candidate texts to obtain first candidate texts meeting the first character composition condition; screen, from the first candidate texts and based on their word segmentation processing results, second candidate texts meeting the second character composition condition; and determine the second candidate texts as the alternative texts.
In one embodiment, the positive class probability prediction module 903 includes: a feature extraction unit for extracting features of candidate concept words; and the positive class probability prediction unit is used for carrying out feature matching analysis on the features of the candidate concept words to obtain the positive class probability of the candidate concept words.
In one embodiment, the target neural network model includes a plurality of cascaded sub-models. In the case of this embodiment, the positive class probability prediction module 903 is specifically configured to: and respectively carrying out positive class probability prediction on the candidate concept words through a plurality of cascaded sub-models to obtain sub-positive class probabilities of the candidate concept words. The concept word determining module 904 is specifically configured to: and screening concept words which simultaneously meet the sub positive class probability conditions corresponding to the sub models from the candidate concept words. The input of the first-level submodel is a candidate concept word, and the input of each level submodel except the first-level submodel is an alternative concept word meeting the sub positive class probability condition corresponding to the previous level submodel.
In one embodiment, the target neural network model includes a first level sub-model and a second level sub-model. In the case of this embodiment, the positive class probability prediction module 903 includes: the candidate concept word screening unit is used for carrying out first positive class probability prediction on candidate concept words based on the first-level sub-model, and determining candidate concept words meeting first sub-positive class probability conditions in the candidate concept words; and the second positive class probability prediction unit is used for carrying out second positive class probability prediction on the candidate concept words based on the second-level sub-model and determining second sub-positive class probabilities of the candidate concept words. The sample categories of the negative class samples used for training and obtaining the second-level submodel are more than the sample categories of the negative class samples used for training and obtaining the first-level submodel.
In one embodiment, the concept word screening apparatus 900 further includes a model training module for training to obtain a target neural network model based on a positive class sample composed of concept words and a negative class sample composed of non-concept words.
In one embodiment, the target neural network model includes a text prediction model. In the case of this embodiment, the model training module is specifically configured to: acquiring a first positive type sample and a first negative type sample; and performing model training by using the first positive class sample and the first negative class sample to obtain a text prediction model for performing text feature matching analysis. The first positive class sample comprises at least one of entity descriptions belonging to concept words, concept words constructed based on entity attributes of a knowledge graph and manually marked concept words; the first negative class sample includes non-conceptual words that satisfy the grammar culling condition.
In one embodiment, the target neural network model includes a semantic characterization model. In the case of this embodiment, the model training module is specifically configured to: acquiring a second positive type sample and a second negative type sample; and performing model training by using the second positive class sample and the second negative class sample to obtain a semantic characterization model for performing semantic feature matching analysis. The second positive class sample comprises at least one of entity descriptions belonging to concept words, concept words constructed based on entity attributes of the knowledge graph and manually marked concept words; the second negative class sample includes at least one of non-concept words satisfying a grammar culling condition, non-concept words obtained by changing character constitution of the concept words in the second positive class sample, and entity names.
The respective modules in the above conceptual word screening apparatus may be implemented in whole or in part by software, hardware, and a combination thereof. The above modules may be embedded in hardware or may be independent of a processor in the computer device, or may be stored in software in a memory in the computer device, so that the processor may call and execute operations corresponding to the above modules.
In one embodiment, a computer device is provided, which may be a server, and the internal structure of which may be as shown in fig. 10. The computer device includes a processor, a memory, an Input/Output interface (I/O) and a communication interface. The processor, the memory and the input/output interface are connected through a system bus, and the communication interface is connected to the system bus through the input/output interface. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, computer programs, and a database. The internal memory provides an environment for the operation of the operating system and computer programs in the non-volatile storage media. The database of the computer device is for storing candidate text. The input/output interface of the computer device is used to exchange information between the processor and the external device. The communication interface of the computer device is used for communicating with an external terminal through a network connection. The computer program is executed by a processor to implement a concept word screening method. It will be appreciated by persons skilled in the art that the architecture shown in fig. 10 is merely a block diagram of some of the architecture relevant to the present inventive arrangements and is not limiting of the computer device to which the present inventive arrangements are applicable, and that a computer device may include more or fewer components than shown, or may combine some of the components, or have a different arrangement of components.
In an embodiment, there is also provided a computer device comprising a memory and a processor, the memory having stored therein a computer program, the processor implementing the steps of the method embodiments described above when the computer program is executed.
In one embodiment, a computer-readable storage medium is provided, on which a computer program is stored which, when executed by a processor, carries out the steps of the method embodiments described above.
In an embodiment, a computer program product is provided, comprising a computer program which, when executed by a processor, implements the steps of the method embodiments described above.
Those skilled in the art will appreciate that all or part of the processes of the methods described above may be implemented by a computer program stored on a non-volatile computer-readable storage medium which, when executed, may include the processes of the embodiments of the methods described above. Any reference to memory, database, or other medium used in the embodiments provided herein may include at least one of non-volatile and volatile memory. The non-volatile memory may include read-only memory (ROM), magnetic tape, floppy disk, flash memory, optical memory, high-density embedded non-volatile memory, resistive random access memory (ReRAM), magnetoresistive random access memory (MRAM), ferroelectric random access memory (FRAM), phase change memory (PCM), graphene memory, and the like. The volatile memory may include random access memory (RAM), external cache memory, and the like. By way of illustration and not limitation, RAM may take many forms, such as static random access memory (SRAM) or dynamic random access memory (DRAM). The databases referred to in the embodiments provided herein may include at least one of a relational database and a non-relational database; non-relational databases may include, but are not limited to, blockchain-based distributed databases and the like. The processors referred to in the embodiments provided herein may be, but are not limited to, general-purpose processors, central processing units, graphics processors, digital signal processors, programmable logic units, data processing logic units based on quantum computing, and the like.
The technical features of the above embodiments may be combined arbitrarily. For brevity, not all possible combinations of the technical features in the above embodiments are described; however, as long as there is no contradiction in a combination of technical features, it should be considered within the scope of this specification.
The foregoing examples represent only a few embodiments of the application and are described in some detail, but they should not be construed as limiting the scope of the application. It should be noted that several variations and modifications can be made by those skilled in the art without departing from the concept of the application, all of which fall within the protection scope of the application. Accordingly, the protection scope of the application shall be subject to the appended claims.

Claims (12)

1. A concept word screening method, the method comprising:
acquiring candidate texts;
carrying out grammar analysis on the candidate text, and determining candidate concept words meeting grammar conditions in the candidate text;
based on a target neural network model, carrying out positive class probability prediction on the candidate concept words to obtain positive class probability of the candidate concept words; the target neural network model is obtained by training based on a positive class sample composed of concept words and a negative class sample composed of non-concept words meeting concept word rejection conditions;
Screening concept words meeting the positive class probability condition from the candidate concept words; the concept words are used for representing the entity set; the set of entities includes at least two entity objects having a common characteristic.
2. The method of claim 1, wherein the grammar conditions include character composition conditions and part-of-speech sequence conditions; the step of analyzing the candidate text in grammar, and determining candidate concept words meeting grammar conditions in the candidate text comprises the following steps:
screening candidate texts meeting character composition conditions from the candidate texts;
part-of-speech tagging is carried out on the candidate text, and a part-of-speech sequence of the candidate text is obtained;
and screening candidate concept words, of which the part-of-speech sequences accord with part-of-speech sequence conditions, from the candidate texts.
3. The method of claim 2, wherein the character composition conditions include a first character composition condition and a second character composition condition; screening candidate texts meeting character composition conditions from the candidate texts, wherein the candidate texts comprise:
screening the candidate texts to obtain first candidate texts meeting the first character composition conditions;
Screening the first candidate texts to obtain second candidate texts meeting the second character composition conditions;
and determining the second candidate text as an alternative text.
4. A method according to any one of claims 1 to 3, wherein the performing positive class probability prediction on the candidate concept words based on the target neural network model to obtain the positive class probability of the candidate concept words includes:
extracting the characteristics of the candidate concept words;
and carrying out feature matching analysis on the features of the candidate concept words to obtain positive class probabilities of the candidate concept words.
5. A method according to any one of claims 1 to 3, wherein the target neural network model comprises a plurality of cascaded sub-models; the target neural network model-based positive class probability prediction is carried out on the candidate concept words to obtain positive class probability of the candidate concept words, and the method comprises the following steps:
respectively carrying out sub-positive class probability prediction on the candidate concept words through a plurality of cascaded sub-models to obtain sub-positive class probabilities of the candidate concept words;
the step of screening concept words meeting the positive class probability condition from the candidate concept words comprises the following steps:
Screening concept words which simultaneously meet the sub positive class probability conditions corresponding to the sub models from the candidate concept words; the input of the first-level submodel is the candidate concept word, and the input of each level submodel except the first-level submodel is the candidate concept word meeting the sub positive class probability condition corresponding to the previous level submodel.
6. The method of claim 5, wherein the target neural network model comprises a first level submodel and a second level submodel; the target neural network model-based positive class probability prediction is carried out on the candidate concept words to obtain positive class probability of the candidate concept words, and the method comprises the following steps:
performing first positive class probability prediction on the candidate concept words based on a first-level sub-model, and determining candidate concept words meeting a first sub-positive class probability condition in the candidate concept words;
performing second positive class probability prediction on the alternative concept words based on a second-level sub-model, and determining second sub-positive class probabilities of the alternative concept words; the negative class samples used to train the second-level sub-model cover more sample categories than the negative class samples used to train the first-level sub-model.
7. The method of claim 5, wherein the target neural network model comprises a text prediction model; the process of training to obtain the text prediction model comprises the following steps:
acquiring a first positive type sample and a first negative type sample; the first positive class sample comprises at least one of entity descriptions belonging to concept words, concept words constructed based on entity attributes of a knowledge graph and manually marked concept words; the first negative class sample comprises non-concept words meeting grammar eliminating conditions;
and performing model training by using the first positive class sample and the first negative class sample to obtain a text prediction model for performing text feature matching analysis.
8. The method of claim 5, wherein the target neural network model comprises a semantic characterization model; the process of training to obtain the semantic characterization model comprises the following steps:
acquiring a second positive type sample and a second negative type sample; the second positive class sample comprises at least one of entity descriptions belonging to concept words, concept words constructed based on entity attributes of a knowledge graph and manually marked concept words; the second negative class sample comprises at least one of non-concept words meeting grammar rejection conditions, non-concept words obtained by changing character constitution of the concept words in the second positive class sample and entity names;
And performing model training by using the second positive class sample and the second negative class sample to obtain a semantic characterization model for performing semantic feature matching analysis.
9. A concept word screening apparatus, the apparatus comprising:
the acquisition module is used for acquiring candidate texts;
the candidate concept word determining module is used for carrying out grammar analysis on the candidate text and determining candidate concept words meeting grammar conditions in the candidate text;
the positive class probability prediction module is used for carrying out positive class probability prediction on the candidate concept words based on a target neural network model to obtain positive class probabilities of the candidate concept words; the target neural network model is obtained by training based on a positive class sample composed of concept words and a negative class sample composed of non-concept words meeting concept word rejection conditions;
the concept word determining module is used for screening concept words meeting the positive class probability condition from the candidate concept words; the concept words are used for representing the entity set; the set of entities includes at least two entity objects having a common characteristic.
10. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor implements the steps of the method of any one of claims 1 to 8 when the computer program is executed.
11. A computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the steps of the method of any of claims 1 to 8.
12. A computer program product comprising a computer program, characterized in that the computer program, when executed by a processor, implements the steps of the method of any one of claims 1 to 8.
CN202210616004.7A 2022-06-01 2022-06-01 Concept word screening method, device, computer equipment and storage medium Pending CN117216250A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210616004.7A CN117216250A (en) 2022-06-01 2022-06-01 Concept word screening method, device, computer equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210616004.7A CN117216250A (en) 2022-06-01 2022-06-01 Concept word screening method, device, computer equipment and storage medium

Publications (1)

Publication Number Publication Date
CN117216250A true CN117216250A (en) 2023-12-12

Family

ID=89037614

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210616004.7A Pending CN117216250A (en) 2022-06-01 2022-06-01 Concept word screening method, device, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN117216250A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination