CN111831821A - Training sample generation method and device of text classification model and electronic equipment - Google Patents

Training sample generation method and device of text classification model and electronic equipment

Info

Publication number
CN111831821A
Authority
CN
China
Prior art keywords
seed
training sample
target
word
words
Prior art date
Legal status
Granted
Application number
CN202010493959.9A
Other languages
Chinese (zh)
Other versions
CN111831821B (en)
Inventor
刘昊
肖欣延
Current Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202010493959.9A
Publication of CN111831821A
Application granted
Publication of CN111831821B
Legal status: Active
Anticipated expiration


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/31 Indexing; Data structures therefor; Storage structures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/33 Querying
    • G06F 16/3331 Query processing
    • G06F 16/334 Query execution
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The application discloses a training sample generation method and apparatus for a text classification model, and an electronic device, relating to the technical fields of natural language processing and deep learning. The implementation scheme is as follows: seed words of a target content type are acquired and used as search words to search for multiple target texts; the target texts are each used as a training sample and labeled with the target content type to generate a training sample set of the target content type; keywords are generated according to a plurality of training samples in the training sample set; the seed words are updated according to the keywords; and the updated seed words are used as search words to search again, with the newly searched target texts labeled as training samples and added to the training sample set. By extracting keywords from the existing training samples, updating the seed words with the keywords, and then searching with the updated seed words to obtain more training samples, samples no longer need to be generated manually, which reduces cost and improves generation efficiency.

Description

Training sample generation method and device of text classification model and electronic equipment
Technical Field
The present application relates to the field of computer technology, and in particular, to the field of natural language processing and deep learning.
Background
Text classification assigns texts to one or a few specified categories in a given classification system. It is one of the most basic tasks in Natural Language Processing (NLP) and has numerous applications in content understanding, information retrieval, personalized recommendation, and the like.
In practice, text classification models based on deep learning methods are generally adopted, and training such models requires large-scale, high-quality training data. Meanwhile, the training sample data to be constructed differs across classification scenarios, so how to automatically construct large-scale training data based on the classification requirement is a problem to be solved urgently.
Disclosure of Invention
A training sample generation method and apparatus for a text classification model, and an electronic device, are provided.
According to the first aspect, a method for generating training samples of a text classification model is provided. Keywords are extracted from the existing training samples, the seed words are updated with the keywords, and the updated seed words are then used for searching to obtain more training samples. The training sample set is thereby expanded without manual operation, which solves the technical problems of high cost and low efficiency caused by continuously expanding the training sample set manually in the prior art.
A second aspect of the present application provides a training sample generation apparatus for a text classification model.
A third aspect of the present application provides an electronic device.
A fourth aspect of the present application proposes a non-transitory computer-readable storage medium storing computer instructions.
According to a first aspect, there is provided a method for generating training samples of a text classification model, the method comprising:
acquiring seed words of a target content type, using the seed words as search words, and searching to obtain multiple target texts;
using the multiple target texts as training samples and labeling each with the target content type, so as to generate a training sample set of the target content type;
generating keywords according to a plurality of training samples in the training sample set;
updating the seed words according to the keywords; and
using the updated seed words as search words, searching again, labeling the newly searched target texts with the target content type as training samples, and adding them to the training sample set.
According to a second aspect, there is provided a training sample generation apparatus comprising:
an acquisition module, configured to acquire seed words of the target content type, use the seed words as search words, and search to obtain multiple target texts;
a labeling module, configured to use the multiple target texts as training samples and label each with the target content type, so as to generate a training sample set of the target content type;
an extraction module, configured to generate keywords according to a plurality of training samples in the training sample set;
an updating module, configured to update the seed words according to the keywords; and
an execution module, configured to search again using the updated seed words as search words, label the newly searched target texts with the target content type as training samples, and add them to the training sample set.
According to a third aspect, there is provided an electronic device comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the training sample generation method of the first aspect.
According to a fourth aspect, there is provided a non-transitory computer readable storage medium having stored thereon computer instructions for causing a computer to execute the training sample generation method of the first aspect.
The technical solutions provided by the embodiments of the present application have the following beneficial effects:
Seed words of a target content type are acquired and used as search words to search for multiple target texts; the target texts are each used as a training sample and labeled with the target content type to generate a training sample set of the target content type; keywords are generated according to a plurality of training samples in the training sample set; the seed words are updated according to the keywords; and the updated seed words are used as search words to search again, with the newly searched target texts labeled with the target content type as training samples and added to the training sample set. By extracting keywords from the existing training samples, updating the seed words with the keywords, and then searching with the updated seed words to obtain more training samples, the training samples are expanded without manual operation, which reduces cost and improves generation efficiency.
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present disclosure, nor do they limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
The drawings are included to provide a better understanding of the present solution and are not intended to limit the present application. In the drawings:
fig. 1 is a schematic flowchart of a method for generating training samples of a text classification model according to an embodiment of the present disclosure;
fig. 2 is a schematic flowchart of another method for generating training samples of a text classification model according to an embodiment of the present disclosure;
fig. 3 is a schematic flowchart of a training sample generation method for a text classification model according to another embodiment of the present disclosure;
fig. 4 is a schematic flowchart of a training sample generation method for a text classification model according to an embodiment of the present application;
fig. 5 is a schematic structural diagram of a training sample generation apparatus for a text classification model according to an embodiment of the present application; and
fig. 6 is a block diagram of an electronic device of a training sample generation method of a text classification model according to an embodiment of the present application.
Detailed Description
The following description of exemplary embodiments of the present application, taken in conjunction with the accompanying drawings, includes various details of the embodiments to aid understanding; these details are to be considered exemplary only. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present application. Likewise, descriptions of well-known functions and constructions are omitted from the following description for clarity and conciseness.
The following describes a training sample generation method, a training sample generation device, and an electronic device of a text classification model according to an embodiment of the present application with reference to the drawings.
Fig. 1 is a schematic flowchart of a method for generating a training sample of a text classification model according to an embodiment of the present application.
As shown in fig. 1, the method comprises the steps of:
Step 101, acquiring seed words of the target content type, using the seed words as search words, and searching to obtain multiple target texts.
The target content type refers to the type to which the text to be classified belongs, for example, an education type, an e-commerce type, and the like. The text classification model in this embodiment can classify texts of various content types.
In this embodiment, the number of seed words of the target content type may be one or more. A seed word refers to a word that can be used to indicate the target content type. For example, suppose the content type of the text to be classified is the education type, and the education type includes a plurality of sub-types, such as kindergarten, primary school, middle school, university, early education institution, and training institution. As one possible implementation, since each sub-category also belongs to the education category, any one of them, e.g., kindergarten, may be taken as a seed word, because "kindergarten" indicates the education type. Similarly, primary school or training institution can each be used as a seed word. As another possible implementation, several of the sub-categories may be used together as seed words, which are not listed one by one in this embodiment.
In this embodiment, the seed word is used as a search word and multiple target texts are obtained by searching. As one possible implementation, the seed word is used as a search word, a plurality of texts are obtained by searching in a search engine, and a preset number of top-ranked texts are taken as the target texts; alternatively, the target texts are determined from the plurality of texts according to the semantic correlation between the seed word and each text.
As another possible implementation, multiple second search records in which the seed word was the search word may be obtained by querying the log, browsed texts (texts with browsing operations) may be determined from these search records, and the target texts may be selected from the browsed texts according to their browsing counts. Among the texts corresponding to the search records, those clicked and browsed by users are regarded as highly correlated with the search word, and the more browsing counts a text has, the higher its correlation with the search word; selecting target texts from the browsed texts based on browsing counts therefore improves the accuracy of determining the target texts.
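The log-based selection of target texts described above can be sketched in a few lines of Python. This is a minimal illustration, not part of the disclosure: the record format (a text identifier plus a flag indicating whether the text was browsed) and the threshold values are assumptions.

```python
from collections import Counter

def select_target_texts(search_records, min_views=2, top_k=100):
    """Select target texts from logged search records for a seed word.

    Each record is assumed to be a (text_id, was_browsed) tuple. Texts
    that were clicked and browsed more often are treated as more
    relevant to the search word, so they are ranked by browse count.
    """
    views = Counter(text_id for text_id, browsed in search_records if browsed)
    # Keep the most-browsed texts that meet the minimum-views threshold.
    return [text_id for text_id, n in views.most_common(top_k) if n >= min_views]
```

A text that appears in the log but was never browsed is discarded, matching the assumption that click-and-browse behavior signals relevance.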
Step 102, using the multiple target texts as training samples and labeling each with the target content type, so as to generate a training sample set of the target content type.
In this embodiment, the obtained target texts are used as training samples, and each of these samples is labeled with the target content type; that is, every target text used as a training sample carries labeling information, namely the target content type, so as to generate a training sample set of the target content type.
It should be noted that the training samples for the text classification model in the present application are generated based on the requirements of text classification. For example, if texts are to be classified into an education category, training samples belonging to that category need to be generated; if the education category is further divided by sub-category, e.g., primary school, preschool education, or adult education, training samples of the corresponding sub-categories need to be constructed, so that the text classification model can output the corresponding primary school, early education, or adult education category based on the input text, meeting different text classification requirements.
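The labeling of step 102 amounts to attaching the target content type to every retrieved text. A minimal sketch, where the dictionary field names are illustrative assumptions rather than anything specified by the disclosure:

```python
def build_training_set(target_texts, content_type):
    """Label each retrieved text with the target content type.

    Returns a list of labeled samples forming the initial training
    sample set of that content type.
    """
    return [{"text": t, "label": content_type} for t in target_texts]
```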
Step 103, generating keywords according to a plurality of training samples in the training sample set.
In an embodiment of the application, keyword extraction is performed on each of a plurality of training samples in the training sample set, so as to obtain keywords generated based on the training samples.
Step 104, updating the seed words according to the keywords.
In an embodiment of the application, the keywords generated in the above step are used to replace the original seed words, so as to update the seed words.
In another embodiment of the present application, in order to improve the accuracy of the updated seed words, the keywords are selected according to their co-occurrence counts with the seed words in the training sample set. The co-occurrence count of a keyword and a seed word in the training sample set indicates the degree of correlation between them: the greater the co-occurrence count, the greater the correlation. Therefore, keywords whose co-occurrence counts meet a threshold requirement can be taken as the selected keywords, and the seed words are updated with the selected keywords, thereby improving the accuracy of the updated seed words.
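The co-occurrence screening described above can be illustrated as follows. The substring-based notion of "co-occurrence in a sample" and the threshold value are assumptions made for the sketch, not details fixed by the disclosure.

```python
def filter_keywords_by_cooccurrence(keywords, seed_words, samples, min_count=3):
    """Keep only keywords that co-occur with a seed word often enough.

    A keyword and a seed word are counted as co-occurring in a sample
    when both appear in its text (naive substring matching here).
    Keywords whose co-occurrence count meets min_count are retained
    and used to update the seed words.
    """
    kept = []
    for kw in keywords:
        count = sum(
            1 for s in samples
            if kw in s and any(seed in s for seed in seed_words)
        )
        if count >= min_count:
            kept.append(kw)
    return kept
```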
Step 105, using the updated seed words as search words, searching again, labeling the newly searched target texts with the target content type as training samples, and adding them to the training sample set.
In an embodiment of the application, the updated seed words are used as search words and the search is performed again to obtain target texts, where the number of target texts obtained may be one or more.
As a possible implementation, the updated seed words may be used as search words to search for texts in a search engine; when multiple texts are returned, a preset number of top-ranked texts are taken as the target texts. Alternatively, the target texts are determined from the searched texts according to the semantic correlation between the updated seed words and those texts.
As another possible implementation, multiple search records in which an updated seed word was the search word may be obtained by querying the log, browsed texts may be determined from these search records, and the target texts may be selected from the browsed texts according to their browsing counts. Texts clicked and browsed by users are regarded as highly correlated with the search word, and the more browsing counts, the higher the correlation, so selecting target texts from the browsed texts based on browsing counts improves the accuracy of determining the target texts.
It should be noted that, according to the required number of training samples, steps 103-105 above can be performed in a loop to continuously expand the number of training samples in the training sample set.
In the training sample generation method of this embodiment, seed words of a target content type are acquired and used as search words to obtain multiple target texts; the target texts are each labeled as training samples to generate a training sample set of the target content type; keywords are generated according to a plurality of training samples in the set; the seed words are updated according to the keywords; and the updated seed words are used as search words to search again, with the newly searched target texts labeled as training samples and added to the training sample set. By extracting keywords from the existing training samples, replacing the original seed words with the keywords, and then searching with the updated seed words to obtain more training samples, the training samples are expanded without manual operation, which reduces cost and improves generation efficiency.
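The overall loop of steps 101-105 can be summarized in compact Python. The search engine and keyword extractor are abstracted as callables supplied by the caller; their names, signatures, and the round count are illustrative assumptions, not part of the disclosure.

```python
def expand_training_set(seed_words, content_type, search_fn, extract_fn, rounds=3):
    """Iteratively grow the labeled training set (sketch of steps 101-105).

    search_fn(words) -> list of texts; extract_fn(texts) -> keywords.
    Both stand in for the search engine and the keyword extractor.
    """
    # Step 101-102: initial search and labeling.
    samples = [(t, content_type) for t in search_fn(seed_words)]
    for _ in range(rounds):
        # Step 103: generate keywords from the current samples.
        keywords = extract_fn([t for t, _ in samples])
        # Step 104: update the seed words (keep old ones if none found).
        seed_words = keywords or seed_words
        # Step 105: search again, label, and add to the set.
        samples += [(t, content_type) for t in search_fn(seed_words)]
    return samples
```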
The previous embodiment described that the seed words are related to the target content type and that the training samples are determined from the search results of the seed words; that is, the larger the number of seed words, the larger the number of training samples that can be determined. Therefore, the number of seed words needs to be expanded, thereby increasing the number of generated training samples.
Based on the foregoing embodiment, fig. 2 is a schematic flowchart of a method for generating training samples of another text classification model provided in the embodiment of the present application, and as shown in fig. 2, the method includes the following steps:
step 201, using the category name of the target content type as the initial seed word.
In this embodiment, the category name of the target content type is used as the initial seed word, where the number of initial seed words is usually small, for example, 1 or 2. The related seed words are then expanded from the initial seed words, as described in detail in the following steps.
For example, suppose the text classification model classifies the education category, which includes a plurality of sub-categories such as kindergarten, primary school, and early education institution; the category name of each sub-category may then be used as an initial seed word. For convenience of explanation, this embodiment takes one sub-category, kindergarten, as an example to describe seed word expansion, so the initial seed word may be the sub-category name "kindergarten".
Step 202, querying the log to obtain expanded seed words that retrieved the same texts as the initial seed word.
In this embodiment, the search results recorded in the query log are obtained. Since these results are organized by search word, a plurality of texts searched with the initial seed word as the search word can be determined from the log; further, other search words that retrieved the same texts as the initial seed word are obtained from the log and taken as expanded seed words. The seed words thus include not only the initial seed word but also the expanded seed words, increasing the number of seed words.
In practical applications, when searching with a search word, some of the retrieved texts have poor relevance. To improve the correlation between the search word and the texts, the poorly relevant texts need to be screened out. Accordingly, step 202 includes the following two possible implementations.
In a possible implementation manner of the embodiment of the present application, step 202 may further include the following steps:
querying the log to obtain a plurality of first search records in which the initial seed word was the search word;
determining, from the plurality of first search records, the titles of texts with browsing operations; and
using other search words that retrieved the same titles as expanded seed words.
In this embodiment, when searching with a search word, the titles of the search results are generally displayed, and a text that receives a browsing operation in the results is generally one that matches the user's search requirement well. Therefore, based on the query log, the titles of texts with browsing operations are determined from the plurality of first search records in which the initial seed word was the search word, and other search words that also retrieved those titles are taken as expanded seed words, because search words with the same search results are semantically similar. For example, suppose a logged search term "motto team" is semantically similar to "kindergarten", and both, when used as search words, retrieve a related text titled "kindergarten enrollment"; then "motto team" can be used as an expanded seed word of "kindergarten". In this embodiment, sharing the title of a browsed text is used as the screening condition, and the expanded seed words of the initial seed word are determined from the search words, which realizes seed word expansion while improving the accuracy of the expanded seed words.
In the above implementation, poorly relevant texts are screened out according to whether a text's title was clicked and browsed. In practical applications, the number of times a text's title is clicked and browsed indicates the degree of correlation between the search word and the text: for the same search word, among the retrieved texts, the more times a title is clicked and browsed, the greater the correlation between the search word and that title. Using this signal improves the correlation between clicked-and-browsed titles and the search word, and further improves the accuracy of taking the corresponding search words as expanded seed words. In another possible implementation manner of the embodiment of the present application, step 202 may further include the following steps:
querying the log to obtain a plurality of first search records in which the initial seed word was the search word;
determining, from the plurality of first search records, the titles of texts with browsing operations;
generating, for each first search record, binary (pair) information of the title and the initial seed word;
screening the titles according to the occurrence frequency of the binary information; and
using other search words that retrieved the screened titles as expanded seed words.
In this embodiment, after determining the titles of browsed texts searched with the initial seed word, in order to increase the correlation between the titles and the search word (i.e., the initial seed word), binary information of each title and the initial seed word is generated. According to a preset frequency threshold, binary information whose occurrence frequency meets the threshold is retained and the rest is deleted, so that the titles are screened and the retained titles have strong correlation with the initial seed word. The search words that retrieved the screened titles are then used as expanded seed words, thereby improving the correlation and accuracy between the expanded seed words and the initial seed word.
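The two-stage screening above — keep only titles whose (title, initial seed word) pair occurs frequently enough, then take the other search words that retrieved those titles as expanded seed words — can be sketched as follows. The log record format and the frequency threshold are assumptions for illustration.

```python
from collections import Counter

def expand_seed_words(first_search_records, initial_seed, min_pair_freq=2):
    """Expand seed words from query-log records (sketch of step 202).

    Each record is assumed to be (search_word, title, was_browsed).
    Titles are kept only when the (title, initial_seed) pair occurs at
    least min_pair_freq times among browsed records; any other search
    word that retrieved a kept title becomes an expanded seed word.
    """
    browsed = [(w, t) for w, t, b in first_search_records if b]
    # Frequency of (title, initial seed word) pairs.
    pair_freq = Counter(t for w, t in browsed if w == initial_seed)
    kept_titles = {t for t, n in pair_freq.items() if n >= min_pair_freq}
    # Other search words sharing a kept title are expanded seed words.
    return {w for w, t in browsed if t in kept_titles and w != initial_seed}
```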
Step 203, using the initial seed word and the expanded seed words as the seed words of the target content type.
Furthermore, the initial seed word and the expanded seed words are together used as the seed words of the target content type, so that the number of seed words of the target content type is increased while their accuracy is ensured.
In the training sample generation method of this embodiment, the category name of the target content type is used as the initial seed word, and, based on the search log, search words recorded as retrieving the titles of the same texts are used as expanded seed words of the initial seed word. The number of seed words is thus expanded while their accuracy is ensured, so that, based on the expanded seed words, the number of target texts obtained by searching can be increased, thereby increasing the number of training samples of the target content type.
In the above implementation, a training sample set of the target content type is generated from the expanded seed words, and keywords are generated according to a plurality of training samples in the set, so that the existing seed words are replaced according to the keywords and thereby updated. The quality of the expanded training samples is therefore correlated with the quality of the keywords generated from the training samples in the set; to improve the quality of the generated keywords, step 103 can be implemented in the following two possible ways.
Based on the foregoing embodiment, as a possible implementation manner, fig. 3 is a schematic flowchart of a method for generating a training sample of a text classification model provided in the embodiment of the present application, and as shown in fig. 3, step 103 may include the following steps:
step 301, according to the seed words, a plurality of target samples containing the seed words in the title or the abstract are determined from the training sample set.
In one embodiment, the training sample set includes a plurality of training samples. Since a sample whose title or abstract contains a seed word is relatively highly correlated with that seed word, using such samples as the target samples improves the accuracy of the candidate keywords in the next step.
Step 302, extracting a plurality of candidate keywords from each target sample.
As a possible implementation, for each target sample, a word segmentation operation is performed on all sentences of the target text to obtain the word units of each sentence; the word features of each word unit, its sentence features in the corresponding sentence, and its text features in the target text are acquired; and, based on a deep learning model established by a deep learning algorithm, a keyword extraction operation is performed on each sentence using the word, sentence, and text features of its word units, so as to extract a plurality of candidate keywords from each sentence.
In this embodiment, the method for extracting the candidate keywords from the target sample is not limited.
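Since the extraction method is not limited, a simple TF-IDF scorer can stand in for the deep-learning extractor when illustrating step 302. Whitespace tokenization and the scoring formula are simplifying assumptions made for this sketch.

```python
import math
from collections import Counter

def extract_candidate_keywords(target_sample, corpus, top_k=5):
    """Score words in a target sample by TF-IDF against a corpus.

    target_sample is one text; corpus is the list of all texts. Words
    frequent in the sample but rare in the corpus score highest and
    are returned as candidate keywords.
    """
    tf = Counter(target_sample.split())
    n_docs = len(corpus)
    scores = {}
    for word, freq in tf.items():
        df = sum(1 for doc in corpus if word in doc.split())
        scores[word] = freq * math.log((1 + n_docs) / (1 + df))
    return [w for w, _ in sorted(scores.items(), key=lambda x: -x[1])[:top_k]]
```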
Step 303, determining semantic relatedness between each candidate keyword and the target sample.
And 304, screening a plurality of candidate keywords according to the semantic relevance between each candidate keyword and the target sample to obtain the keywords.
As a possible implementation manner, a trained semantic recognition model is used to determine the semantic relatedness between each candidate keyword and the target sample.
Furthermore, the semantic relatedness between each candidate keyword and the target sample is compared with a preset first relatedness threshold. Candidate keywords whose relatedness is below the first relatedness threshold are screened out, and candidate keywords whose semantic relatedness to the target sample is greater than or equal to the first relatedness threshold are taken as the determined keywords. In this way, candidate keywords with high semantic relatedness are selected as keywords, which improves the accuracy of keyword determination.
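The threshold screening described above can be sketched as follows; the relatedness scores are assumed to come from the trained semantic recognition model, which is not reimplemented here:

```python
def filter_by_relatedness(candidates, relatedness, threshold):
    """Keep candidates whose semantic relatedness meets the threshold.

    `relatedness` maps candidate -> score in [0, 1], e.g. produced by a
    trained semantic recognition model (assumed available elsewhere).
    """
    return [c for c in candidates if relatedness.get(c, 0.0) >= threshold]
```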
In the training sample generation method of this embodiment, a plurality of target samples whose titles or abstracts contain the seed words are determined from the training sample set according to the seed words, which increases the correlation between the target samples and the seed words. Candidate keywords with high semantic relatedness to the target samples are then selected as keywords, which improves the accuracy of keyword determination.
Among the training samples, some target samples have a high semantic correlation with the seed words and others have a low one; that is, the semantic relatedness between the candidate keywords extracted from different target samples and the seed words also differs. Therefore, determining the keywords according to the semantic relatedness between the candidate keywords and the seed words can improve the accuracy of the determined keywords, and thereby the accuracy of the updated seed words.
Therefore, based on the foregoing embodiment, as another possible implementation manner, fig. 4 is a flowchart illustrating a method for generating a training sample of a text classification model according to another embodiment of the present application, and as shown in fig. 4, step 103 may include the following steps:
Step 401, according to the seed words, a plurality of target samples whose titles or abstracts contain the seed words are determined from the training sample set.
Step 402, extracting a plurality of candidate keywords for each target sample respectively.
Specifically, step 301 and step 302 in the previous embodiment can be referred to, and the principle is the same, which is not described herein again.
Step 403, determining semantic relevance between each candidate keyword and the seed word.
Step 404, a plurality of candidate keywords are screened according to the semantic relatedness between each candidate keyword and the seed words, to obtain the keywords.
As a possible implementation manner, each candidate keyword is converted into a word vector and the seed word is converted into a word vector, and the semantic relatedness between the candidate keyword and the seed word is determined using the cosine distance or the Euclidean distance between the two vectors. Then, according to a preset second relatedness threshold, candidate keywords whose semantic relatedness is greater than or equal to the second relatedness threshold are taken as the determined keywords. In this embodiment, candidate keywords with high semantic relatedness to the seed words are selected as keywords, which improves the accuracy of keyword determination.
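A minimal sketch of this word-vector comparison follows. The source of the word vectors (e.g. a pretrained embedding model) is assumed and not specified by the embodiment, and cosine similarity stands in for the "cosine distance" mentioned above:

```python
import math

def cosine_similarity(vec_a, vec_b):
    """Cosine similarity between two equal-length word vectors."""
    dot = sum(a * b for a, b in zip(vec_a, vec_b))
    norm_a = math.sqrt(sum(a * a for a in vec_a))
    norm_b = math.sqrt(sum(b * b for b in vec_b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def select_keywords(candidate_vecs, seed_vec, threshold):
    """Keep candidates whose similarity to the seed vector meets threshold.

    `candidate_vecs` maps candidate word -> vector; how the vectors are
    produced is an assumption, since the patent does not specify it.
    """
    return [w for w, v in candidate_vecs.items()
            if cosine_similarity(v, seed_vec) >= threshold]
```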
In the training sample generation method of this embodiment, a plurality of target samples whose titles or abstracts contain the seed words are determined from the training sample set according to the seed words, which increases the correlation between the target samples and the seed words. Candidate keywords with high semantic relatedness to the seed words are then selected as keywords, which improves the accuracy of keyword determination.
In order to implement the above embodiments, the present application further provides a training sample generation apparatus for a text classification model.
Fig. 5 is a schematic structural diagram of a training sample generation apparatus for a text classification model according to an embodiment of the present application.
As shown in fig. 5, the apparatus includes: an acquisition module 51, an annotation module 52, an extraction module 53, an update module 54 and an execution module 55.
The obtaining module 51 is configured to acquire seed words of the target content type, use the seed words as search words, and obtain a plurality of target texts through searching.
And the labeling module 52 is configured to label the plurality of target texts as training samples respectively, so as to generate a training sample set of the target content type.
The extracting module 53 is configured to generate a keyword according to a plurality of training samples in the training sample set.
And an updating module 54, configured to update the seed word according to the keyword.
And the execution module 55 is configured to take the updated seed words as search words, search again, label the target texts obtained by the new search as training samples, and add them to the training sample set.
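Taken together, modules 51 through 55 implement an iterative expansion loop. The following structural sketch makes that loop explicit; the `search`, `label`, `extract_keywords`, and `update_seeds` callables are placeholders for the modules above, and their signatures are assumptions:

```python
def expand_training_set(seed_words, search, label, extract_keywords,
                        update_seeds, rounds=2):
    """Iteratively grow a training sample set from seed words.

    The four injected callables stand in for the acquisition, labeling,
    extraction, and update modules; this loop is a structural sketch,
    not the embodiment's concrete implementation.
    """
    training_set = []
    seeds = list(seed_words)
    for _ in range(rounds):
        texts = search(seeds)                  # seed words as search words
        training_set.extend(label(t) for t in texts)
        keywords = extract_keywords(training_set)
        seeds = update_seeds(seeds, keywords)  # refresh seeds for next round
    return training_set, seeds
```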
In a possible implementation manner of the embodiment of the present application, the extracting module 53 includes:
a first determining unit, configured to determine, according to the seed word, a plurality of target samples including the seed word in a title or an abstract from the training sample set.
And the extracting unit is used for extracting a plurality of candidate keywords for each target sample respectively.
The first determining unit is configured to determine a semantic correlation between each of the candidate keywords and the target sample.
And the screening unit is used for screening the candidate keywords according to the semantic correlation degree between each candidate keyword and the target sample so as to obtain the keywords.
As another possible implementation manner, the first determining unit is configured to determine a semantic correlation between each candidate keyword and the seed word.
And the screening unit is used for screening the candidate keywords according to the semantic correlation degree between each candidate keyword and the seed word.
As a possible implementation manner, the updating module 54 is specifically configured to select the keyword according to the number of co-occurrences of the keyword and the seed word in the training sample set, so as to update the seed word with the selected keyword.
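The co-occurrence-based selection can be sketched as follows; counting co-occurrence at the granularity of whole sample texts is an assumption, since the embodiment only states that co-occurrences are counted within the training sample set:

```python
def select_by_cooccurrence(keywords, seed_words, samples, min_count=2):
    """Select keywords that co-occur with any seed word in enough samples.

    A "co-occurrence" here means the keyword and some seed word both
    appearing in the same sample text; `samples` is assumed to be a list
    of plain-text strings.
    """
    selected = []
    for kw in keywords:
        count = sum(
            1 for text in samples
            if kw in text and any(seed in text for seed in seed_words)
        )
        if count >= min_count:
            selected.append(kw)
    return selected
```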
As a possible implementation manner, the obtaining module 51 includes:
and the second determining unit is used for taking the category name of the target content type as the initial seed word.
And the query unit is configured to query the log to obtain expanded seed words used to search for the same text as the initial seed words.
The second determining unit is further configured to use the initial seed word and the extended seed word as seed words of the target content type.
As a possible implementation manner, the querying unit is specifically configured to query the log to obtain a plurality of first search records using the initial seed word as a search word, determine, according to the plurality of first search records, a title of a text in which a browsing operation exists, and use the search word searched for the title as the extended seed word.
As a possible implementation manner, the query unit is specifically configured to generate binary information of the title and the initial seed word for each first search record, and filter the title according to occurrence frequency of the binary information.
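The binary-information filtering can be sketched as follows; representing each first search record as a dict with a `title` field, and counting (title, initial seed word) pairs, are illustrative assumptions:

```python
from collections import Counter

def filter_titles_by_binary_info(records, initial_seed, min_freq=2):
    """Filter titles by frequency of (title, seed word) binary information.

    A title is kept only if its (title, initial_seed) pair occurs at
    least min_freq times across the first search records; the record
    schema is illustrative, not specified by the embodiment.
    """
    pair_counts = Counter((rec["title"], initial_seed) for rec in records)
    return [title for (title, _), n in pair_counts.items() if n >= min_freq]
```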
As a possible implementation manner, the obtaining module 51 is further configured to query a log to obtain a plurality of second search records using the seed word as a search word, determine, according to the plurality of second search records, a plurality of browsing texts with browsing operations, and select the target text from the plurality of browsing texts according to browsing times of the plurality of browsing texts.
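The selection of target texts from the second search records can be sketched as follows; the record schema (`query`, `text`, and `browsed` fields) is an illustrative assumption, not one the embodiment specifies:

```python
from collections import Counter

def select_target_texts(search_records, seed_words, top_n=2):
    """Pick the most-browsed texts among records matching the seed words.

    Records whose search word is a seed word and that have a browsing
    operation are counted per text; the top_n most-browsed texts are
    returned as target texts.
    """
    browse_counts = Counter()
    for rec in search_records:
        if rec["query"] in seed_words and rec["browsed"]:
            browse_counts[rec["text"]] += 1
    return [text for text, _ in browse_counts.most_common(top_n)]
```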
It should be noted that the foregoing explanation of the training sample generation method embodiment is also applicable to the training sample generation apparatus of this embodiment, and the principle is the same, and is not repeated here.
In the training sample generation apparatus of this embodiment, seed words of a target content type are acquired and used as search words to search for a plurality of target texts; the target texts are respectively labeled as training samples to generate a training sample set of the target content type; keywords are generated according to a plurality of training samples in the training sample set; the seed words are updated according to the keywords; and the updated seed words are used as search words to search again, with the newly searched target texts labeled as training samples and added to the training sample set. In this way, keywords are extracted from the existing training samples, the seed words are updated by the keywords, and more training samples are then obtained through the updated seed words, so that the training samples are expanded without manual operation, which reduces cost and improves generation efficiency.
In order to implement the above embodiments, an embodiment of the present application provides an electronic device, including: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform a method of training sample generation for a text classification model as described in the method embodiments above.
In order to implement the foregoing embodiments, the present application provides a non-transitory computer-readable storage medium storing computer instructions for causing a computer to execute the training sample generation method of a text classification model according to the foregoing method embodiments.
According to an embodiment of the present application, an electronic device and a readable storage medium are also provided.
Fig. 6 is a block diagram of an electronic device for generating training samples of a text classification model according to an embodiment of the present application. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital processing, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not meant to limit implementations of the present application that are described and/or claimed herein.
As shown in fig. 6, the electronic apparatus includes: one or more processors 601, memory 602, and interfaces for connecting the various components, including a high-speed interface and a low-speed interface. The various components are interconnected using different buses and may be mounted on a common motherboard or in other manners as desired. The processor may process instructions for execution within the electronic device, including instructions stored in or on the memory to display graphical information of a GUI on an external input/output apparatus (such as a display device coupled to the interface). In other embodiments, multiple processors and/or multiple buses may be used, as desired, along with multiple memories. Also, multiple electronic devices may be connected, with each device providing portions of the necessary operations (e.g., as a server array, a group of blade servers, or a multi-processor system). In fig. 6, one processor 601 is taken as an example.
The memory 602 is a non-transitory computer readable storage medium as provided herein. The memory stores instructions executable by at least one processor to cause the at least one processor to perform the method for generating training samples for a text classification model provided herein. A non-transitory computer-readable storage medium of the present application stores computer instructions for causing a computer to perform a training sample generation method of a text classification model provided herein.
The memory 602 is a non-transitory computer readable storage medium, and can be used to store non-transitory software programs, non-transitory computer executable programs, and modules, such as program instructions/modules corresponding to the training sample generation method of the text classification model in the embodiment of the present application (for example, the obtaining module 51, the labeling module 52, the extracting module 53, the updating module 54, and the executing module 55 shown in fig. 5). The processor 601 executes various functional applications of the server and data processing by running non-transitory software programs, instructions and modules stored in the memory 602, namely, implementing the training sample generation method of the text classification model in the above method embodiments.
The memory 602 may include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function; the storage data area may store data created from use of the electronic device of the training sample generation method of the text classification model, and the like. Further, the memory 602 may include high speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid state storage device. In some embodiments, the memory 602 optionally includes memory remotely located from the processor 601, and these remote memories may be connected to the electronic device of the training sample generation method of the text classification model through a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The electronic device of the training sample generation method of the text classification model may further include: an input device 603 and an output device 604. The processor 601, the memory 602, the input device 603 and the output device 604 may be connected by a bus or other means, and fig. 6 illustrates the connection by a bus as an example.
The input device 603 may receive input numeric or character information and generate key signal inputs related to user settings and function control of an electronic device of a training sample generation method of a text classification model, such as a touch screen, a keypad, a mouse, a track pad, a touch pad, a pointing stick, one or more mouse buttons, a track ball, a joystick, or other input device. The output devices 604 may include a display device, auxiliary lighting devices (e.g., LEDs), and tactile feedback devices (e.g., vibrating motors), among others. The display device may include, but is not limited to, a Liquid Crystal Display (LCD), a Light Emitting Diode (LED) display, and a plasma display. In some implementations, the display device can be a touch screen.
Various implementations of the systems and techniques described here can be realized in digital electronic circuitry, integrated circuitry, application specific ASICs (application specific integrated circuits), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implemented in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
These computer programs (also known as programs, software applications, or code) include machine instructions for a programmable processor, and may be implemented using high-level procedural and/or object-oriented programming languages, and/or assembly/machine languages. As used herein, the terms "machine-readable medium" and "computer-readable medium" refer to any computer program product, apparatus, and/or device (e.g., magnetic discs, optical disks, memory, Programmable Logic Devices (PLDs)) used to provide machine instructions and/or data to a programmable processor, including a machine-readable medium that receives machine instructions as a machine-readable signal. The term "machine-readable signal" refers to any signal used to provide machine instructions and/or data to a programmable processor.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.
The technical solution of the embodiment of the present application relates to the technical fields of natural language processing and deep learning. Seed words of the target content type are acquired and used as search words to search for a plurality of target texts; the target texts are respectively labeled as training samples to generate a training sample set of the target content type; keywords are generated according to a plurality of training samples in the training sample set; the seed words are updated according to the keywords; and the updated seed words are used as search words to search again, with the newly searched target texts labeled as training samples and added to the training sample set. In this way, keywords are extracted from the existing training samples, the seed words are updated by the keywords, and more training samples are then obtained through the updated seed words, so that the training samples are expanded without manual operation, which reduces cost and improves generation efficiency.
The training samples generated in the present application are used to train the text classification model, which can specifically be trained by deep learning. Compared with other machine learning methods, deep learning performs better on large data sets and can therefore improve the training effect of the text classification model.
It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present application may be executed in parallel, sequentially, or in different orders, and the present application is not limited thereto as long as the desired results of the technical solutions disclosed in the present application can be achieved.
The above-described embodiments should not be construed as limiting the scope of the present application. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present application shall be included in the protection scope of the present application.

Claims (18)

1. A method for generating training samples for a text classification model, the method comprising the steps of:
acquiring seed words of a target content type, using the seed words as search words, and searching to obtain a plurality of target texts;
labeling the plurality of target texts with the target content type respectively as training samples, so as to generate a training sample set of the target content type;
generating a keyword according to a plurality of training samples in the training sample set;
updating the seed words according to the keywords; and
and taking the updated seed words as search words, searching again, taking the newly searched target texts as training samples, labeling them with the target content type, and adding them to the training sample set.
2. The training sample generation method of claim 1, wherein the generating a keyword from a plurality of the training samples in the set of training samples comprises:
according to the seed words, determining a plurality of target samples containing the seed words in a title or an abstract from the training sample set;
extracting a plurality of candidate keywords for each target sample respectively;
determining semantic relatedness between each candidate keyword and the target sample; and
and screening a plurality of candidate keywords according to the semantic relevance between each candidate keyword and the target sample to obtain the keywords.
3. The training sample generation method according to claim 2, wherein after the extracting of a plurality of candidate keywords for each of the target samples, the method further comprises:
determining semantic relevance between each candidate keyword and the seed word; and
and screening a plurality of candidate keywords according to the semantic relevance between each candidate keyword and the seed word.
4. The training sample generation method of claim 1, wherein the updating the seed word according to the keyword comprises:
and selecting the keywords according to the co-occurrence times of the keywords and the seed words in the training sample set so as to update the seed words by adopting the selected keywords.
5. The training sample generation method of any one of claims 1-4, wherein the obtaining seed words of a target content type comprises:
using the category name of the target content type as an initial seed word;
querying a log to obtain expanded seed words used to search for the same text as the initial seed word; and
and taking the initial seed word and the expanded seed word as seed words of the target content type.
6. The training sample generation method of claim 5, wherein the querying of the log to obtain expanded seed words used to search for the same text as the initial seed word comprises:
inquiring the log to obtain a plurality of first search records taking the initial seed words as search words;
determining the title of a text with browsing operation according to the first search records;
and taking the search word of the searched title as the expansion seed word.
7. The training sample generation method according to claim 6, wherein the taking of the search word that searched for the title as the extended seed word further comprises:
generating binary information of the title and the initial seed words for each first search record;
and screening the title according to the occurrence frequency of the binary information.
8. The training sample generation method of any of claims 1-4, wherein the searching to obtain a plurality of target texts comprises:
inquiring the log to obtain a plurality of second search records taking the seed words as search words;
determining a plurality of browsing texts with browsing operations according to the plurality of second search records; and
and selecting the target text from the plurality of browsing texts according to the browsing times of the plurality of browsing texts.
9. A training sample generation apparatus, the apparatus comprising:
the acquisition module is used for acquiring seed words of the target content type, using the seed words as search words, and searching to obtain a plurality of target texts;
the labeling module is used for labeling the plurality of target texts with the target content type respectively as training samples, so as to generate a training sample set of the target content type;
the extraction module is used for generating keywords according to a plurality of training samples in the training sample set;
the updating module is used for updating the seed words according to the keywords; and
and the execution module is used for searching again with the updated seed words as search words, labeling the newly searched target texts with the target content type as training samples, and adding them to the training sample set.
10. The training sample generation apparatus of claim 9, wherein the extraction module comprises:
a first determining unit, configured to determine, according to the seed word, a plurality of target samples including the seed word in a title or an abstract from the training sample set;
an extracting unit, configured to extract a plurality of candidate keywords for each of the target samples, respectively;
the first determining unit is used for determining the semantic relevance between each candidate keyword and the target sample;
and the screening unit is used for screening the candidate keywords according to the semantic correlation degree between each candidate keyword and the target sample so as to obtain the keywords.
11. The training sample generating apparatus of claim 10,
the first determining unit is further configured to determine a semantic correlation between each candidate keyword and the seed word;
the screening unit is further configured to screen the plurality of candidate keywords according to semantic relevance between each candidate keyword and the seed word.
12. The training sample generating apparatus according to claim 9,
the updating module is specifically configured to select the keyword according to the number of co-occurrences of the keyword and the seed word in the training sample set, so as to update the seed word with the selected keyword.
13. The training sample generating apparatus of any of claims 9-12, wherein the obtaining module comprises:
a second determining unit, configured to use the category name of the target content type as an initial seed word;
the query unit is used for querying the log to obtain expanded seed words used to search for the same text as the initial seed word; and
the second determining unit is further configured to use the initial seed word and the extended seed word as seed words of the target content type.
14. The training sample generating apparatus of claim 13,
the query unit is specifically configured to query the log to obtain a plurality of first search records using the initial seed word as a search word, determine a title of a text in which a browsing operation exists according to the plurality of first search records, and use the search word searched for the title as the extended seed word.
15. The training sample generating apparatus of claim 14,
the query unit is further specifically configured to generate binary information of the title and the initial seed word for each of the first search records, and filter the title according to the occurrence frequency of the binary information.
16. The training sample generating apparatus according to any one of claims 9 to 12,
the obtaining module is further configured to query a log to obtain a plurality of second search records using the seed word as a search word, determine a plurality of browsing texts with browsing operations according to the plurality of second search records, and select the target text from the plurality of browsing texts according to browsing times of the plurality of browsing texts.
17. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the training sample generation method of any of claims 1-8.
18. A non-transitory computer readable storage medium storing computer instructions for causing a computer to perform the training sample generation method of any one of claims 1 to 8.
CN202010493959.9A 2020-06-03 2020-06-03 Training sample generation method and device of text classification model and electronic equipment Active CN111831821B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010493959.9A CN111831821B (en) 2020-06-03 2020-06-03 Training sample generation method and device of text classification model and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010493959.9A CN111831821B (en) 2020-06-03 2020-06-03 Training sample generation method and device of text classification model and electronic equipment

Publications (2)

Publication Number Publication Date
CN111831821A true CN111831821A (en) 2020-10-27
CN111831821B CN111831821B (en) 2024-01-09

Family

ID=72897601

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010493959.9A Active CN111831821B (en) 2020-06-03 2020-06-03 Training sample generation method and device of text classification model and electronic equipment

Country Status (1)

Country Link
CN (1) CN111831821B (en)

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104346406A (en) * 2013-08-08 2015-02-11 北大方正集团有限公司 Training corpus expanding device and training corpus expanding method
CN108536676A (en) * 2018-03-28 2018-09-14 广州华多网络科技有限公司 Data processing method, device, electronic equipment and storage medium
US20190220514A1 (en) * 2017-02-23 2019-07-18 Tencent Technology (Shenzhen) Company Ltd Keyword extraction method, computer equipment and storage medium
CN110162770A (en) * 2018-10-22 2019-08-23 腾讯科技(深圳)有限公司 Word expansion method, apparatus, device and medium
US20200034482A1 (en) * 2018-07-26 2020-01-30 International Business Machines Corporation Verifying and correcting training data for text classification
CN110781307A (en) * 2019-11-06 2020-02-11 北京沃东天骏信息技术有限公司 Target item keyword and title generation method, search method and related equipment
CN110991181A (en) * 2019-11-29 2020-04-10 腾讯科技(深圳)有限公司 Method and apparatus for enhancing labeled samples
CN111209370A (en) * 2019-12-27 2020-05-29 同济大学 Text classification method based on neural network interpretability

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Huang Ying: "A Preliminary Exploration of a Relevance-Feedback-Based Feature Fusion Image Retrieval Optimization Strategy", 数字图书馆论坛 (Digital Library Forum), no. 02 *

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2022095682A1 (en) * 2020-11-04 2022-05-12 腾讯科技(深圳)有限公司 Text classification model training method, text classification method and apparatus, device, storage medium, and computer program product
CN112560425A (en) * 2020-12-24 2021-03-26 北京百度网讯科技有限公司 Template generation method and device, electronic equipment and storage medium
CN112560425B (en) * 2020-12-24 2024-04-09 北京百度网讯科技有限公司 Template generation method and device, electronic equipment and storage medium
CN112541125A (en) * 2020-12-25 2021-03-23 北京百度网讯科技有限公司 Sequence labeling model training method and device and electronic equipment
CN112541125B (en) * 2020-12-25 2024-01-12 北京百度网讯科技有限公司 Sequence labeling model training method and device and electronic equipment
CN112784050A (en) * 2021-01-29 2021-05-11 北京百度网讯科技有限公司 Method, device, equipment and medium for generating theme classification data set
CN112988999A (en) * 2021-03-17 2021-06-18 平安科技(深圳)有限公司 Construction method, device, equipment and storage medium for Buddhist question-answer pairs
CN113240485A (en) * 2021-05-10 2021-08-10 北京沃东天骏信息技术有限公司 Training method of text generation model, and text generation method and device
CN113806483A (en) * 2021-09-17 2021-12-17 北京百度网讯科技有限公司 Data processing method and device, electronic equipment and computer program product
CN113806483B (en) * 2021-09-17 2023-09-05 北京百度网讯科技有限公司 Data processing method, device, electronic equipment and computer program product
CN117235237A (en) * 2023-11-10 2023-12-15 腾讯科技(深圳)有限公司 Text generation method and related device
CN117235237B (en) * 2023-11-10 2024-03-12 腾讯科技(深圳)有限公司 Text generation method and related device

Also Published As

Publication number Publication date
CN111831821B (en) 2024-01-09

Similar Documents

Publication Publication Date Title
CN111831821B (en) Training sample generation method and device of text classification model and electronic equipment
CN110543574B (en) Knowledge graph construction method, device, equipment and medium
US10558754B2 (en) Method and system for automating training of named entity recognition in natural language processing
CN111221983B (en) Time sequence knowledge graph generation method, device, equipment and medium
CN112507068B (en) Document query method, device, electronic equipment and storage medium
CN111507104B (en) Method and device for establishing label labeling model, electronic equipment and readable storage medium
CN112541076B (en) Method and device for generating expanded corpus in target field and electronic equipment
US11521603B2 (en) Automatically generating conference minutes
CN110991196B (en) Translation method and device for polysemous words, electronic equipment and medium
US11907671B2 (en) Role labeling method, electronic device and storage medium
CN111125435A (en) Video tag determination method and device and computer equipment
CN112528001B (en) Information query method and device and electronic equipment
CN111783468A (en) Text processing method, device, equipment and medium
CN111783861A (en) Data classification method, model training device and electronic equipment
CN111310058B (en) Information theme recommendation method, device, terminal and storage medium
CN111090991A (en) Scene error correction method and device, electronic equipment and storage medium
CN112380847A (en) Interest point processing method and device, electronic equipment and storage medium
CN111984774A (en) Search method, device, equipment and storage medium
CN115248839A (en) Knowledge system-based long text retrieval method and device
US11468236B2 (en) Method and apparatus for performing word segmentation on text, device, and medium
CN111259058B (en) Data mining method, data mining device and electronic equipment
CN111666417A (en) Method and device for generating synonyms, electronic equipment and readable storage medium
CN111832313B (en) Method, device, equipment and medium for generating emotion matching set in text
CN111339319B (en) Enterprise name disambiguation method and device, electronic equipment and storage medium
CN112328710A (en) Entity information processing method, entity information processing device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant