CN112101020B - Method, apparatus, device and storage medium for training key phrase identification model - Google Patents


Info

Publication number
CN112101020B
Authority
CN
China
Prior art keywords: training, key phrase, text, training data, key
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010880346.0A
Other languages
Chinese (zh)
Other versions
CN112101020A (en)
Inventor
杨虎
汪琦
王述
张晓寒
冯知凡
柴春光
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202010880346.0A
Publication of CN112101020A
Application granted
Publication of CN112101020B
Legal status: Active
Anticipated expiration


Classifications

    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F16/335 Filtering based on additional data, e.g. user or group profiles
    • G06F16/367 Ontology
    • G06F16/9535 Search customisation based on user profiles and personalisation
    • G06F40/295 Named entity recognition
    • G06N3/045 Combinations of networks
    • G06N3/049 Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • G06N3/08 Learning methods

Abstract

The application discloses a method, apparatus, device, and storage medium for training a key phrase identification model, relating to the fields of artificial intelligence, knowledge graphs, and deep learning. The method for training the key phrase identification model comprises the following steps: acquiring first training data related to a target domain, wherein key phrases related to the target domain in first training texts of the first training data are identified; acquiring general training data that is not related to the target domain, wherein key phrases that are not related to the target domain in general training texts of the general training data are identified; and training a key phrase identification model for the target domain based on the first training data and the general training data, for use in identifying text to be identified that is related to the target domain. In this way, an accurate key phrase identification model for the target domain can be obtained with only a small amount of identified data in that domain.

Description

Method, apparatus, device and storage medium for training key phrase identification model
Technical Field
The present disclosure relates to the field of data processing, in particular, to the field of artificial intelligence, knowledge-graph and deep learning, and more particularly, to methods, apparatus, devices and storage media for training key phrase identification models.
Background
With the development of computer technology, various data processing schemes based on machine learning are now available. For example, machine learning techniques have been used to process text and identify the key phrases it contains. For a video, say, the title and introduction text may contain key phrases that help in understanding the video content. However, because these texts may belong to different domains, each with its own characteristics, an identification model specific to a particular domain may be required when performing key phrase identification on text from that domain, and a large amount of manually identified data is needed to train such a model.
Disclosure of Invention
The present disclosure provides a method, apparatus, device, and storage medium for training a key phrase identification model.
According to a first aspect of the present disclosure, a method for training a key phrase identification model is provided. The method includes obtaining first training data related to a target domain, wherein key phrases related to the target domain in first training text of the first training data are identified. The method further includes obtaining generic training data that is not related to the target domain, wherein key phrases in generic training text of the generic training data that are not related to the target domain are identified. The method further includes training a key phrase identification model for the target domain based on the first training data and the generic training data for identifying text to be identified that is related to the target domain.
According to a second aspect of the present disclosure, there is provided a method for identifying key phrases in text to be identified. The method comprises obtaining text to be identified that is related to the target domain. The method further includes identifying key phrases related to the target domain in the text to be identified, using a key phrase identification model trained in accordance with the method of the first aspect of the present disclosure.
According to a third aspect of the present disclosure, an apparatus for training a key phrase identification model is provided. The apparatus includes a first training data acquisition module configured to acquire first training data related to a target domain, wherein key phrases related to the target domain in first training text of the first training data are identified. The apparatus also includes a generic training data acquisition module configured to acquire generic training data that is not related to the target domain, wherein key phrases in generic training text of the generic training data that are not related to the target domain are identified. The apparatus further includes a model training module configured to train a key phrase identification model for the target domain based on the first training data and the generic training data for identifying text to be identified that is related to the target domain.
According to a fourth aspect of the present disclosure, there is provided an apparatus for identifying key phrases in text to be identified. The device comprises: and the text to be identified acquisition module is configured to acquire the text to be identified related to the target field. The apparatus further comprises: a text to be identified identification module configured to identify key-phrases in text to be identified that are related to a target domain using the key-phrase identification model trained in accordance with the method of the first aspect of the present disclosure.
According to a fifth aspect of the present disclosure, there is provided an electronic device comprising: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method according to the first aspect of the present disclosure.
According to a sixth aspect of the present disclosure, there is provided a non-transitory computer readable storage medium storing computer instructions for causing a computer to perform the method according to the first aspect of the present disclosure.
According to a seventh aspect of the present disclosure, there is provided a computer program product comprising computer program instructions for implementing a method according to the first or second aspect of the present disclosure by a processor.
The techniques according to the present application can train with a small amount of identified data in the target domain to obtain an accurate key phrase identification model for the target domain.
It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the disclosure, nor is it intended to be used to limit the scope of the disclosure. Other features of the present disclosure will become apparent from the following specification.
Drawings
The above and other features, advantages and aspects of embodiments of the present disclosure will become more apparent by reference to the following detailed description when taken in conjunction with the accompanying drawings. In the drawings, wherein like or similar reference numerals designate like or similar elements, and wherein:
FIG. 1 illustrates a schematic diagram of an example system 100 in which various embodiments of the present disclosure may be implemented;
FIG. 2 illustrates a flowchart of a method 200 for training a key phrase identification model, according to some embodiments of the present disclosure;
FIG. 3 illustrates an example environment 300 for obtaining a key phrase identification model for a target domain, according to some embodiments of the disclosure;
FIG. 4 illustrates a flowchart of a method 400 of obtaining a key phrase identification model for a target domain, according to some embodiments of the present disclosure;
FIG. 5 illustrates a flowchart of a method 500 of acquiring second training data, according to some embodiments of the present disclosure;
FIG. 6 illustrates an example environment 600 for training a key phrase identification model in accordance with some embodiments of the present disclosure;
FIG. 7 illustrates a schematic diagram of a method 700 for identifying key phrases in text to be identified, according to some embodiments of the present disclosure;
FIG. 8 illustrates a schematic block diagram of an apparatus 800 for training a key phrase identification model in accordance with an embodiment of the present disclosure;
FIG. 9 illustrates a schematic block diagram of an apparatus 900 for identifying key phrases in text to be identified in accordance with an embodiment of the present disclosure; and
fig. 10 illustrates a block diagram of an electronic device 1000 capable of implementing various embodiments of the disclosure.
Detailed Description
Exemplary embodiments of the present application are described below in conjunction with the accompanying drawings, which include various details of the embodiments of the present application to facilitate understanding, and should be considered as merely exemplary. Accordingly, one of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present application. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
In describing embodiments of the present disclosure, the term "comprising" and its like should be taken to be open-ended, i.e., including, but not limited to. The term "based on" should be understood as "based at least in part on". The term "one embodiment" or "the embodiment" should be understood as "at least one embodiment". The terms "first," "second," and the like, may refer to different or the same object. Other explicit and implicit definitions are also possible below.
Schemes that extract key phrases using hand-written rules have poor portability and limited coverage. When faced with text from a new domain, new rules must be added according to the characteristics of that domain, and the new rules may contradict the existing ones. Schemes based on traditional machine learning require manual design of effective features. Such features are difficult to adapt across domains, and for the training data required by model training, there may be specific domains where little training data can be collected and the time cost of collecting it is high, so that an accurate identification model cannot be trained. Moreover, even when a sufficient amount of training data can be collected, it usually has to be identified manually, which is costly and, for certain specialized domains, requires the participation or guidance of a domain expert.
To address, at least in part, one or more of the problems described above, as well as other potential problems, embodiments of the present disclosure propose a technique for training a key phrase identification model. In this scheme, by further training a key phrase identification model for the target domain using both training data related to the target domain and general training data, a model can be obtained that identifies key phrases related to the target domain in text to be identified. In this way, an accurate key phrase identification model for a particular target domain can be trained using only a small amount of identified data from that domain, reducing the cost of collecting and identifying data. In addition, the scheme is applicable to multiple different domains: it suffices to prepare training data corresponding to each domain to obtain a key phrase identification model for each of them.
Embodiments of the present disclosure will be described in detail below with reference to the accompanying drawings. Herein, the term "field" (or "domain") refers to a range of specialized activity, examples of which include, but are not limited to, military, literature, art, medicine, sports, and the like. The term "model" refers to a construct that can learn the association between inputs and outputs from training data, so that after training it can generate a corresponding output for a given input. A "model" may also be referred to as a "neural network", "learning model", or "learning network". The term "key phrase", sometimes also called a "named entity", refers to one or more keywords that appear in a piece of content. Key phrases may be determined based on the intent of the user; examples include, but are not limited to, person names, place names, book titles, health-related entities, China-US-related entities, and so forth.
Fig. 1 illustrates a schematic diagram of an example system 100 in which various embodiments of the present disclosure may be implemented. The system 100 may generally include a model training subsystem 110 and a model application subsystem 120. It should be understood that the structure and function of system 100 are described for exemplary purposes only and do not imply any limitation on the scope of the present disclosure. Embodiments of the present disclosure may also be applied in environments having different structures and/or functions.
In the model training subsystem 110, the model training means 111 may acquire the first training data 101 and the generic training data 102. It will be appreciated that the first training data 101 is for a target domain, while the generic training data 102 may be for various domains.
The model training device 111 may train with the first training data 101 and the general training data 102, so that the trained key phrase identification model 103 for the target domain can accurately process text to be identified that is related to the target domain. Training on these two data sets may start from the initial identification model 112. Before training with them, the initial identification model 112 may first be trained with training data that is related to the target domain but not identified, to obtain the key phrase identification model 103. In this way, the number of identified training samples required in the target domain is reduced: further training on a small number of identified samples from the target domain (i.e., the first training data 101) is enough to obtain a final model that is accurate for that domain.
In the model application subsystem 120, the model application device 121 may obtain the text 104 to be identified, which is related to the target domain. After the text 104 to be identified is input into the key phrase identification model 103 for the target domain and processed, the model application device 121 may output an identification result 105 for the text 104, and the identification result 105 marks the key phrases in the text to be identified that are related to the target domain.
For clarity of illustration, embodiments of the present disclosure will be described below with reference to system 100 of fig. 1. It should be understood that embodiments of the present disclosure may also include additional acts not shown and/or may omit acts shown. The scope of the present disclosure is not limited in this respect. For ease of understanding, specific data set forth in the following description are intended to be exemplary and are not intended to limit the scope of the disclosure.
FIG. 2 illustrates a flowchart of a method 200 for training a key phrase identification model, according to some embodiments of the present disclosure. For example, the method 200 may be performed by the model training apparatus 111 as shown in fig. 1. The various actions of method 200 will be described in detail below in conjunction with FIG. 1. It should be appreciated that method 200 may also include additional actions not shown and/or may omit actions shown. The scope of the present disclosure is not limited in this respect.
At block 202, the model training apparatus 111 obtains first training data 101 related to the target domain, wherein key phrases related to the target domain in first training text of the first training data 101 are identified.
Taking the case where the target domain is military as an example, the first training data 101 may include a plurality of first training texts related to the military. A first training text may be a sentence or a paragraph associated with the military; for example, it may be extracted from entries of encyclopedia data or a knowledge graph of the military domain, with the key phrases related to the military identified in it.
In some embodiments, B (Begin), I (Inside), and O (Outside) tags may be used to identify key phrases in text: the B tag marks the starting character of a key phrase, the I tag marks the characters of the key phrase other than the starting character, and the O tag marks the characters of the sentence that do not belong to any key phrase. Besides the first training text, all of the "identified" text referred to herein may be identified using this method.
In some other embodiments of the present disclosure, other tags besides BIO tags may also be utilized to identify key phrases in training text, the scope of the present disclosure is not limited in this respect.
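As an illustration of the BIO scheme described above, the following sketch (a hypothetical helper, not taken from the patent) assigns character-level B/I/O tags to a text given a list of known key phrases:

```python
def bio_tags(text, key_phrases):
    """Assign a B/I/O tag to every character of `text`.

    B marks the first character of a key phrase, I marks the remaining
    characters of that phrase, and O marks everything else.
    """
    tags = ["O"] * len(text)
    for phrase in key_phrases:
        start = text.find(phrase)
        while start != -1:
            tags[start] = "B"
            for i in range(start + 1, start + len(phrase)):
                tags[i] = "I"
            start = text.find(phrase, start + len(phrase))
    return tags

# A phrase of length 3 inside a 5-character string:
print(bio_tags("xM9Sx", ["M9S"]))  # → ['O', 'B', 'I', 'I', 'O']
```

Character-level (rather than word-level) tagging is assumed here because the patent describes tags on characters, which suits Chinese text.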
In some embodiments, the first training data 101 is obtained by the following steps. First, first training texts related to the target domain are acquired; their number may be relatively small, for example on the order of hundreds or thousands. The first training texts are then identified so that the key phrases in them that are related to the target domain are marked; in some embodiments this identification may be performed manually. Finally, the first training data 101 may be generated based on the identified first training texts.
In the manner described above, the model training device 111 may obtain the identified plurality of training texts related to the target field.
At block 204, the model training apparatus 111 obtains generic training data 102 that is not related to the target domain, wherein key phrases in generic training text of the generic training data 102 that are not related to the target domain are identified.
It will be appreciated that, since the generic training data 102 need not be related to any specific target domain, the number of generic training texts that the model training device 111 can obtain may be much larger than the number of first training texts. In some embodiments, the generic training data 102 may reuse text that has already been identified in other key phrase identification tasks. In some other embodiments, the generic training data 102 may be collected by mining, as follows. First, the model training device 111 obtains a plurality of training texts that carry user tags. For example, in a video website, encyclopedia, or question-and-answer website, a large amount of data (containing corresponding text, e.g., video titles) carries tags assigned by users when uploading or publishing. These tags generally reflect the users' understanding and description of the content and therefore tend to be highly relevant to it; the model training device 111 may obtain a plurality of training texts associated with such data together with the corresponding tags. The model training device 111 then identifies the plurality of training texts so that the key phrases in them are marked; it may use a preliminary identification model for this, to understand the content of the training texts.
Next, the model training apparatus 111 selects a general training text from the identified plurality of training texts based on the user tag and the plurality of key phrases, for example, the model training apparatus 111 may filter the plurality of training texts based on the degree of matching between the user tag and the identified plurality of key phrases in combination with information such as the frequency of use, the number of tags, and the tag part of speech of data corresponding to the plurality of training texts, thereby selecting a general training text therefrom. Finally, the model training device 111 may generate the generic training data 102 based on the generic training text.
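The matching-based filtering described above might be sketched as follows. The scoring function and the 0.5 cutoff are illustrative assumptions; the patent combines the match degree with other signals (usage frequency, tag count, tag part of speech) that are omitted here for brevity:

```python
def tag_match_score(user_tags, key_phrases):
    """Fraction of user-supplied tags that also appear among the
    key phrases identified by the preliminary model."""
    if not user_tags:
        return 0.0
    return len(set(user_tags) & set(key_phrases)) / len(user_tags)

def select_generic_texts(candidates, min_score=0.5):
    """candidates: (text, user_tags, identified_key_phrases) triples.
    Keep texts whose user tags agree well enough with the model output."""
    return [text for text, tags, phrases in candidates
            if tag_match_score(tags, phrases) >= min_score]

samples = [
    ("video about the M9 pistol", ["M9", "pistol"], ["M9", "pistol"]),
    ("random vlog", ["daily", "fun"], ["camera"]),
]
print(select_generic_texts(samples))  # → ['video about the M9 pistol']
```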
At block 206, the model training apparatus 111 trains a key phrase identification model for the target domain based on the first training data 101 and the generic training data 102. The resulting key phrase identification model 103 may be used to identify text 104 to be identified that is related to the target domain.
The model training means 111 may train the key phrase identification model 103 for the target domain using both the identified first training data 101, which contains a small amount of training data but is strongly related to the target domain, and the identified generic training data 102, which contains a large amount of training data but is weakly related to the target domain, to adjust the key phrase identification model 103, updating its parameters so that it is more adapted to the target domain. The process of obtaining the key phrase identification model 103 for the target domain will be described in detail below with reference to fig. 3 and 4.
In some embodiments, training of the key phrase identification model 103 includes iteratively performing, by the model training device 111, the following steps until the identification accuracy of the key phrase identification model 103 is above a predetermined threshold: training the key phrase identification model 103 with the first training data 101 to update it; then training the updated model with the generic training data 102 to update it again. By training the model with both kinds of data, the generic training data 102 supplements the small number of identified samples in the first training data 101 for the target domain, improving the accuracy of the identification model, so that the trained model can accurately identify the text 104 to be identified, in particular text related to the target domain.
It will be appreciated that the number and order of use of the first training data 101 and the generic training data 102 by the model training device 111 in the above-described iterative process may be arbitrary, provided that both are used a sufficient number of times during the training process such that the accuracy of the final key phrase identification model 103 is above a predetermined threshold. For example, the model training device 111 may also train the updated key phrase identification model 103 once using the first training data 101, then train the updated key phrase identification model 103 once using the generic training data 102, and train the updated key phrase identification model 103 once using the first training data 101, and so iterate. The model training means 111 may also use the first training data 101 to train the updated key phrase identification model 103 twice and then use the generic training data 102 to train the updated key phrase identification model 103 once, thus iterating.
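The alternating schedule described above can be sketched abstractly. Here `train_step` and `evaluate` are placeholders for whatever optimizer update and validation metric the concrete model uses; the toy illustration below stands in for both:

```python
def train_alternating(model, first_data, generic_data,
                      train_step, evaluate, threshold, max_rounds=100):
    """Alternate updates on target-domain data and generic data until
    the evaluation metric exceeds `threshold` (or max_rounds is hit)."""
    for _ in range(max_rounds):
        model = train_step(model, first_data)    # domain-specific update
        model = train_step(model, generic_data)  # generic update
        if evaluate(model) > threshold:
            break
    return model

# Toy illustration: the "model" is a number that each step increments.
result = train_alternating(
    model=0, first_data=None, generic_data=None,
    train_step=lambda m, d: m + 1,
    evaluate=lambda m: m / 10,
    threshold=0.5,
)
print(result)  # → 6
```

Other interleavings (two domain steps per generic step, etc.) fit the same skeleton by reordering the calls inside the loop.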
Thus, according to embodiments of the present disclosure, training may be performed with only a small amount of identified data in the target domain to obtain an accurate key phrase identification model for the target domain, thereby reducing the collection and identification costs of the data. Moreover, the scheme can be suitable for different multiple fields, and only a plurality of training data corresponding to the multiple fields are needed to be prepared, so that the key phrase identification model for the multiple fields can be obtained.
FIG. 3 illustrates an example environment 300 for obtaining a key phrase identification model for a target domain, according to some embodiments of the disclosure. Fig. 4 illustrates a flowchart of a method 400 of obtaining a key phrase identification model for a target domain, according to some embodiments of the present disclosure. For example, the method 400 may be performed by the model training apparatus 111 shown in fig. 1, or the model training apparatus 311 shown in fig. 3. The various actions of method 400 are described in detail below in conjunction with FIG. 3. It should be appreciated that method 400 may also include additional actions not shown and/or may omit actions shown. The scope of the present disclosure is not limited in this respect.
At block 402, model training device 311 obtains second training data 301 related to the target domain; the second training texts of the second training data 301 are unidentified.
In particular, the second training data 301 is related to the target domain, but the second training texts in it need not be identified; employing such data therefore reduces the cost of producing the key phrase identification model 303. In some embodiments, the second training data 301 may be obtained, using text mining techniques, from a database containing text, such as a knowledge graph, encyclopedia data, or a question-answering website; this process is described in detail below with reference to fig. 5.
At block 404, the model training device 311 trains the initial identification model 312 based on the second training data 301 to obtain the key phrase identification model 303 for the target domain.
The initial identification model 312 may be an existing generic key phrase identification model, such as a BERT or ERNIE model, but the identification accuracy of such models on text to be identified in the target domain may not be high. Specifically, the second training data 301 may be input into the initial identification model 312, and the parameters of the initial identification model 312 may then be adjusted based on the output result to update it. By repeating this procedure, the key phrase identification model 303 for the target domain can be obtained.
In this way, training with the unidentified second training data yields a key phrase identification model 303 for the target domain that serves as the basis for subsequent further training to identify text to be identified. Because the second training data does not require manual identification, and a large amount of text in the target domain is available for collection, the cost of collecting and identifying training data can be reduced.
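The patent does not fix a training objective for this adaptation step; masked-character prediction is one common choice for adapting BERT-style models on unlabeled text, and the example construction it implies might be sketched as (hypothetical helper, parameters illustrative):

```python
import random

def make_masked_examples(sentences, mask_token="[MASK]", mask_prob=0.15, seed=0):
    """Turn unlabeled domain sentences into (masked_chars, targets) pairs,
    where `targets` maps each masked position back to its original character.
    Such pairs would drive a standard masked-prediction loss."""
    rng = random.Random(seed)
    examples = []
    for sent in sentences:
        chars = list(sent)
        targets = {}
        for i, ch in enumerate(chars):
            if rng.random() < mask_prob:
                targets[i] = ch          # remember what was here
                chars[i] = mask_token    # hide it from the model
        if targets:  # skip sentences where nothing was masked
            examples.append((chars, targets))
    return examples
```

The model is then updated to predict the hidden characters, which adapts its representations to target-domain vocabulary without any manual identification.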
Fig. 5 illustrates a flowchart of a method 500 of acquiring second training data, according to some embodiments of the present disclosure. Method 500 is one exemplary specific process of block 402 in method 400. For example, the method 500 may be performed by the model training apparatus 111 as shown in fig. 1. The various actions of method 500 will be described in detail below in conjunction with FIG. 1. It should be appreciated that method 500 may also include additional actions not shown and/or may omit actions shown. The scope of the present disclosure is not limited in this respect.
At block 502, the model training apparatus 111 determines a plurality of key phrase types associated with the target domain.
Specifically, again taking the target field as "military" as an example, model training apparatus 111 may determine a plurality of key phrase types associated with the military field, including, but not limited to: weapons, airplanes, missiles, tanks, firearms, pistols, etc.
At block 504, the model training apparatus 111 determines a plurality of key phrase entries associated with a plurality of key phrase types from a database.
In some embodiments, model training apparatus 111 may utilize the plurality of key phrase types described above to filter a database, such as a knowledge graph, encyclopedia data, or a question-and-answer website, to select a plurality of key phrase entries matching those key phrase types. For example, for the firearm type, model training apparatus 111 may determine entries associated with that type, such as the M9 pistol, the M14 rifle, and the M12S submachine gun.
At block 506, the model training device 111 obtains second training data based on the plurality of key phrase entries.
In some embodiments, the model training device 111 may extract a corresponding plurality of descriptive texts from the plurality of key phrase entries determined in block 504 as the second training text, and generate the second training data based on the second training text. For example, when the database is encyclopedia data, the model training device 111 may extract the profile text under each determined entry as the second training text. Taking the military field again as an example, when the determined key phrase entry is the M9 pistol, the model training device 111 can extract the descriptive text "The M9 pistol adopts the short-recoil barrel action principle with a falling locking-block design and a single/double-action trigger; it feeds from a removable 15-round magazine, has an overall length of 217 mm, weighs 1.1 kg (including a loaded magazine), and has a muzzle velocity of 390 m/s. The M9 has a simple structure and reliable mechanical action, and its full-gun service life exceeds 5000 rounds." as the second training text.
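The filtering in block 504 and the extraction in block 506 can be pictured with a small, purely illustrative sketch. Everything below — the toy database, its field names, and the `collect_second_training_texts` helper — is an assumption for demonstration, not the patent's actual implementation:

```python
# Hypothetical sketch of blocks 504-506: filter an encyclopedia-style
# database by the key phrase types of the target domain, then collect the
# description text of each matching entry as unidentified training text.
TOY_DATABASE = [
    {"entry": "M9 pistol", "type": "firearm",
     "description": "The M9 pistol adopts a short-recoil action ..."},
    {"entry": "M14 rifle", "type": "firearm",
     "description": "The M14 rifle is a selective-fire rifle ..."},
    {"entry": "T-90", "type": "tank",
     "description": "The T-90 is a main battle tank ..."},
    {"entry": "sunflower", "type": "plant",
     "description": "The sunflower is an annual plant ..."},
]

def collect_second_training_texts(database, key_phrase_types):
    """Keep only descriptions of entries whose type matches the target domain."""
    return [record["description"]
            for record in database
            if record["type"] in key_phrase_types]

military_types = {"firearm", "tank", "missile"}
second_training_texts = collect_second_training_texts(TOY_DATABASE, military_types)
print(len(second_training_texts))  # 3 -- the plant entry is filtered out
```

No manual identification is involved at this stage: the collected descriptions are used as-is as the second training text.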
Not all key phrase types may be determinable at once for the target domain. Thus, in some embodiments, the step of determining a plurality of key phrase types in block 502 may specifically include the following steps.
First, the model training apparatus 111 may determine candidate key phrase types associated with the target domain, e.g., for the military domain, five candidate key phrase types such as weapons, airplanes, missiles, tanks, and firearms. Then, using steps similar to those in block 504, the model training apparatus 111 may determine candidate key phrase entries associated with the candidate key phrase types from the database, e.g., for the firearm type, entries such as the M9 pistol, the M14 rifle, and the M12S submachine gun. Based on the candidate key phrase entries, the model training apparatus 111 may determine extended key phrase types associated with the target domain. For example, for the entry "M9 pistol" belonging to the firearm type, it may be determined that the M9 pistol also belongs to the "pistol" type; the model training apparatus 111 may count the total number of candidate key phrase entries belonging to the "pistol" type and, if that total is greater than a predetermined threshold, treat "pistol" as an extended key phrase type. In this way, several extended key phrase types may be determined based on the candidate key phrase entries. The model training apparatus 111 may then determine the plurality of key phrase types based on the candidate key phrase types and the extended key phrase types, and further perform the steps described in blocks 504 and 506 based on the plurality of key phrase types to extend the second training data. In some embodiments, the plurality of key phrase types may include the candidate key phrase types as well as some of the extended key phrase types.
Model training apparatus 111 may filter the extended key phrase types through various rules to determine which of them are included among the plurality of key phrase types; example rules include relevance to the military field and/or overlap with existing candidate key phrase types.
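The entry-counting step described above — promoting a type to an extended key phrase type only when enough candidate entries carry it — can be sketched as follows. The entry-to-types mapping, the helper name, and the threshold value are illustrative assumptions:

```python
from collections import Counter

def extend_key_phrase_types(entry_types, candidate_types, threshold=2):
    """Count types carried by candidate entries beyond the candidate set;
    keep those whose total entry count exceeds a predetermined threshold."""
    counts = Counter()
    for types in entry_types.values():
        for t in types:
            if t not in candidate_types:
                counts[t] += 1
    return {t for t, n in counts.items() if n > threshold}

# Toy candidate entries for the "firearm" candidate type.
entry_types = {
    "M9 pistol": {"firearm", "pistol"},
    "Glock 17": {"firearm", "pistol"},
    "Desert Eagle": {"firearm", "pistol"},
    "M14 rifle": {"firearm", "rifle"},
}
extended = extend_key_phrase_types(entry_types, {"firearm"}, threshold=2)
print(extended)  # {'pistol'}: 3 entries exceed the threshold, 'rifle' (1) does not
```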
It will be appreciated that the above process may be performed repeatedly to obtain a large amount of second training data associated with the target domain, so as to cover as many key phrase types in the target domain as possible. The above procedure may also be performed for multiple different domains to obtain second training data for each of those domains.
In this way, a large amount of text related to the target domain may be obtained from the existing database as second training data for training the initial identification model 112 to obtain the key phrase identification model 103 for the target domain.
FIG. 6 illustrates an example environment 600 for training a key phrase identification model, according to some embodiments of the disclosure. The specific process of training in block 206 will be described below in conjunction with fig. 6.
The first training data 601 and the generic training data 602 may be used separately to train a key phrase identification model for a target domain. A training process using the first training text in the first training data 601 will be described as an example.
First, the first training data 601 is input to the model training device 611. The model training device 611 may pre-process the first training text in the first training data 601 to divide sentences in the first training text into a plurality of characters or a plurality of words. The first training text containing a plurality of characters may then be input into the key phrase identification model 603 for the target domain to generate first vectors for the plurality of characters. These first vectors are then input into a recurrent neural network 604 for processing to generate second vectors; examples of the recurrent neural network 604 include, but are not limited to, a unidirectional LSTM (long short-term memory network) and a bidirectional LSTM. The second vectors are then processed using a model 605, such as a conditional random field, to generate labels (e.g., BIO labels) for the plurality of characters. As described with respect to block 202 of fig. 2, these labels may be compared with the labels of the first training data, because the first training text in the first training data has been identified. Parameters can then be adjusted based on the results of the comparison to update the key phrase identification model 603 for the target domain. It is to be understood that the training process of the model training device 611 using the training text in the generic training data 602 is similar and will not be described again here.
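The identified labels against which the model's output is compared can be constructed at character level, for instance as in the sketch below. The sample sentence, the English characters, and the `bio_labels` helper are assumptions for illustration; the actual comparison operates on the CRF output of model 605:

```python
def bio_labels(sentence, key_phrases):
    """Character-level BIO labels: 'B' marks the first character of an
    identified key phrase, 'I' the characters inside it, 'O' everything else."""
    labels = ["O"] * len(sentence)
    for phrase in key_phrases:
        start = sentence.find(phrase)
        while start != -1:
            labels[start] = "B"
            for i in range(start + 1, start + len(phrase)):
                labels[i] = "I"
            start = sentence.find(phrase, start + len(phrase))
    return labels

sentence = "the M9 pistol is reliable"
gold = bio_labels(sentence, ["M9 pistol"])  # phrase starts at character 4
print(gold[4], gold[5])  # B I
```

During training, labels like `gold` are compared with the model's predicted tags, and the mismatches drive the parameter adjustment described above.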
Fig. 7 illustrates a schematic diagram of a method 700 for identifying key phrases in text to be identified, according to some embodiments of the present disclosure. For example, the method 700 may be performed by the model application device 121 as shown in fig. 1. The various actions of method 700 will be described in detail below in conjunction with FIG. 1. It should be appreciated that method 700 may also include additional actions not shown and/or may omit actions shown. The scope of the present disclosure is not limited in this respect.
At block 702, the model application device 121 obtains text 104 to be identified that is related to the target domain. For example, the text to be identified 104 may include any video title text to be identified, video description text, user input text, or the like.
At block 704, the model application device 121 identifies key-phrases related to the target domain in the text 104 to be identified using the trained key-phrase identification model 103.
In some embodiments, the model application means 121 may split the text to be identified into one or more sentences. The model application means 121 may then utilize the key phrase identification model 103 to determine the respective tags (e.g., the BIO tags discussed above) of the characters in each sentence and identify the key phrases in the sentence based on those tags.
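Recovering key phrases from the per-character tags can be sketched as a simple BIO decode; the character list, the tag sequence, and the `decode_bio` helper below are illustrative assumptions rather than the patent's actual implementation:

```python
def decode_bio(chars, tags):
    """Join runs of characters tagged B, I, I, ... into key phrases."""
    phrases, current = [], []
    for ch, tag in zip(chars, tags):
        if tag == "B":                      # a new phrase starts here
            if current:
                phrases.append("".join(current))
            current = [ch]
        elif tag == "I" and current:        # continue the current phrase
            current.append(ch)
        else:                               # 'O' (or a stray 'I') ends it
            if current:
                phrases.append("".join(current))
            current = []
    if current:
        phrases.append("".join(current))
    return phrases

chars = list("M9 pistol fires 9mm")
tags = (["B"] + ["I"] * 8    # "M9 pistol"
        + ["O"] * 7          # " fires "
        + ["B", "I", "I"])   # "9mm"
print(decode_bio(chars, tags))  # ['M9 pistol', '9mm']
```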
Fig. 8 shows a schematic block diagram of an apparatus 800 for training a key phrase identification model according to an embodiment of the disclosure. As shown in fig. 8, the apparatus 800 may include a first training data acquisition module 802 configured to acquire first training data related to a target domain, wherein key phrases related to the target domain in first training text of the first training data are identified. The apparatus 800 may further include a generic training data acquisition module 804 configured to acquire generic training data that is not related to the target domain, wherein key phrases in generic training text of the generic training data that are not related to the target domain are identified. The apparatus 800 may further include a model training module 806 configured to train a key phrase identification model for the target domain based on the first training data and the generic training data for identifying text to be identified that is related to the target domain.
In some embodiments, the apparatus 800 further comprises: a second training data acquisition module configured to acquire second training data related to the target domain, a second training text of the second training data being unidentified; and an initial model training module configured to train the initial identification model based on the second training data to obtain a key phrase identification model for the target domain.
In some embodiments, the second training data acquisition module further comprises: a key phrase type determination module configured to determine a plurality of key phrase types associated with the target domain; a key phrase entry determination module configured to determine a plurality of key phrase entries associated with a plurality of key phrase types from a database; and wherein the second training data acquisition module is configured to acquire the second training data based on the plurality of key phrase entries.
In some embodiments, the key phrase type determination module further comprises: a candidate key-phrase type determination module configured to determine a candidate key-phrase type associated with the target domain; a candidate key-phrase entry determination module configured to determine candidate key-phrase entries associated with candidate key-phrase types from a database; an extended key phrase type determination module configured to determine an extended key phrase type associated with the target domain based on the candidate key phrase entries; and wherein the key phrase type determination module is configured to determine a plurality of key phrase types based on the candidate key phrase types and the extended key phrase types.
In some embodiments, the second training data acquisition module further comprises: a descriptive text extraction module configured to extract a corresponding plurality of descriptive texts from the plurality of key phrase entries as a second training text; and a second training data generation module configured to generate second training data based on the second training text.
In some embodiments, the first training data acquisition module 802 further comprises: a first training text acquisition module configured to acquire a first training text related to the target field; a first training text identification module configured to identify the first training text so that key phrases related to the target field in the first training text are identified; and a first training text generation module configured to generate first training data based on the identified first training text.
In some embodiments, the generic training data acquisition module 804 further comprises: a universal training text acquisition module configured to acquire a plurality of training texts having user tags; the universal training text identification module is configured to identify a plurality of training texts so that a plurality of corresponding key phrases in the training texts are identified; a generic training text selection module configured to select a generic training text from the identified plurality of training texts based on the user tag and the plurality of key phrases; and a generic training text generation module configured to generate generic training data based on the generic training text.
In some embodiments, model training module 806 further includes: a first training module configured to train the key phrase identification model with the first training data to update the key phrase identification model; and a second training module configured to train the updated key phrase identification model with the generic training data to update the key phrase identification model again; wherein the model training module 806 is configured to cause the first training module and the second training module to operate iteratively.
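The iterative operation of the two training modules — domain-specific data first, then generic data, repeated — can be sketched as below. `iterative_train` and `train_step` are hypothetical names; the lambda stands in for an actual fine-tuning pass and merely records which dataset was used:

```python
def iterative_train(model_state, first_data, generic_data, train_step, rounds=2):
    """Alternately update the model with domain-specific and generic data."""
    for _ in range(rounds):
        model_state = train_step(model_state, first_data)    # first training module
        model_state = train_step(model_state, generic_data)  # second training module
    return model_state

# Mock train_step: append the dataset name instead of updating real weights.
log = iterative_train([], "first", "generic",
                      train_step=lambda state, data: state + [data],
                      rounds=2)
print(log)  # ['first', 'generic', 'first', 'generic']
```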
Fig. 9 shows a schematic block diagram of an apparatus 900 for identifying key phrases in text to be identified in accordance with an embodiment of the present disclosure. As shown in fig. 9, the apparatus 900 may include a text to be identified acquisition module 902 configured to acquire text to be identified related to a target domain. The apparatus 900 may further include a text to be identified identification module 904 configured to identify, using a trained key phrase identification model, key phrases in the text to be identified that are related to the target domain.
According to embodiments of the present application, an electronic device and a readable storage medium and a computer program product are also provided.
Fig. 10 shows a block diagram of an electronic device 1000 for a method of training a key phrase identification model in accordance with an embodiment of the present application. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular telephones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions are meant to be exemplary only and are not meant to limit implementations of the application described and/or claimed herein.
As shown in fig. 10, the electronic device includes: one or more processors 1001, a memory 1002, and interfaces for connecting the components, including a high-speed interface and a low-speed interface. The various components are interconnected using different buses and may be mounted on a common motherboard or in other manners as desired. The processor may process instructions executed within the electronic device, including instructions stored in or on the memory to display graphical information of a GUI on an external input/output device, such as a display device coupled to an interface. In other embodiments, multiple processors and/or multiple buses may be used, if desired, along with multiple memories. Also, multiple electronic devices may be connected, each providing a portion of the necessary operations (e.g., as a server array, a set of blade servers, or a multiprocessor system). One processor 1001 is illustrated in fig. 10.
Memory 1002 is a non-transitory computer-readable storage medium provided herein. Wherein the memory stores instructions executable by the at least one processor to cause the at least one processor to perform the method of training the key phrase identification model provided herein. The non-transitory computer readable storage medium of the present application stores computer instructions for causing a computer to perform the method of training the key phrase identification model provided herein.
The memory 1002 is used as a non-transitory computer readable storage medium, and may be used to store non-transitory software programs, non-transitory computer executable programs, and modules, such as program instructions/modules (e.g., the first training data acquisition module 802, the general training data acquisition module 804, and the model training module 806 shown in fig. 8) corresponding to a method for training a key phrase identification model in an embodiment of the present application. The processor 1001 executes various functional applications of the server and data processing, that is, implements the method of training the key phrase identification model in the above-described method embodiment, by running non-transitory software programs, instructions, and modules stored in the memory 1002.
Memory 1002 may include a storage program area that may store an operating system, at least one application program required for functionality, and a storage data area; the storage data area may store data created from the use of the electronic device that trains the key phrase identification model, and the like. In addition, the memory 1002 may include high-speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid-state storage device. In some embodiments, memory 1002 optionally includes memory remotely located with respect to processor 1001, which may be connected to the electronic device that trains the key phrase identification model over a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The electronic device of the method of training the key phrase identification model may further include: an input device 1003 and an output device 1004. The processor 1001, memory 1002, input device 1003, and output device 1004 may be connected by a bus or other means, for example by a bus connection in fig. 10.
The input device 1003 may receive input numeric or character information and generate key signal inputs related to user settings and function control of the electronic device that trains the key phrase identification model; examples include a touch screen, a keypad, a mouse, a track pad, a touch pad, a pointing stick, one or more mouse buttons, a track ball, a joystick, and the like. The output device 1004 may include a display device, auxiliary lighting devices (e.g., LEDs), tactile feedback devices (e.g., vibration motors), and the like. The display device may include, but is not limited to, a liquid crystal display (LCD), a light-emitting diode (LED) display, and a plasma display. In some implementations, the display device may be a touch screen.
Various implementations of the systems and techniques described here can be realized in digital electronic circuitry, integrated circuitry, application-specific integrated circuits (ASICs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be a special-purpose or general-purpose programmable processor, and that may receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.
These computer programs (also referred to as programs, software applications, or code) include machine instructions for a programmable processor and may be implemented in a high-level procedural and/or object-oriented programming language, and/or in assembly/machine language. As used herein, the terms "machine-readable medium" and "computer-readable medium" refer to any computer program product, apparatus, and/or device (e.g., magnetic disks, optical disks, memory, Programmable Logic Devices (PLDs)) used to provide machine instructions and/or data to a programmable processor, including a machine-readable medium that receives machine instructions as machine-readable signals. The term "machine-readable signal" refers to any signal used to provide machine instructions and/or data to a programmable processor.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and pointing device (e.g., a mouse or trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include local area networks (LANs), wide area networks (WANs), and the internet.
The computer system may include a client and a server. The client and server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.
By further training the key phrase identification model for the target domain using both training data related to the target domain and general training data, a model capable of identifying key phrases related to the target domain in text to be identified can be obtained. In this way, the key phrase identification model can be made accurate for the particular target domain using only a small amount of identified data in that domain, thereby reducing the collection and identification costs of data. In addition, the scheme is applicable to multiple different fields; key phrase identification models for those fields can be obtained simply by preparing training data corresponding to each field.
It should be appreciated that various forms of the flows shown above may be used to reorder, add, or delete steps. For example, the steps described in the present application may be performed in parallel, sequentially, or in a different order, provided that the desired results of the technical solutions disclosed in the present application can be achieved, and are not limited herein.
The above embodiments do not limit the scope of the application. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives are possible, depending on design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present application are intended to be included within the scope of the present application.

Claims (14)

1. A method of training a key phrase identification model, comprising:
acquiring first training data related to a target field, wherein key phrases related to the target field in first training texts of the first training data are identified;
acquiring general training data which is not related to the target field, wherein key phrases which are not related to the target field in general training texts of the general training data are identified; and
Training a key phrase identification model for the target domain based on the first training data and the general training data, for identifying a text to be identified related to the target domain;
the method further comprises the steps of:
acquiring second training data related to the target field, wherein second training texts of the second training data are not identified; and
training an initial identification model based on second training data to obtain the key phrase identification model for the target domain;
wherein obtaining the second training data comprises:
determining a plurality of key phrase types associated with the target domain;
determining a plurality of key phrase entries associated with the plurality of key phrase types from a database; and
obtaining the second training data based on the plurality of key phrase entries;
wherein determining the plurality of key phrase types comprises:
determining candidate key phrase types associated with the target domain;
determining candidate key-phrase entries associated with the candidate key-phrase type from the database;
determining an extended key phrase type associated with the target domain based on the candidate key phrase entry; and
The plurality of key phrase types is determined based on the candidate key phrase type and the extended key phrase type.
2. The method of claim 1, wherein obtaining the second training data based on the plurality of key phrase entries comprises:
extracting a corresponding plurality of descriptive text from the plurality of key phrase entries as the second training text; and
the second training data is generated based on the second training text.
3. The method of claim 1, wherein obtaining the first training data comprises:
acquiring the first training text related to the target field;
identifying the first training text so that key phrases related to the target field in the first training text are identified; and
the first training data is generated based on the identified first training text.
4. The method of claim 1, wherein obtaining the generic training data comprises:
acquiring a plurality of training texts with user labels;
identifying the plurality of training texts such that a corresponding plurality of key phrases in the plurality of training texts are identified;
Selecting the generic training text from the identified plurality of training texts based on the user tag and the plurality of key phrases; and
the generic training data is generated based on the generic training text.
5. The method of claim 1, wherein training the key phrase identification model comprises iteratively performing the following:
training the key phrase identification model with the first training data to update the key phrase identification model; and
training the updated key phrase identification model with the generic training data to again update the key phrase identification model.
6. A method for identifying key phrases in text to be identified, comprising:
acquiring a text to be identified related to a target field; and
the key-phrase identification model trained with the method according to any one of claims 1 to 5, identifying key-phrases in the text to be identified that are related to the target domain.
7. An apparatus for training a key phrase identification model, comprising:
a first training data acquisition module configured to acquire first training data related to a target domain, wherein key phrases related to the target domain in a first training text of the first training data are identified;
A generic training data acquisition module configured to acquire generic training data that is not related to the target domain, wherein key phrases in generic training text of the generic training data that are not related to the target domain are identified; and
a model training module configured to train a key phrase identification model for the target domain based on the first training data and the generic training data for identifying text to be identified that is related to the target domain;
the apparatus further comprises:
a second training data acquisition module configured to acquire second training data related to the target domain, a second training text of the second training data being unidentified; and
an initial model training module configured to train an initial identification model based on second training data to obtain the key phrase identification model for the target domain;
the second training data acquisition module further includes:
a key phrase type determination module configured to determine a plurality of key phrase types associated with the target domain;
a key phrase entry determination module configured to determine a plurality of key phrase entries associated with the plurality of key phrase types from a database; and
Wherein the second training data acquisition module is configured to acquire the second training data based on the plurality of key phrase entries;
wherein the key phrase type determination module further comprises:
a candidate key-phrase type determination module configured to determine a candidate key-phrase type associated with the target domain;
a candidate key phrase entry determination module configured to determine a candidate key phrase entry associated with the candidate key phrase type from the database;
an extended key phrase type determination module configured to determine an extended key phrase type associated with the target domain based on the candidate key phrase entry; and
Wherein the key phrase type determination module is configured to determine the plurality of key phrase types based on the candidate key phrase type and the extended key phrase type.
8. The apparatus of claim 7, wherein the second training data acquisition module further comprises:
a descriptive text extraction module configured to extract a corresponding plurality of descriptive texts from the plurality of key phrase entries as the second training text; and
A second training data generation module configured to generate the second training data based on the second training text.
9. The apparatus of claim 7, wherein the first training data acquisition module further comprises:
a first training text acquisition module configured to acquire the first training text related to the target field;
a first training text identification module configured to identify the first training text such that key phrases in the first training text that are related to the target domain are identified; and
a first training text generation module configured to generate the first training data based on the identified first training text.
10. The apparatus of claim 7, wherein the generic training data acquisition module further comprises:
a universal training text acquisition module configured to acquire a plurality of training texts having user tags;
a generic training text identification module configured to identify the plurality of training texts such that a corresponding plurality of key phrases in the plurality of training texts are identified;
a generic training text selection module configured to select the generic training text from the identified plurality of training texts based on the user tag and the plurality of key phrases; and
A generic training text generation module configured to generate the generic training data based on the generic training text.
11. The apparatus of claim 7, wherein the model training module further comprises:
a first training module configured to train the key phrase identification model with the first training data to update the key phrase identification model; and
a second training module configured to train the updated key phrase identification model with the generic training data to again update the key phrase identification model;
wherein the first training module and the second training module are operated iteratively.
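The iterative operation of the first and second training modules in claim 11 amounts to alternating fine-tuning passes. A minimal sketch, with the model reduced to a dict that counts update passes; `train_on`, `train_iteratively`, and the round count are illustrative assumptions, not details from the patent.

```python
# Illustrative sketch of the alternating training in claim 11.
# The "model" is a stand-in: a dict that just counts training updates.

def train_on(model, data):
    """One training pass: update the model with a batch of data (toy stand-in)."""
    model = dict(model)
    model["updates"] = model.get("updates", 0) + len(data)
    return model

def train_iteratively(model, first_data, generic_data, rounds=3):
    """First module trains on domain data, second on generic data, iterated."""
    for _ in range(rounds):
        model = train_on(model, first_data)    # first training module updates the model
        model = train_on(model, generic_data)  # second training module updates it again
    return model

trained = train_iteratively({}, first_data=["d1", "d2"], generic_data=["g1"], rounds=3)
print(trained["updates"])  # 3 rounds x (2 + 1) = 9
```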
12. An apparatus for identifying key phrases in text to be identified, comprising:
a to-be-identified text acquisition module configured to acquire a text to be identified related to the target domain; and
a to-be-identified text identification module configured to identify key phrases in the text to be identified that are related to the target domain, using the key phrase identification model trained according to the method of any one of claims 1 to 5.
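At inference time (claim 12), the trained model is applied to a to-be-identified text to flag domain-related key phrases. A hypothetical usage sketch; the vocabulary-lookup `identify` function is a toy stand-in for the trained model, not the patented method.

```python
# Hypothetical inference sketch for claim 12: flag domain-related key phrases
# in a text to be identified. The vocabulary lookup stands in for the model.

DOMAIN_VOCAB = {"neural network", "knowledge graph"}  # assumed target-domain phrases

def identify(text):
    """Return the domain-related key phrases found in the text, sorted."""
    return sorted(p for p in DOMAIN_VOCAB if p in text.lower())

print(identify("A Knowledge Graph stores entities."))  # ['knowledge graph']
```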
13. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-6.
14. A non-transitory computer readable storage medium storing computer instructions for causing the computer to perform the method of any one of claims 1-6.
CN202010880346.0A 2020-08-27 2020-08-27 Method, apparatus, device and storage medium for training key phrase identification model Active CN112101020B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010880346.0A CN112101020B (en) 2020-08-27 2020-08-27 Method, apparatus, device and storage medium for training key phrase identification model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010880346.0A CN112101020B (en) 2020-08-27 2020-08-27 Method, apparatus, device and storage medium for training key phrase identification model

Publications (2)

Publication Number Publication Date
CN112101020A CN112101020A (en) 2020-12-18
CN112101020B true CN112101020B (en) 2023-08-04

Family

ID=73758091

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010880346.0A Active CN112101020B (en) 2020-08-27 2020-08-27 Method, apparatus, device and storage medium for training key phrase identification model

Country Status (1)

Country Link
CN (1) CN112101020B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112819099B (en) * 2021-02-26 2023-12-22 杭州网易智企科技有限公司 Training method, data processing method, device, medium and equipment for network model
CN112966513B (en) * 2021-03-05 2023-08-01 北京百度网讯科技有限公司 Method and apparatus for entity linking

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR101851791B1 (en) * 2017-12-22 2018-04-24 주식회사 마인드셋 Apparatus and method for computing domain diversity using domain-specific terms and high frequency general terms
CN109241330A (en) * 2018-08-20 2019-01-18 北京百度网讯科技有限公司 The method, apparatus, equipment and medium of key phrase in audio for identification
CN110162681A (en) * 2018-10-08 2019-08-23 腾讯科技(深圳)有限公司 Text identification, text handling method, device, computer equipment and storage medium
CN111104526A (en) * 2019-11-21 2020-05-05 新华智云科技有限公司 Financial label extraction method and system based on keyword semantics
CN111177368A (en) * 2018-11-13 2020-05-19 国际商业机器公司 Tagging training set data
CN111178045A (en) * 2019-10-14 2020-05-19 深圳软通动力信息技术有限公司 Automatic construction method of non-supervised Chinese semantic concept dictionary based on field, electronic equipment and storage medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Military target entity recognition method based on a hierarchical Bi-LSTM-CRF model; Xu Shukui; Cao Jinran; Informatization Research (06); full text *
Research on domain adaptation of translation models based on semantic distribution similarity; Yao Liang; Hong Yu; Liu Hao; Liu Le; Yao Jianmin; Journal of Shandong University (Natural Science) (07); full text *

Also Published As

Publication number Publication date
CN112101020A (en) 2020-12-18

Similar Documents

Publication Publication Date Title
CN111967262B (en) Determination method and device for entity tag
CN111539514B (en) Method and apparatus for generating a structure of a neural network
CN112560912B (en) Classification model training method and device, electronic equipment and storage medium
CN111639710A (en) Image recognition model training method, device, equipment and storage medium
CN111709247B (en) Data set processing method and device, electronic equipment and storage medium
CN111221983A (en) Time sequence knowledge graph generation method, device, equipment and medium
CN110991196B (en) Translation method and device for polysemous words, electronic equipment and medium
CN111539223A (en) Language model training method and device, electronic equipment and readable storage medium
CN111125335A (en) Question and answer processing method and device, electronic equipment and storage medium
CN111274407B (en) Method and device for calculating triplet confidence in knowledge graph
CN111104514B (en) Training method and device for document tag model
CN112101020B (en) Method, apparatus, device and storage medium for training key phrase identification model
CN113553414B (en) Intelligent dialogue method, intelligent dialogue device, electronic equipment and storage medium
CN111950291A (en) Semantic representation model generation method and device, electronic equipment and storage medium
CN111241838B (en) Semantic relation processing method, device and equipment for text entity
CN112541362B (en) Generalization processing method, device, equipment and computer storage medium
CN111984774B (en) Searching method, searching device, searching equipment and storage medium
CN110019849B (en) Attention mechanism-based video attention moment retrieval method and device
CN111797216B (en) Search term rewriting method, apparatus, device and storage medium
CN111582477B (en) Training method and device for neural network model
CN112508004A (en) Character recognition method and device, electronic equipment and storage medium
CN111753964A (en) Neural network training method and device
CN111860580B (en) Identification model acquisition and category identification method, device and storage medium
CN112328896B (en) Method, apparatus, electronic device, and medium for outputting information
CN111966782B (en) Multi-round dialogue retrieval method and device, storage medium and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant