CN113407713A - Corpus mining method and apparatus based on active learning and electronic device - Google Patents

Corpus mining method and apparatus based on active learning and electronic device

Info

Publication number
CN113407713A
Authority
CN
China
Prior art keywords
corpus
classification
gram
cold
corpora
Prior art date
Legal status
Granted
Application number
CN202011141662.2A
Other languages
Chinese (zh)
Other versions
CN113407713B (en)
Inventor
习自
赵学敏
Current Assignee
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202011141662.2A
Publication of CN113407713A
Application granted
Publication of CN113407713B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/374 Thesaurus
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The embodiments of the present application provide a corpus mining method and apparatus based on active learning, and an electronic device, relating to the field of artificial intelligence. The method comprises the following steps: acquiring unlabeled corpora; classifying the unlabeled corpora by using at least two pre-trained corpus classification models, to obtain a first classification type and a classification score output by the at least two corpus classification models for classifying the unlabeled corpora; and selecting the unlabeled corpora of which the first classification types are inconsistent and the classification scores meet a preset condition as corpora to be labeled, and performing secondary classification processing on the corpora to be labeled to obtain a second classification type of the corpora to be labeled. This technical scheme helps widen the coverage of corpus mining and improve its generalization.

Description

Corpus mining method and apparatus based on active learning and electronic device
Technical Field
The present application relates to the field of artificial intelligence, and in particular to a corpus mining method and apparatus based on active learning, an electronic device, and a computer-readable storage medium.
Background
As people's expectations for quality of life rise, intelligent assistants such as the Tencent Cloud Xiaowei intelligent assistant have gradually appeared in our lives. A user may ask an intelligent assistant for relevant information through voice input, text input, and the like. Accurately understanding user requirements is a basic premise for an intelligent assistant to provide services; to raise the intelligence level of an intelligent assistant, corpus mining sometimes needs to be performed for the skills involved, so as to meet the different requirements that different users place on the assistant in different scenarios.
At present, corpus mining methods mainly include random selection, corpus mining according to keywords, corpus mining based on an active learning algorithm using marginal probability, and the like. Random selection means randomly sampling the unlabeled corpus set and handing the samples to annotators for labeling. Corpus mining according to keywords requires designing a number of keywords for each skill, mining the corpora containing those keywords from the unlabeled corpus set, and then giving them to annotators for labeling. The active learning algorithm based on marginal probability requires initializing a number of seed corpora, training a classification model on them, using the model to predict all unlabeled corpora and obtain their scores, and finally selecting the corpora whose scores fall near the decision threshold for annotators to label.
However, these corpus mining methods have the following problems. Randomly selected corpus mining is time-consuming, labor-intensive, and extremely inefficient. Corpus mining according to keywords improves efficiency to a certain extent, but it relies heavily on keyword selection, which easily skews the corpus distribution or misses the mining of some long-tail corpora. As for the corpus mining method based on the active learning algorithm using marginal probability, it tends to mine corpora similar to the seed corpora, making it difficult to expand the coverage of corpus mining.
Disclosure of Invention
The purpose of the present application is to solve at least one of the above technical defects, in particular the low efficiency of corpus mining and the difficulty of expanding the coverage of corpus mining results.
In a first aspect, a corpus mining method based on active learning is provided, including:
acquiring unlabeled corpora;
classifying the unlabeled corpora by using at least two pre-trained corpus classification models, to obtain a first classification type and a classification score which are output by the at least two corpus classification models for classifying the unlabeled corpora; and
selecting the unlabeled corpora of which the first classification types are inconsistent and the classification scores meet a preset condition as corpora to be labeled, and performing secondary classification processing on the corpora to be labeled to obtain a second classification type of the corpora to be labeled.
In a possible implementation manner, the corpus mining method based on active learning further includes:
and training the at least two classifiers based on the pre-configured cold start corpus serving as the training sample to obtain at least two corpus classification models.
In a possible implementation manner, the step of training at least two classifiers based on a pre-configured cold-start corpus serving as a training sample to obtain at least two corpus classification models includes:
acquiring a pre-configured cold start corpus serving as a training sample;
extracting N-gram text features of the cold-start corpus, and screening the N-gram text features to generate an N-gram dictionary of the cold-start corpus; wherein N is a positive integer greater than or equal to 1;
recording the corresponding position of the N-gram text feature in the N-gram dictionary as the feature expression of the cold-start corpus;
and training the at least two classifiers based on the feature expression by using a scalable machine learning library, to obtain at least two corpus classification models.
In one possible implementation, the step of screening the text features of the N-gram to generate the N-gram dictionary of the cold-start corpus comprises:
counting the occurrence frequency of the N-gram text features of the cold start corpus;
and screening out the N-gram text characteristics with the occurrence frequency within a preset frequency range to obtain the N-gram dictionary of the cold-start corpus.
In one possible implementation manner, the step of extracting the N-gram text features of the cold-start corpus comprises:
and extracting the N-gram text features of the cold-start corpus segment by segment according to a preset segment length N, based on a start identifier and an end identifier added in advance at the beginning position and the end position of the cold-start corpus.
In a possible implementation manner, the step of classifying the unlabeled corpus by using at least two corpus classification models to obtain a first classification type and a classification score output by the at least two corpus classification models includes:
extracting the N-gram text features of the unmarked corpus, and performing feature vectorization on the N-gram text features of the unmarked corpus to obtain the feature vector of the unmarked corpus;
classifying the unlabeled corpus by using at least two corpus classification models according to the feature vector of the unlabeled corpus to obtain a first classification type and a classification score which are output by the at least two corpus classification models and are used for classifying the unlabeled corpus.
In a possible implementation manner, the step of selecting the un-labeled corpus with the inconsistent first classification type and the classification score meeting the preset condition as the corpus to be labeled includes:
adding up the classification scores of each selected unlabeled corpus of which the first classification types are inconsistent to obtain a total score of the selected unlabeled corpus, and sorting the selected unlabeled corpora in descending order of the total score;
and according to the descending sorting result, taking a number of the top-ranked unlabeled corpora as the corpora to be labeled.
In a possible implementation manner, the step of performing secondary classification processing on the corpus to be labeled to obtain a second classification type of the corpus to be labeled includes:
performing secondary classification labeling according to the attributes of the corpora to be labeled to obtain newly labeled corpora;
and taking the result of the secondary classification labeling as the second classification type of the newly labeled corpora.
In a possible implementation manner, after the step of determining the second classification type of the corpus to be labeled, the method further includes:
and inputting the new labeled corpus and the cold start corpus as new training samples into the at least two classifiers, and returning to execute the step of training the at least two classifiers to obtain at least two corpus classification models.
In a second aspect, a corpus mining device based on active learning is provided, the device including:
an unlabeled corpus acquiring module, configured to acquire unlabeled corpora;
a first classification type obtaining module, configured to classify the unlabeled corpora by using at least two corpus classification models, to obtain a first classification type and a classification score which are output by the at least two corpus classification models for classifying the unlabeled corpora; and
a second classification type determining module, configured to select the unlabeled corpora of which the first classification types are inconsistent and the classification scores meet a preset condition as corpora to be labeled, and to perform secondary classification processing on the corpora to be labeled to obtain a second classification type of the corpora to be labeled.
In a third aspect, an electronic device is provided, which includes:
one or more processors;
a memory;
one or more application programs, wherein the one or more application programs are stored in the memory and configured to be executed by the one or more processors, the one or more programs being configured to perform the active learning-based corpus mining method described above.
In a fourth aspect, a computer-readable storage medium is provided, on which a computer program is stored, the computer program, when executed by a processor, implementing an active learning-based corpus mining method.
The beneficial effects brought by the technical scheme provided by the present application are as follows:
By acquiring unlabeled corpora and classifying them with at least two corpus classification models, a first classification type and a classification score output by each of the at least two corpus classification models for classifying the unlabeled corpora are obtained; the unlabeled corpora of which the first classification types are inconsistent and the classification scores meet the preset condition are selected as corpora to be labeled, and secondary classification processing is performed on them to obtain their second classification types. In this way, extended corpora that are related to, but not covered by, the cold-start corpus serving as the training sample are mined, which helps widen the coverage of corpus mining and improve its generalization.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings used in the description of the embodiments of the present application will be briefly described below.
Fig. 1 is a schematic diagram of an implementation environment related to a corpus mining method based on active learning according to an embodiment of the present application;
FIG. 2 is a flowchart of a corpus mining method based on active learning according to an embodiment of the present application;
FIG. 3 is a flowchart of a method for training a corpus classification model according to an embodiment of the present application;
fig. 4 is a schematic structural diagram of a corpus mining device based on active learning according to an embodiment of the present application;
fig. 5 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
Reference will now be made in detail to the embodiments of the present application, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or to elements having the same or similar functions throughout. The embodiments described below with reference to the drawings are exemplary, serve only to explain the present application, and are not to be construed as limiting it.
As used herein, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. It will be understood that when an element is referred to as being "connected" or "coupled" to another element, it can be directly connected or coupled to the other element, or intervening elements may also be present. Further, "connected" or "coupled" as used herein may include wirelessly connected or wirelessly coupled. As used herein, the term "and/or" includes all or any combination of one or more of the associated listed items.
The following describes an application scenario related to an embodiment of the present application.
With the research and progress of artificial intelligence technology, artificial intelligence is being developed and applied in many fields, such as smart homes, smart wearable devices, virtual assistants, smart speakers, smart marketing, unmanned driving, autonomous driving, unmanned aerial vehicles, robots, smart medical care, smart customer service, intelligent assistants, and the like.
The corpus mining method and apparatus based on active learning and the electronic device provided by this embodiment are applied to intelligent question-answering applications in artificial intelligence, such as intelligent assistants and intelligent customer service. In these intelligent question-answering applications, the user's question needs to be correctly understood in order to respond to it well.
In order to better explain the technical solution of the present application, a certain application environment to which the active learning-based corpus mining method of the present solution can be applied is shown below. Fig. 1 is a schematic diagram of an implementation environment related to a corpus mining method based on active learning according to an embodiment of the present application, and referring to fig. 1, the implementation environment may include: a terminal 101 and a server 102. The terminal 101 and the server 102 are communicatively connected.
Applications may be installed on the terminal 101, including applications capable of intelligent question answering, such as map navigation applications, social applications, life service applications, and the like. The embodiment of the present application does not specifically limit the type of the application.
The terminal 101 may be one terminal or a plurality of terminals. The terminal 101 includes at least one of a vehicle-mounted terminal, a smart phone, a smart television, a smart speaker, a tablet computer, an e-book reader, an MP3 (Moving Picture Experts Group Audio Layer III) player, an MP4 (Moving Picture Experts Group Audio Layer IV) player, a palm computer, a notebook computer, and a desktop computer.
The user plays a song, for example, by initiating a query to an application installed on the terminal 101. According to the query initiated by the user through voice input or text input, the application recognizes the user's query intention through the terminal 101 or the server 102, calls the corresponding function from the server 102 and performs data processing, and feeds the processing result back to the terminal 101, which plays the song through a preset music playing program. Alternatively, the user initiates a query to an application installed on the terminal 101, for example, "How is the weather in Chengdu?" The server 102 obtains the weather information of Chengdu according to the user's query and feeds it back to the user by means of text display or voice broadcast.
Of course, the technical solution provided in the embodiments of the present application may also be applied to other scenarios, which are not listed here one by one.
Based on the above application scenarios, the corpora of intelligent question-answering applications such as intelligent assistants need to be expanded and learned, so as to satisfy users' queries under different requirements.
At present, related technologies mostly perform corpus mining by random selection, corpus mining according to keywords, corpus mining based on an active learning algorithm using marginal probability, and the like, so as to expand the corpus. Random selection means randomly sampling the unlabeled corpus set and handing the samples to annotators for labeling. Corpus mining according to keywords requires designing a number of keywords for each skill, mining the corpora containing those keywords from the unlabeled corpus set, and then giving them to annotators for labeling. For example, to mine corpora for a music skill, keywords such as "play", "listen", "music", and "song" may be specified, and corpora containing any of these keywords are then selected from the unlabeled corpus set, for example, "play a song" and "I want to listen to a piece of music". The active learning algorithm based on marginal probability requires initializing a number of seed corpora, training a classification model on them, using the model to predict all unlabeled corpora, and finally selecting the corpora whose scores fall near the decision threshold for annotators to label.
However, the corpus mining methods described above mine, from the unlabeled corpus set, corpora that are similar or homogeneous to the existing corpus, and cannot provide the generalization needed in corpus mining, that is, mining other corpora that belong to the same skill as the existing corpus but are not covered by it. For example, the existing training corpus of a music skill may cover requirements such as playing, downloading, and searching, but not the "sharing" requirement; "sharing" corpora belonging to the same music skill should then be mined from the unlabeled corpus set and supplemented into the corresponding corpus.
The active learning-based corpus mining method and device and the electronic equipment aim to solve the technical problems in the prior art.
The following describes the technical solutions of the present application and how to solve the above technical problems with specific embodiments. The following several specific embodiments may be combined with each other, and details of the same or similar concepts or processes may not be repeated in some embodiments.
To make the objects, technical solutions and advantages of the present application more clear, embodiments of the present application will be described in further detail below with reference to the accompanying drawings.
The scheme provided by the embodiment of the application relates to a corpus mining method, a corpus mining device and electronic equipment based on active learning, and further relates to a computer-readable storage medium, which is specifically explained by the following embodiments:
fig. 2 is a flowchart of a corpus mining method based on active learning according to an embodiment of the present application, where the corpus mining method based on active learning is executed on a server.
Specifically, as shown in fig. 2, the corpus mining method based on active learning may include the following steps:
and S210, obtaining the unmarked linguistic data.
A corpus can be understood as a user's query sentence, entered by voice, text, picture, and the like. The unlabeled corpora may come from the corpora of users' historical queries recorded on a user platform, from user corpora obtained by a web crawler, and so on.
In this embodiment, the unlabeled corpora may be classified into two kinds, that is, divided into positive sample corpora, which meet the requirements of the set skill and satisfy the corpus requirements of a given application, and negative sample corpora, which do not satisfy the corpus requirements of that application.
S220, classifying the unlabeled corpus by using at least two pre-trained corpus classification models to obtain a first classification type and a classification score which are output by the at least two corpus classification models and are used for classifying the unlabeled corpus.
And classifying the unlabeled corpora by using at least two corpus classification models obtained by pre-training. In this embodiment, at least two classifiers may be trained based on a pre-configured cold-start corpus serving as a training sample, so as to obtain at least two corpus classification models.
Because corpus classification models obtained by training different classifiers classify corpora on different principles, their classification results for the same corpus may also differ. The unlabeled corpora are arranged into an unlabeled corpus sequence and input one by one into the at least two corpus classification models for classification, so as to obtain the first classification type and the classification score output by the at least two corpus classification models for classifying the unlabeled corpora.
In this embodiment, the first classification type is either positive sample corpus or negative sample corpus. The classification score represents the degree of correlation between an unlabeled corpus and the set skill annotated by the cold-start corpus: the higher the classification score, the higher the correlation between the unlabeled corpus and the positive sample corpora of the set skill annotated by the cold-start corpus, and the lower the score, the lower that correlation.
S230, selecting the unlabeled corpora of which the first classification types are inconsistent and the classification scores meet the preset condition as the corpora to be labeled, and performing secondary classification processing on the corpora to be labeled to obtain the second classification type of the corpora to be labeled.
For the same unlabeled corpus, the more consistent the first classification types output by the corpus classification models are, the more similar the type of the unlabeled corpus is to the type of the cold-start corpus serving as the training sample. Likewise, for the same unlabeled corpus, the classification scores output by the corpus classification models are added to obtain the total classification score of the unlabeled corpus; the higher the total classification score, the more similar the type of the unlabeled corpus is to that of the cold-start corpus serving as the training sample.
To illustrate this solution more clearly, a detailed description is given in conjunction with Table 1 below. Table 1 shows the classification results of the unlabeled corpora according to an embodiment.
TABLE 1 results of classification of unlabeled corpora
[Table 1 is reproduced as an image in the original publication. In it, one mark denotes that the first classification type output by a corpus classification model is a positive sample, and another mark denotes that the output first classification type is a negative sample.]
As can be seen from Table 1 above, the classification of the same corpus by the three corpus classification models (the SVM, LR, and Naive Bayes classification models) produces the following three kinds of results:
For the unlabeled corpora that the three corpus classification models unanimously pass (that is, unanimously regard as positive sample corpora), the higher the total classification score, the more similar they are to the cold-start corpus serving as the training sample. For example, corpora such as "play music" and "I want to listen to a song" are already covered by the cold-start corpus and are of no help in expanding the corpus, so such unlabeled corpora do not need to be mined.
For the unlabeled corpora that the three corpus classification models unanimously reject (that is, unanimously regard as negative sample corpora), the lower the total classification score, the less relevant they are to the set skill. For example, "check tomorrow's weather" obviously belongs to the weather skill rather than the music skill; such corpora do not help expand the corpus, and these unlabeled corpora do not need to be mined either.
For the unlabeled corpora on which the classification types output by the three corpus classification models are inconsistent, these are most likely either positive sample corpora that are not covered by the cold-start corpus serving as the training sample and need to be mined, or hard-to-distinguish negative sample corpora; they need to be classified further to determine whether they are extended corpora.
Based on this, in an embodiment, selecting in step S230 the unlabeled corpora of which the first classification types are inconsistent and the classification scores meet the preset condition as the corpora to be labeled may include the following steps:
s2301, adding the classification scores of the selected unanimous first classification type un-labeled corpora, calculating to obtain the total score of the selected un-labeled corpora, and sequencing the selected un-labeled corpora in a descending order according to the total score.
In this embodiment, the unlabeled corpora for which the first classification types output by the at least two corpus classification models are inconsistent are selected; for each selected unlabeled corpus, the classification scores output by the corpus classification models are added to obtain its total score, and the corpora are sorted in descending order of the total classification score. Table 2 shows the selected and sorted unlabeled corpora.
TABLE 2 unlabeled corpora selected and sorted
[Table 2 is reproduced as an image in the original publication.]
As can be seen from Table 2, among the unlabeled corpora for which the first classification types output by the three corpus classification models are inconsistent, some are positive sample corpora belonging to the music skill, for example, "listen to Zhou Jielun's (Jay Chou's) new song Mojito", and some are negative sample corpora with extremely high correlation to the music skill, for example, "download the song Qili". It should be noted that positive samples and negative samples are divided according to a standard preset by the user in advance; for example, if a certain standard classifies "download" as negative, then even a corpus related to the music skill belongs to the negative samples.
S2302, according to the descending sorting result, taking a number of the top-ranked unlabeled corpora as the corpora to be labeled.
After the unlabeled corpora of which the first classification types output by the at least two corpus classification models are inconsistent are sorted in descending order of the total classification score, a number of the top-ranked unlabeled corpora are taken as the corpora to be labeled and subjected to secondary classification processing, so as to further verify whether their classification types are positive sample corpora or negative sample corpora. The number of top-ranked unlabeled corpora taken can be set according to the actual situation, for example, the top 2000, as in the sketch below.
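As a concrete illustration of this selection rule, the following minimal Python sketch ranks the disagreement corpora by total score and keeps the top-ranked ones. The function and variable names are assumptions for illustration, not code from the patent; top_k=2000 mirrors the example above.

```python
def select_corpora_to_label(candidates, top_k=2000):
    """candidates: iterable of (corpus, first_types, scores) triples, where
    first_types and scores are the outputs of the committee of classification
    models for that corpus. Keeps the corpora the models disagree on, ranked
    by total classification score, and returns the top_k of them."""
    disputed = [
        (corpus, sum(scores))
        for corpus, first_types, scores in candidates
        if len(set(first_types)) > 1  # inconsistent first classification types
    ]
    disputed.sort(key=lambda pair: pair[1], reverse=True)  # descending total score
    return [corpus for corpus, _ in disputed[:top_k]]
```

This is essentially the query-by-committee criterion of active learning: only the corpora on which the committee of models cannot agree are sent for labeling.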
Further, secondary classification labeling is performed according to the attributes of the corpora to be labeled to obtain newly labeled corpora, and the result of the secondary classification labeling is taken as the second classification type of the newly labeled corpora.
In one embodiment, the secondary classification labeling may be performed on the corpora to be labeled by manual classification labeling; in another embodiment, it may also be performed by other corpus classification models. Binary classification labeling is carried out according to whether the attributes of the corpus to be labeled satisfy a preset support rule; for example, if the function can be supported by the intelligent assistant (e.g., Tencent Cloud Xiaowei), the corpus is a positive sample corpus, and otherwise it is a negative sample corpus.
In this embodiment, the corpora to be labeled, extracted for having inconsistent first classification types and ranked in the top 2000 of the descending order, are manually classified and labeled to obtain the second classification type of each corpus to be labeled; the second classification type may be the same as or different from the first classification type. Table 3 shows the classification results of the corpora to be labeled after secondary classification labeling.
TABLE 3 result of classifying the corpus to be labeled by the secondary classification labeling
[Table 3 is reproduced as images in the original publication.]
After secondary classification labeling, M positive sample corpora and N negative sample corpora are obtained from the corpora to be labeled; the positive sample corpora are imported into the positive sample corpus set and the negative sample corpora into the negative sample corpus set for storage, and one round of corpus mining is completed.
In this embodiment, the second classification type is used as the new classification label of each corpus to be labeled, and labeling the corpus to be labeled with its second classification type yields the newly labeled corpus.
In the corpus mining method based on active learning provided by this embodiment, unlabeled corpora are acquired and classified by at least two corpus classification models to obtain the first classification type and classification score output by each model for classifying the unlabeled corpora; the unlabeled corpora of which the first classification types are inconsistent and the classification scores meet the preset condition are selected as corpora to be labeled, and secondary classification processing is performed on them to obtain their second classification types. In this way, a number of corpora to be labeled are selected, according to their classification scores, from the unlabeled corpora on which the at least two corpus classification models disagree; these corpora are likely not to be covered by the existing corpus, which helps expand the positive sample corpora related to the set skill, supplement the positive sample corpus set, and widen the coverage of corpus mining. Furthermore, only the corpora to be labeled whose classification scores meet the preset condition are selected for secondary classification processing, which reduces the workload of secondary classification and improves corpus mining efficiency.
In order to more clearly illustrate the technical solution of the present application, the following further describes a plurality of steps of the corpus mining method based on active learning.
Fig. 3 is a flowchart of a method for training a corpus classification model according to an embodiment of the present application. As shown in Fig. 3, in an embodiment, the corpus mining method based on active learning further includes the following step:
S200, training at least two classifiers based on the pre-configured cold-start corpus serving as the training sample to obtain at least two corpus classification models.
In one embodiment, the corpus classification model may be trained by:
s2001, acquiring a pre-configured cold start corpus serving as a training sample.
The cold-start corpus is the corpus used to initially train the classifiers, namely a number of corpora initialized for a set skill, such as a music skill. The cold-start corpus may be obtained by manual writing or by machine writing. A classifier predicts which skill, intention, or field a corpus belongs to based on what the learning algorithm has learned; in this embodiment, the classifier may be a corpus classifier, a semantic classifier, or the like.
For example, a music skill is initialized with 2000 cold-start corpora by manual writing, such as: "play music", "play a song for me", "I want to listen to Later", "Andy Lau's Ice Rain", "the theme song of a certain film", and so on. To ensure the corpus mining effect, the cold-start corpus should cover as far as possible the intentions users may involve during use, such as "play", "search singer", "search lyrics", "add to favorites", "pause", and the like.
In this embodiment, a number of corpora related to the set skill are used as positive sample corpora, and the corpora of the remaining skills are used as negative sample corpora. For example, the 2000 initialized music-skill corpora serve as positive sample corpora, and the corpora of the remaining skills, such as weather, news, and film and television, serve as negative sample corpora. The cold-start corpus containing the positive and negative sample corpora is input into at least two classifiers, the at least two classifiers are trained to learn the positive and negative classification types of the corpora, and the corpus classification models corresponding to the at least two classifiers are obtained.
In this embodiment, the classifiers include at least two classifiers, such as a Support Vector Machine (SVM) classifier, a Logistic Regression (LR) classifier, and a Naive Bayes (NB) classifier. When training a classifier, the Application Programming Interface (API) provided by Spark's machine learning library (MLlib) may be used, and the default parameters provided by MLlib are selected for training.
The SVM classifier, the LR classifier, and the NB classifier are all common classifiers, and their classification principles are not described in detail here. Of course, in other embodiments, other classifiers may also be used for training and learning on the cold-start samples serving as training samples.
S2002, extracting N-gram text features of the cold-start corpus, and screening the N-gram text features to generate an N-gram dictionary of the cold-start corpus.
The basic idea of the N-gram algorithm is to segment the text content with a sliding window of size N to form a sequence of byte segments of length N, each segment being called a gram; frequency statistics are then computed over all the grams, which are filtered according to a preset threshold to form a list of key grams. This list is the feature vector space of the text content, and each gram in the list is one feature dimension. N in N-gram is a positive integer, and the N-gram can be a 1-gram, a 2-gram, a 3-gram, and so on.
Since the same N-gram text feature may appear at different positions in different corpora (for example, the 1-gram feature translated here as "play" appears at the beginning of the corpus "play a song" but at a non-initial position, in the original Chinese, of the corpus "play music"), in this embodiment a start identifier "B" and an end identifier "E" are added at the beginning position and the end position of the cold-start corpus respectively, so that the extracted text features carry certain position information.
In one embodiment, extracting the N-gram text features of the cold-start corpus comprises the following implementation modes:
extracting the N-gram text features of the cold-start corpus segment by segment according to the preset segment length N, based on the start identifier and the end identifier added in advance at the beginning position and the end position of the cold-start corpus, where N is a positive integer greater than or equal to 2.
In this embodiment, a start identifier is added in advance at the beginning of each cold-start corpus, and an end identifier is added at its end. For example, a start identifier "B" may be added at the beginning position and an end identifier "E" at the end position of each cold-start corpus; adding the identifiers to the cold-start corpus "play a song" yields "B play a song E".
When the start identifier is detected, the position of the identifier is determined as the beginning position of the corpus, or the character position next to it is determined as the beginning position. When the end identifier is detected, its position is determined as the end position of the corpus, or the character position just before it is determined as the end position. For example, when the character preceding "play" in a 2-gram text feature is detected to be "B", it can be determined that the word "play" is located at the beginning position of the corpus; when the character following "song" in a 2-gram text feature is detected to be "E", it can be determined that the word "song" is located at the end position.
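A minimal Python sketch of this segment-by-segment extraction, under the assumption that each corpus is a string of Chinese characters so that one character is one token (the helper name and the single-character markers "B"/"E" follow the description above):

```python
def extract_ngrams(corpus: str, max_n: int = 3) -> list[str]:
    """Wrap the corpus with the start identifier 'B' and the end identifier 'E',
    then slide a window of each length 1..max_n over it one position at a time."""
    marked = "B" + corpus + "E"
    grams = []
    for n in range(1, max_n + 1):
        for i in range(len(marked) - n + 1):
            gram = marked[i:i + n]
            if gram in ("B", "E"):  # a bare identifier carries no corpus content
                continue
            grams.append(gram)
    return grams
```

For a four-character corpus this yields 4 one-grams, 5 two-grams, and 4 three-grams, 13 features in total, consistent with the 13-element feature vector in the later example.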
The 1-gram, 2-gram, and 3-gram text features of the cold-start corpus are extracted separately, as shown in Table 4.
TABLE 4 N-gram text features of the corpus "play a song"
[Table 4 is reproduced as images in the original publication; it lists the 1-gram, 2-gram, and 3-gram features extracted from "B play a song E".]
Of course, in other embodiments, the text features of the cold-start corpus may also be extracted by a frequency method, TF-IDF (term frequency-inverse document frequency) features, a mutual information method, N-gram, Word2Vec, and the like.
In one embodiment, the step of screening the N-gram text features in step S2002 to generate the N-gram dictionary of the cold-start corpus may include the following steps:
and S2002-a, counting the occurrence frequency of the N-gram text features of the cold start corpus.
And for each cold-start corpus serving as a training sample, extracting the corresponding N-gram text features to obtain the N-gram text feature set of the cold-start corpus.
The occurrence frequency of each N-gram text feature is counted over the N-gram text feature set; for example, the 1-gram text feature "play" occurs 100 times, the 1-gram feature "song" occurs 500 times, a 2-gram feature occurs 100 times, a 3-gram feature occurs 50 times, and so on.
S2002-b, screening out the N-gram text features with the occurrence frequency within a preset frequency range to obtain an N-gram dictionary corresponding to the cold-start corpus.
In this embodiment, the N-gram text features whose occurrence frequency is lower than a first preset threshold or higher than a second preset threshold are filtered out, and the N-gram text features whose occurrence frequency lies within the preset frequency range are screened out to obtain the N-gram dictionary of the cold-start corpus. The N-gram dictionary, also known as the N-gram model core dictionary, refers to the set of core N-gram text features whose occurrence frequency lies within the preset frequency range.
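Continuing the sketch, the frequency counting and screening might look as follows; the threshold values are illustrative assumptions, since the patent only speaks of a preset frequency range:

```python
from collections import Counter

def build_ngram_dictionary(cold_start_corpora, min_freq=5, max_freq=10_000, max_n=3):
    """Count every N-gram over the whole cold-start corpus set and keep only the
    grams whose occurrence frequency lies within [min_freq, max_freq]; each kept
    gram is mapped to its (1-based) position, which later serves as its index."""
    counts = Counter()
    for corpus in cold_start_corpora:
        counts.update(extract_ngrams(corpus, max_n))
    kept = [gram for gram, freq in counts.items() if min_freq <= freq <= max_freq]
    return {gram: position for position, gram in enumerate(kept, start=1)}
```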
S2003, recording the corresponding position of each N-gram text feature in the N-gram dictionary as the feature expression of the cold-start corpus.
The N-gram dictionary may be represented in the form of an array, with each element in the array representing one N-gram text feature in the N-gram dictionary. The position of each element in the N-gram dictionary can be represented by an index value; for example, if the N-gram dictionary is a 3 x 3 array, the positions of its N-gram text features are represented by the index values 1 through 9.
In this embodiment, each N-gram text feature of each cold-start corpus is obtained, and the position of each N-gram text feature in the N-gram dictionary is determined. It should be noted that if an N-gram text feature of a cold-start corpus has no corresponding text feature in the N-gram dictionary, that N-gram text feature of the cold-start corpus is discarded and not recorded.
For example, the 1-gram text feature "play" of a cold-start corpus corresponds to the 1st position of the N-gram dictionary and is recorded as "1"; its 1-gram feature "song" corresponds to the 3rd position and is recorded as "3"; and one of its 2-gram features corresponds to the 57th position and is recorded as "57". Proceeding in this way, the corresponding position in the N-gram dictionary of every N-gram text feature of each cold-start corpus is recorded, yielding the feature expression of that corpus. The feature expression can be represented by an ordered array, such as an ordered one-dimensional array, in which each numeric element is the index value of the position of one of the corpus's N-gram text features in the N-gram dictionary.
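The feature expression step then reduces to a dictionary lookup; a sketch under the same assumptions as above:

```python
def feature_expression(corpus, ngram_dict, max_n=3):
    """Map a corpus to the ordered array of index values of its N-gram text
    features in the N-gram dictionary; out-of-dictionary grams are discarded."""
    positions = {ngram_dict[g] for g in extract_ngrams(corpus, max_n) if g in ngram_dict}
    return sorted(positions)
```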
S2004, training the at least two classifiers based on the feature expression by using a scalable machine learning library, to obtain at least two corpus classification models.
The at least two classifiers are connected to the Application Programming Interface (API) provided by Spark's machine learning library (MLlib), the corresponding parameters and functions are called, and the feature expressions of the cold-start corpora are input into the at least two classifiers for training, yielding at least two corpus classification models. If the trained classifiers are a Support Vector Machine (SVM) classifier, a Logistic Regression (LR) classifier, and a Naive Bayes (NB) classifier, the trained corpus classification models are the SVM, LR, and NB corpus classification models, respectively.
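A sketch of this training step against PySpark's DataFrame-based MLlib API. The helper names and cold_start_samples (a list of (corpus, label) pairs, with label 1.0 for positive samples) are assumptions; LinearSVC is used as a stand-in for the SVM classifier, since that is what the modern pyspark.ml API provides:

```python
from pyspark.sql import SparkSession
from pyspark.ml.linalg import Vectors
from pyspark.ml.classification import LinearSVC, LogisticRegression, NaiveBayes

spark = SparkSession.builder.appName("corpus-mining").getOrCreate()

def to_row(corpus, label, ngram_dict):
    """Encode a corpus as a sparse binary vector over the N-gram dictionary."""
    dim = len(ngram_dict)
    indices = [p - 1 for p in feature_expression(corpus, ngram_dict)]  # 0-based
    return (Vectors.sparse(dim, indices, [1.0] * len(indices)), float(label))

train_df = spark.createDataFrame(
    [to_row(corpus, label, ngram_dict) for corpus, label in cold_start_samples],
    ["features", "label"],
)

# Train all three classifiers with MLlib's default parameters, as the text suggests.
models = [est.fit(train_df) for est in (LinearSVC(), LogisticRegression(), NaiveBayes())]
```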
In an embodiment, the classifying the unlabeled corpus by using at least two corpus classification models in step S220 to obtain a first classification type and a classification score output by the at least two corpus classification models may include the following steps:
S2201, extracting the N-gram text features of the unlabeled corpus, and performing feature vectorization on the N-gram text features of the unlabeled corpus to obtain the feature vector of the unlabeled corpus.
In this embodiment, N in the N-gram is a positive integer, and the N-gram can be a 1-gram, a 2-gram, a 3-gram, and so on. The 1-gram, 2-gram, and 3-gram text features of the unlabeled corpus are extracted respectively, and a start identifier "B" and an end identifier "E" are added at the beginning position and the end position of the unlabeled corpus, so that the extracted text features carry certain position information.
The N-gram text features of the unlabeled corpus are extracted, the positions of these N-gram text features in the N-gram dictionary are determined, and feature vectorization is performed on the N-gram text features of the unlabeled corpus to obtain the feature vector of the unlabeled corpus.
For example, for the unlabeled corpus "play some music" (a four-character query in the original Chinese; with the identifiers added it becomes "B play some music E"), the extracted N-gram text features are the single characters (1-grams), the adjacent character pairs including the boundary identifiers (2-grams), and the adjacent character triples (3-grams), 13 features in total.
The position, i.e., the corresponding index value, of each N-gram text feature of the unlabeled corpus "play some music" in the N-gram dictionary is determined, as shown in Table 5:
TABLE 5 N-gram text features of the unlabeled corpus "play some music" and their positions
[Table 5 is reproduced as an image in the original publication; it maps each of the 13 extracted N-gram features to its index value in the N-gram dictionary.]
Feature vectorization of the positions of these N-gram text features in the N-gram dictionary yields the feature vector of the unlabeled corpus, namely [1, 3, 25, 26, 49, 57, 98, 109, 125, 198, 247, 305, 313].
S2202, classifying the unlabeled corpus by using at least two corpus classification models according to the feature vector of the unlabeled corpus to obtain a first classification type and a classification score which are output by the at least two corpus classification models and are used for classifying the unlabeled corpus.
The feature vectors of the unlabeled corpora are input into the at least two corpus classification models to classify the unlabeled corpora, obtaining the first classification type and the classification score output by the at least two corpus classification models for the unlabeled corpora. For example, the unlabeled corpus "play some music" is input into the SVM, LR, and NB corpus classification models, and the output classification scores are 0.98, 0.95, and 0.96 respectively, while the output first classification types are +, +, and +, where "+" indicates that the first classification type is a positive sample and "-" indicates that it is a negative sample.
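Reading the first classification types and scores back out of the three models might look as follows, continuing the assumed names above (one caveat: LR and NB expose a positive-class probability, whereas LinearSVC exposes only an uncalibrated raw margin, used here as a crude stand-in score):

```python
unlabeled_df = spark.createDataFrame(
    [to_row("play some music", 0.0, ngram_dict)],  # the label here is a dummy
    ["features", "label"],
)

first_types, scores = [], []
for model in models:
    row = model.transform(unlabeled_df).first()
    first_types.append(int(row.prediction))  # 1 = positive sample, 0 = negative
    if "probability" in row.__fields__:      # LR and NB expose class probabilities
        scores.append(float(row.probability[1]))
    else:                                    # LinearSVC exposes only a raw margin
        scores.append(float(row.rawPrediction[1]))

inconsistent = len(set(first_types)) > 1     # committee disagreement on this corpus
total_score = sum(scores)
```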
In an embodiment, after determining the second classification type of the corpus to be labeled in step S230, the method further includes the following steps:
s240, taking the new labeled corpus and the cold-start corpus as new training samples, inputting the new labeled corpus and the cold-start corpus into the at least two classifiers, and returning to execute the step of training the at least two classifiers to obtain at least two corpus classification models.
In this embodiment, the newly labeled corpora obtained from the previous round of corpus mining, together with the corpora previously used as training samples, such as the cold-start corpus, are input into the at least two classifiers as new training samples; the classifiers are retrained by executing the training step S200 again to obtain at least two updated corpus classification models, and steps S210 to S230 are repeated on the next batch of unlabeled corpora with the updated models to complete the next round of corpus mining. Iterating in this way over multiple rounds improves the efficiency and accuracy of corpus mining. The overall loop is sketched below.
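One possible shape of the overall multi-round loop (steps S200 through S240), with train_models, classify_pool, and human_label standing in for the steps described above (these names are assumptions, not APIs defined by the patent):

```python
def mine_corpora(cold_start_samples, unlabeled_pool, rounds=3):
    """Iterative active-learning corpus mining: train, score, select, label, retrain."""
    labeled = list(cold_start_samples)               # (corpus, label) pairs
    for _ in range(rounds):
        models = train_models(labeled)               # S200: (re)train the classifiers
        candidates = classify_pool(models, unlabeled_pool)    # S210-S220
        to_label = select_corpora_to_label(candidates)        # S230
        newly_labeled = human_label(to_label)        # secondary classification labeling
        labeled.extend(newly_labeled)                # S240: feed back as training samples
        chosen = set(to_label)
        unlabeled_pool = [c for c in unlabeled_pool if c not in chosen]
    return labeled
```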
The above examples are merely used to assist in explaining the technical solutions of the present disclosure, and the drawings and specific flows related thereto do not constitute a limitation on the usage scenarios of the technical solutions of the present disclosure.
The following describes in detail a related embodiment of the corpus mining device based on active learning.
Fig. 4 is a schematic structural diagram of a corpus mining device based on active learning according to an embodiment of the present application. As shown in Fig. 4, the corpus mining device 200 based on active learning may include: an unlabeled corpus acquiring module 210, a first classification type obtaining module 220, and a second classification type determining module 230, wherein:
an unlabeled corpus acquiring module 210, configured to acquire unlabeled corpora;
a first classification type obtaining module 220, configured to classify the unlabeled corpus by using at least two pre-trained corpus classification models, so as to obtain a first classification type and a classification score, which are output by the at least two corpus classification models and are used for classifying the unlabeled corpus;
the second classification type determining module 230 is configured to select an un-labeled corpus, of which the first classification type is inconsistent and the classification score meets a preset condition, as a corpus to be labeled, and perform secondary classification processing on the corpus to be labeled to obtain a second classification type of the corpus to be labeled.
In the corpus mining device based on active learning provided by this embodiment, from the unlabeled corpora classified by at least two preset corpus classification models trained on the cold-start corpus, a number of unlabeled corpora of which the first classification types are inconsistent are selected as corpora to be labeled according to their classification scores. These corpora are likely not to be covered by the existing corpus, which helps expand the positive sample corpora related to the set skill, supplement the positive sample corpus set, and widen the coverage of corpus mining. Furthermore, only the corpora to be labeled whose classification scores meet the preset condition are selected for secondary classification processing, which reduces the workload of secondary classification and improves corpus mining efficiency.
In one possible implementation, the corpus mining device 200 based on active learning may further include: and the corpus classification model training module is used for training the at least two classifiers based on the pre-configured cold-start corpus serving as a training sample to obtain at least two corpus classification models.
The corpus classification model training module comprises: the system comprises a cold-start corpus obtaining unit, an N-gram dictionary generating unit, a feature expression obtaining unit and a corpus classification model obtaining unit;
the system comprises a cold start corpus acquiring unit, a training sample acquiring unit and a training unit, wherein the cold start corpus acquiring unit is used for acquiring a pre-configured cold start corpus which is used as a training sample; the N-gram dictionary generating unit is used for extracting N-gram text features of the cold-start corpus and screening the N-gram text features to generate an N-gram dictionary of the cold-start corpus; the feature expression obtaining unit is used for recording the corresponding position of the N-gram text feature in the N-gram dictionary as the feature expression of the cold-start corpus; and the corpus classification model obtaining unit is used for training the at least two classifiers by adopting an extensible machine learning library based on the characteristic expression to obtain at least two corpus classification models.
In one possible implementation, the N-gram dictionary generating unit includes: an occurrence frequency counting subunit and an N-gram dictionary obtaining subunit;
the occurrence frequency counting subunit is configured to count the occurrence frequency of the N-gram text features of the cold-start corpus; and the N-gram dictionary obtaining subunit is configured to screen out the N-gram text features whose occurrence frequency lies within the preset frequency range, to obtain the N-gram dictionary of the cold-start corpus.
In one possible implementation, the N-gram dictionary generating unit further includes a text feature extraction unit, configured to extract the N-gram text features of the cold-start corpus segment by segment according to a preset segment length N, based on a start identifier and an end identifier added in advance to the beginning and end of the cold-start corpus; a sketch of these units is given below.
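The patent discloses no source code, but the units above compose naturally into a small pipeline. The following Python sketch is a minimal, hypothetical illustration of that pipeline: it pads each corpus with start and end identifiers, extracts N-grams segment by segment, screens them by occurrence frequency into an N-gram dictionary, records dictionary positions as the feature expression, and trains two classifiers. The `^`/`$` identifiers, the frequency bounds, and the choice of scikit-learn's LogisticRegression and MultinomialNB as the "at least two classifiers" are all assumptions; the patent names neither the classifiers nor the extensible machine learning library.

```python
from collections import Counter

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB

START, END = "^", "$"  # assumed start/end identifiers


def extract_ngrams(text, n=2):
    """Extract N-grams segment by segment after padding the corpus
    with the start and end identifiers."""
    padded = START + text + END
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def build_ngram_dict(corpus_texts, n=2, min_freq=2, max_freq=10_000):
    """Keep only N-grams whose occurrence frequency falls within a preset
    range, and map each surviving N-gram to a dictionary position."""
    counts = Counter(g for text in corpus_texts for g in extract_ngrams(text, n))
    kept = [g for g, c in counts.items() if min_freq <= c <= max_freq]
    return {gram: pos for pos, gram in enumerate(kept)}


def feature_vector(text, ngram_dict, n=2):
    """Record the positions of the corpus's N-grams in the dictionary
    as its feature expression (a binary position vector)."""
    vec = np.zeros(len(ngram_dict))
    for gram in extract_ngrams(text, n):
        pos = ngram_dict.get(gram)
        if pos is not None:
            vec[pos] = 1.0
    return vec


def train_classifiers(cold_start_corpus, ngram_dict, n=2):
    """Train 'at least two classifiers' on (text, label) pairs from the
    cold-start corpus; the concrete models are illustrative stand-ins."""
    X = np.stack([feature_vector(text, ngram_dict, n) for text, _ in cold_start_corpus])
    y = np.array([label for _, label in cold_start_corpus])
    models = [LogisticRegression(max_iter=1000), MultinomialNB()]
    for model in models:
        model.fit(X, y)
    return models
```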
In one possible implementation, the first classification type obtaining module 220 includes: a feature vector obtaining unit and a first classification type obtaining unit;
the feature vector obtaining unit is configured to extract the N-gram text features of the unlabeled corpus and vectorize them to obtain the feature vector of the unlabeled corpus; and the first classification type obtaining unit is configured to classify the unlabeled corpus with the at least two corpus classification models according to that feature vector, so as to obtain the first classification type and the classification score that each model outputs for the unlabeled corpus.
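Continuing the same hypothetical sketch (reusing the `feature_vector` helper and the trained `models` from above), vectorizing an unlabeled corpus and collecting each model's first classification type and classification score could look like the following; `predict_proba` is used on the assumption that the chosen classifiers expose class probabilities.

```python
def classify_unlabeled(unlabeled_texts, models, ngram_dict, n=2):
    """For each unlabeled corpus, return every model's predicted class
    (first classification type) and top probability (classification score)."""
    results = []
    for text in unlabeled_texts:
        vec = feature_vector(text, ngram_dict, n).reshape(1, -1)
        per_model = []
        for model in models:
            probs = model.predict_proba(vec)[0]
            best = probs.argmax()
            per_model.append((model.classes_[best], float(probs[best])))
        results.append((text, per_model))
    return results
```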
In one possible implementation, the second classification type determining module 230 includes: an unlabeled corpus sorting unit and a to-be-labeled corpus obtaining unit;
the unlabeled corpus sorting unit is configured to add up the classification scores of the selected unlabeled corpora whose first classification types are inconsistent, so as to obtain a total score for each selected corpus, and to sort the selected corpora in descending order of total score; and the to-be-labeled corpus obtaining unit is configured to take a number of the top-ranked unlabeled corpora as the corpora to be labeled according to the descending sorting result.
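A minimal sketch of this selection step, operating on the `(text, per_model)` pairs produced by the previous sketch; the cut-off `top_k` is an assumed parameter, not a value taken from the patent.

```python
def select_corpora_to_label(results, top_k=100):
    """Keep corpora whose first classification types disagree across models,
    rank them by the sum of their classification scores (total score) in
    descending order, and return the top-ranked ones as corpora to be labeled."""
    candidates = []
    for text, per_model in results:
        predicted_types = {cls for cls, _ in per_model}
        if len(predicted_types) > 1:  # inconsistent first classification types
            total_score = sum(score for _, score in per_model)
            candidates.append((total_score, text))
    candidates.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in candidates[:top_k]]
```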
In one possible implementation, the second classification type determining module 230 includes: a new labeled corpus obtaining unit and a second classification type obtaining unit;
the new labeled corpus obtaining unit is configured to perform secondary classification labeling according to the attributes of the corpora to be labeled, so as to obtain new labeled corpora; and the second classification type obtaining unit is configured to take the result of the secondary classification labeling as the second classification type of the new labeled corpora.
In one possible implementation, the corpus mining device 200 further includes a returning module, configured to input the new labeled corpora together with the cold-start corpus into the at least two classifiers as new training samples, and to return to the step of training the at least two classifiers so as to obtain updated corpus classification models; a sketch of this loop follows.
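The returning module is what closes the active-learning loop. Below is a rough, hypothetical sketch of that loop built from the helpers above; `label_fn` stands in for the human secondary classification labeling, and the fixed number of `rounds` is an assumed stopping criterion.

```python
def active_learning_loop(cold_start_corpus, unlabeled_texts, label_fn, n=2, rounds=3):
    """Alternate between training on the current samples, classifying the
    unlabeled pool, selecting disagreement corpora, labeling them, and
    folding them back into the training samples."""
    training = list(cold_start_corpus)
    models = []
    for _ in range(rounds):
        ngram_dict = build_ngram_dict([text for text, _ in training], n)
        models = train_classifiers(training, ngram_dict, n)
        results = classify_unlabeled(unlabeled_texts, models, ngram_dict, n)
        to_label = select_corpora_to_label(results)
        # Secondary classification labeling: a human step in the patent.
        training.extend((text, label_fn(text)) for text in to_label)
        labeled = set(to_label)
        unlabeled_texts = [text for text in unlabeled_texts if text not in labeled]
    return models
```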
The corpus mining device based on active learning of this embodiment can execute the corpus mining method based on active learning shown in the foregoing embodiments of this application; the implementation principles are similar and are not repeated here.
An embodiment of the present application provides an electronic device, including a memory and a processor; at least one program is stored in the memory and, when executed by the processor, implements the foregoing method, thereby improving the generalization of corpus mining, widening the coverage of the corpus, and improving corpus mining efficiency.
In an alternative embodiment, an electronic device is provided. As shown in fig. 5, the electronic device 4000 comprises a processor 4001 and a memory 4003. The processor 4001 is coupled to the memory 4003, for example via a bus 4002. Optionally, the electronic device 4000 may further include a transceiver 4004, which may be used for data interaction between this electronic device and other electronic devices, such as transmitting and/or receiving data. In practical applications the transceiver 4004 is not limited to one, and the structure of the electronic device 4000 does not limit the embodiments of the present application.
The processor 4001 may be a CPU (Central Processing Unit), a general-purpose processor, a DSP (Digital Signal Processor), an ASIC (Application Specific Integrated Circuit), an FPGA (Field Programmable Gate Array) or other programmable logic device, a transistor logic device, a hardware component, or any combination thereof. It may implement or perform the various illustrative logical blocks, modules, and circuits described in connection with this disclosure. The processor 4001 may also be a combination that performs a computing function, for example a combination of one or more microprocessors, or a combination of a DSP and a microprocessor.
Bus 4002 may include a path that carries information between the aforementioned components. The bus 4002 may be a PCI (Peripheral Component Interconnect) bus, an EISA (Extended Industry Standard Architecture) bus, or the like. The bus 4002 may be divided into an address bus, a data bus, a control bus, and the like. For ease of illustration, only one thick line is shown in FIG. 5, but this is not intended to represent only one bus or type of bus.
The memory 4003 may be a ROM (Read-Only Memory) or another type of static storage device that can store static information and instructions, a RAM (Random Access Memory) or another type of dynamic storage device that can store information and instructions, an EEPROM (Electrically Erasable Programmable Read-Only Memory), a CD-ROM (Compact Disc Read-Only Memory) or other optical disc storage (including compact discs, laser discs, digital versatile discs, Blu-ray discs, etc.), a magnetic disk storage medium or other magnetic storage device, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer, but is not limited to these.
The memory 4003 is used for storing application code for executing the solution of the present application, and its execution is controlled by the processor 4001. The processor 4001 is configured to execute the application code stored in the memory 4003 to implement what is shown in the foregoing method embodiments.
Electronic devices include, but are not limited to: mobile terminals such as mobile phones, notebook computers, digital broadcast receivers, PDAs (personal digital assistants), PADs (tablet computers), PMPs (portable multimedia players) and in-vehicle terminals (e.g., in-vehicle navigation terminals), and fixed terminals such as digital TVs and desktop computers. The electronic device shown in fig. 5 is only an example and should not limit the functions and scope of use of the embodiments of the present disclosure.
The present application provides a computer-readable storage medium on which a computer program is stored; when run on a computer, the program enables the computer to execute the corresponding content of the foregoing method embodiments. Compared with the prior art, the embodiments of the present application can improve the generalization of corpus mining, widen the coverage of the corpus, and improve corpus mining efficiency.
Embodiments of the present application provide a computer program product or computer program comprising computer instructions stored in a computer-readable storage medium. A processor of a computer device, such as an electronic device, reads the computer instructions from the computer-readable storage medium and executes them, so that the computer device implements:
acquiring unlabeled corpora;
classifying the unlabeled corpus by using at least two pre-trained corpus classification models to obtain a first classification type and a classification score which are output by the at least two corpus classification models and are used for classifying the unlabeled corpus;
selecting the unlabeled corpora whose first classification types are inconsistent and whose classification scores meet the preset condition as the corpora to be labeled, and performing secondary classification processing on the corpora to be labeled to obtain their second classification types.
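Tying the earlier sketches together, a hypothetical end-to-end invocation might look as follows; the cold-start corpus, the unlabeled pool, and the labeling function are all made-up toy data for illustration only.

```python
# Toy cold-start corpus: (text, skill label) pairs, invented for illustration.
cold_start = [
    ("play some jazz music", "music"),
    ("turn on the radio", "music"),
    ("what's the weather tomorrow", "weather"),
    ("will it rain today", "weather"),
]
pool = ["put on a relaxing song", "is it sunny outside", "switch off the lamp"]


# Stand-in for the human secondary classification labeling step.
def label_fn(text):
    return "music" if "song" in text else "other"


models = active_learning_loop(cold_start, pool, label_fn, n=2, rounds=1)
```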
It should be noted that the computer readable medium in the present disclosure can be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present disclosure, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In contrast, in the present disclosure, a computer readable signal medium may comprise a propagated data signal with computer readable program code embodied therein, either in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: electrical wires, optical cables, RF (radio frequency), etc., or any suitable combination of the foregoing.
The computer readable medium may be embodied in the electronic device; or may exist separately without being assembled into the electronic device.
The computer readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to perform the methods shown in the above embodiments.
Computer program code for carrying out operations of the present disclosure may be written in any combination of one or more programming languages, including object-oriented programming languages such as Java, Smalltalk and C++, and conventional procedural programming languages such as the "C" programming language or similar languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The modules described in the embodiments of the present disclosure may be implemented by software or hardware. The name of the module does not in some cases form a limitation on the module itself, for example, the unlabeled corpus acquiring module may also be described as a "module acquiring unlabeled corpus".
It should be understood that, although the steps in the flowcharts of the figures are shown in the order indicated by the arrows, they are not necessarily performed in that order; unless explicitly stated herein, there is no strict ordering restriction, and the steps may be performed in other orders. Moreover, at least some of the steps in the flowcharts may include multiple sub-steps or stages, which are not necessarily completed at the same moment but may be executed at different times, and whose execution order is not necessarily sequential: they may be performed in turn or alternately with other steps, or with at least a portion of the sub-steps or stages of other steps.
The foregoing describes only some embodiments of the present invention. It should be noted that those skilled in the art can make various improvements and modifications without departing from the principle of the present invention, and such improvements and modifications shall also fall within the protection scope of the present invention.

Claims (10)

1. A corpus mining method based on active learning is characterized by comprising the following steps:
acquiring unlabeled corpora;
classifying the unlabeled corpus by using at least two pre-trained corpus classification models to obtain a first classification type and a classification score which are output by the at least two corpus classification models and are used for classifying the unlabeled corpus;
selecting the unlabeled corpora whose first classification types are inconsistent and whose classification scores meet a preset condition as the corpora to be labeled, and performing secondary classification processing on the corpora to be labeled to obtain a second classification type of the corpora to be labeled.
2. The corpus mining method based on active learning of claim 1, further comprising:
and training at least two classifiers based on a pre-configured cold-start corpus serving as training samples, so as to obtain the at least two corpus classification models.
3. The active learning-based corpus mining method according to claim 2, wherein the step of training at least two classifiers based on the pre-configured cold-start corpus serving as training samples to obtain the at least two corpus classification models comprises:
acquiring a pre-configured cold start corpus serving as a training sample;
extracting N-gram text features of the cold-start corpus, and screening the N-gram text features to generate an N-gram dictionary of the cold-start corpus, wherein N is a positive integer;
recording the corresponding position of the N-gram text feature in the N-gram dictionary as the feature expression of the cold-start corpus;
and training at least two classifiers by adopting an extensible machine learning library based on the feature expression to obtain at least two corpus classification models.
4. The active learning-based corpus mining method according to claim 3, wherein the step of screening the N-gram text features to generate the N-gram dictionary of the cold-start corpus comprises:
counting the occurrence frequency of the N-gram text features of the cold-start corpus;
and screening out the N-gram text features whose occurrence frequency is within a preset frequency range, so as to obtain the N-gram dictionary of the cold-start corpus.
5. The active learning-based corpus mining method according to claim 3, wherein said step of extracting N-gram text features of said cold-start corpus comprises:
and extracting the N-gram text features of the cold-start corpus segment by segment according to a preset segment length N, based on a start identifier and an end identifier added in advance to the beginning position and the ending position of the cold-start corpus.
6. The active learning-based corpus mining method according to claim 1, wherein the step of classifying the unlabeled corpus using at least two pre-trained corpus classification models to obtain the first classification type and the classification score output by the at least two corpus classification models comprises:
extracting the N-gram text features of the unlabeled corpus, and performing feature vectorization on them to obtain the feature vector of the unlabeled corpus;
and classifying the unlabeled corpus with the at least two corpus classification models according to the feature vector of the unlabeled corpus, so as to obtain the first classification type and the classification score that the at least two corpus classification models output for the unlabeled corpus.
7. The active learning-based corpus mining method according to claim 1, wherein the step of selecting the unlabeled corpora whose first classification types are inconsistent and whose classification scores meet the preset condition as the corpora to be labeled comprises:
adding up the classification scores of the selected unlabeled corpora whose first classification types are inconsistent to obtain a total score for each selected unlabeled corpus, and sorting the selected unlabeled corpora in descending order of total score;
and, according to the descending sorting result, taking a number of the top-ranked unlabeled corpora as the corpora to be labeled.
8. The active learning-based corpus mining method according to claim 1, wherein the step of performing secondary classification processing on the corpus to be labeled to obtain the second classification type of the corpus to be labeled comprises:
performing secondary classification labeling according to the attributes of the corpora to be labeled, so as to obtain new labeled corpora;
and taking the result of the secondary classification labeling as the second classification type of the new labeled corpora.
9. The active learning-based corpus mining method according to claim 8, wherein after the second classification type of the corpus to be labeled is obtained, the method further comprises:
and inputting the new labeled corpus and the cold-start corpus into the at least two classifiers as new training samples, and returning to the step of training the at least two classifiers to obtain the at least two corpus classification models.
10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, implements the active learning-based corpus mining method according to any one of claims 1-9.
CN202011141662.2A 2020-10-22 2020-10-22 Corpus mining method and device based on active learning and electronic equipment Active CN113407713B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011141662.2A CN113407713B (en) 2020-10-22 2020-10-22 Corpus mining method and device based on active learning and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011141662.2A CN113407713B (en) 2020-10-22 2020-10-22 Corpus mining method and device based on active learning and electronic equipment

Publications (2)

Publication Number Publication Date
CN113407713A true CN113407713A (en) 2021-09-17
CN113407713B CN113407713B (en) 2024-04-05

Family

ID=77677366

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011141662.2A Active CN113407713B (en) 2020-10-22 2020-10-22 Corpus mining method and device based on active learning and electronic equipment

Country Status (1)

Country Link
CN (1) CN113407713B (en)

Citations (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103559181A (en) * 2013-11-14 2014-02-05 苏州大学 Establishment method and system for bilingual semantic relation classification model
WO2015003143A2 (en) * 2013-07-03 2015-01-08 Thomson Reuters Global Resources Method and system for simplifying implicit rhetorical relation prediction in large scale annotated corpus
CN106202177A (en) * 2016-06-27 2016-12-07 腾讯科技(深圳)有限公司 A kind of file classification method and device
CN106844424A (en) * 2016-12-09 2017-06-13 宁波大学 A kind of file classification method based on LDA
CN107644235A (en) * 2017-10-24 2018-01-30 广西师范大学 Image automatic annotation method based on semi-supervised learning
US20180032870A1 (en) * 2015-10-22 2018-02-01 Tencent Technology (Shenzhen) Company Limited Evaluation method and apparatus based on text analysis, and storage medium
CN109992763A (en) * 2017-12-29 2019-07-09 北京京东尚科信息技术有限公司 Language marks processing method, system, electronic equipment and computer-readable medium
CN110008990A (en) * 2019-02-22 2019-07-12 上海拉扎斯信息科技有限公司 More classification methods and device, electronic equipment and storage medium
CN110209764A (en) * 2018-09-10 2019-09-06 腾讯科技(北京)有限公司 The generation method and device of corpus labeling collection, electronic equipment, storage medium
CN110209809A (en) * 2018-08-27 2019-09-06 腾讯科技(深圳)有限公司 Text Clustering Method and device, storage medium and electronic device
CN110298032A (en) * 2019-05-29 2019-10-01 西南电子技术研究所(中国电子科技集团公司第十研究所) Text classification corpus labeling training system
CN110751953A (en) * 2019-12-24 2020-02-04 北京中鼎高科自动化技术有限公司 Intelligent voice interaction system for die-cutting machine
CN111177374A (en) * 2019-12-13 2020-05-19 航天信息股份有限公司 Active learning-based question and answer corpus emotion classification method and system

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
ZIQI ZHANG ET AL.: "The LODIE team (University of Sheffield) Participation at the TAC2015 Entity Discovery Task of the Cold Start KBP Track", 《HTTPS://TAC.NIST.GOV/PUBLICATIONS/2015/PARTICIPANT.PAPERS/TAC2015.LODIE.PROCEEDINGS.PDF》, 30 November 2015 (2015-11-30), pages 1 - 11 *
AI Changqing: "Research and Application of a Hybrid Recommendation Algorithm Based on User Behavior and Item Content", China Master's Theses Full-text Database, Information Science and Technology Series, 15 February 2018 (2018-02-15), pages 138 - 2666 *

Also Published As

Publication number Publication date
CN113407713B (en) 2024-04-05

Similar Documents

Publication Publication Date Title
CN108804532B (en) Query intention mining method and device and query intention identification method and device
WO2022078102A1 (en) Entity identification method and apparatus, device and storage medium
US20200279002A1 (en) Method and system for processing unclear intent query in conversation system
CN108920666B (en) Semantic understanding-based searching method, system, electronic device and storage medium
US9875301B2 (en) Learning multimedia semantics from large-scale unstructured data
CN105786977B (en) Mobile search method and device based on artificial intelligence
CN110659366A (en) Semantic analysis method and device, electronic equipment and storage medium
CN111695345B (en) Method and device for identifying entity in text
CN110321537B (en) Method and device for generating file
US10671666B2 (en) Pattern based audio searching method and system
US20150279390A1 (en) System and method for summarizing a multimedia content item
US11954097B2 (en) Intelligent knowledge-learning and question-answering
CN107527619A (en) The localization method and device of Voice command business
CN111324700A (en) Resource recall method and device, electronic equipment and computer-readable storage medium
CN112906381B (en) Dialog attribution identification method and device, readable medium and electronic equipment
CN113128557B (en) News text classification method, system and medium based on capsule network fusion model
CN108345612A (en) A kind of question processing method and device, a kind of device for issue handling
CN111753126B (en) Method and device for video dubbing
CN112906380A (en) Method and device for identifying role in text, readable medium and electronic equipment
CN114003682A (en) Text classification method, device, equipment and storage medium
CN111428011B (en) Word recommendation method, device, equipment and storage medium
CN111078849A (en) Method and apparatus for outputting information
CN111444321B (en) Question answering method, device, electronic equipment and storage medium
CN109492126B (en) Intelligent interaction method and device
CN114298007A (en) Text similarity determination method, device, equipment and medium

Legal Events

Date Code Title Description
PB01 Publication
REG Reference to a national code (Ref country code: HK; Ref legal event code: DE; Ref document number: 40051736; Country of ref document: HK)
SE01 Entry into force of request for substantive examination
GR01 Patent grant