CN108491389B - Method and device for training click bait title corpus recognition model - Google Patents


Info

Publication number
CN108491389B
CN108491389B (application CN201810246454.5A)
Authority
CN
China
Prior art keywords
corpus
title
machine learning
learning model
click
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810246454.5A
Other languages
Chinese (zh)
Other versions
CN108491389A (en)
Inventor
祁斌川
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Netease Shuzhifan Technology Co ltd
Original Assignee
Hangzhou Langhe Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Langhe Technology Co Ltd filed Critical Hangzhou Langhe Technology Co Ltd
Priority to CN201810246454.5A
Publication of CN108491389A
Application granted; publication of CN108491389B
Legal status: Active

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 - Handling natural language data
    • G06F 40/30 - Semantic analysis
    • G06F 18/00 - Pattern recognition
    • G06F 18/20 - Analysing
    • G06F 18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 - Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F 18/285 - Selection of pattern recognition techniques, e.g. of classifiers in a multi-classifier system
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/08 - Learning methods

Abstract

The invention provides a method and apparatus for training a clickbait title corpus recognition model. The method comprises the following steps: inputting a corpus sample set into a machine learning model, wherein the corpus sample set comprises corpus samples identified as clickbait title corpus and random corpus samples, and the proportion of the samples identified as clickbait title corpus among all corpus samples in the set is smaller than a predetermined proportion threshold; determining whether the proportion of the corpus samples that the machine learning model recognizes as clickbait title corpus, relative to the number of corpus samples in the corpus sample set, satisfies a predetermined condition; and if the determined proportion satisfies the predetermined condition, deciding to stop training the machine learning model. The method and apparatus improve the efficiency of building a clickbait title corpus recognition model.

Description

Method and device for training click bait title corpus recognition model
Technical Field
The present invention relates to the field of communications, and in particular to a method and apparatus for training a clickbait title corpus recognition model.
Background
With the development of the internet, a large number of online news media (content producers, including professional media, self-media, and the like) have emerged on internet platforms. The revenue of such news media is proportional to the number of clicks readers make on the content they produce. To obtain a high click volume, such media therefore often work on the titles of the content they produce, fabricating titles entirely inconsistent with the content to attract readers' attention. Such a title is a clickbait title, commonly known as a 'headline party' title.
One prior-art method of identifying clickbait titles employs a machine learning model. A number of corpus samples manually pre-labeled as having clickbait titles and a number manually pre-labeled as not having clickbait titles are input into the model for training. The model extracts features from the samples and computes over those features with a constructed objective function to produce a recognition result: clickbait title or not. From the extracted features and the known recognition results, the model can learn the coefficients of the objective function, so that when a titled corpus is input into the trained model, the model outputs a recognition result. The drawback is that manually pre-labeling large numbers of samples with and without clickbait titles is labor-intensive, and labeling accuracy varies from person to person, so the efficiency of building the clickbait title corpus recognition model is low and its recognition accuracy is low.
Disclosure of Invention
It is an object of the present invention to improve the efficiency of building a clickbait title corpus recognition model.
According to a first aspect of embodiments of the present invention, a method for training a clickbait title corpus recognition model is disclosed, comprising:
inputting a corpus sample set into a machine learning model, wherein the corpus sample set comprises corpus samples identified as clickbait title corpus and random corpus samples, and the proportion of the samples identified as clickbait title corpus among all corpus samples in the set is smaller than a predetermined proportion threshold;
determining whether the proportion of the corpus samples that the machine learning model recognizes as clickbait title corpus, relative to the number of corpus samples in the corpus sample set, satisfies a predetermined condition;
and if the determined proportion satisfies the predetermined condition, deciding to stop training the machine learning model.
In one embodiment, the predetermined proportion threshold is 10%.
In one embodiment, the predetermined condition includes:
the determined proportion falls within an interval around a predetermined proportion.

In one embodiment, the interval around the predetermined proportion is an interval whose endpoints are the predetermined proportion minus a specific value and the predetermined proportion plus that specific value.
In one embodiment, the method further comprises:
and if the determined proportion does not meet the preset condition, reconstructing the corpus sample set and inputting the corpus sample set into the machine learning model until the determined proportion meets the preset condition.
In one embodiment, the predetermined proportion β is determined according to the following formula:
β=(M+N·α)/(M+N),
where M is the number of corpus samples in the corpus sample set identified as clickbait title corpus, N is the number of random corpus samples, and α is the previously measured probability that a titled corpus is clickbait title corpus.
In an embodiment, deciding to stop training the machine learning model if the determined proportion satisfies the predetermined condition specifically comprises:
if the determined proportion satisfies the predetermined condition, inputting a plurality of test corpora into the machine learning model, the machine learning model outputting recognition results for the plurality of test corpora;
receiving a judgment of the correctness of each recognition result;
and deciding to stop training the machine learning model according to the correctness judgments.
In an embodiment, deciding to stop training the machine learning model according to the correctness judgments specifically comprises:
if the proportion of correct recognition results among all recognition results exceeds a predetermined accuracy threshold, deciding to stop training the machine learning model.
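The test-based stopping decision just described reduces to an accuracy check over the human correctness judgments. A minimal sketch (the 90% threshold is an assumed value, not one given by the patent):

```python
def should_stop_training(correctness, accuracy_threshold=0.9):
    """Decide whether to stop training, given a human judgment of
    correctness (True/False) for each test-corpus recognition result."""
    if not correctness:
        return False  # no judgments yet: keep training
    accuracy = sum(correctness) / len(correctness)
    # Stop only if the proportion of correct results exceeds the
    # predetermined accuracy threshold.
    return accuracy > accuracy_threshold
```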
In one embodiment, after inputting the corpus sample set into the machine learning model, the method further comprises:
and configuring the machine learning model to extract the characteristics of the corpus samples in the corpus sample set, wherein the machine learning model identifies click bait title corpus based on the characteristics.
In one embodiment, the features include word features. Configuring the machine learning model to extract the features of the corpus samples in the corpus sample set specifically comprises: taking the words in the title of a corpus sample as features.
In one embodiment, the features include part-of-speech features. Configuring the machine learning model to extract the features of the corpus samples in the corpus sample set specifically comprises: taking each word in the title of a corpus sample, combined with its part of speech, as a feature.
In one embodiment, the features include semantic features. Configuring the machine learning model to extract the features of the corpus samples in the corpus sample set specifically comprises:
synthesizing the word vectors corresponding to the words into which the title of a corpus sample is segmented into a title semantic vector;
calculating a hash vector of the title semantic vector;
encoding the hash vector into a sparsely coded vector with a fixed number of elements;
determining element positions as the features based on the sparsely coded vector.
In an embodiment, determining element positions as the features based on the sparsely coded vector specifically comprises: taking the positions of the top n elements by element value in the sparsely coded vector as the features, where n is a preset positive integer.
In an embodiment, the synthesizing of the word vector corresponding to the word into which the title of the corpus sample is divided into the title semantic vector specifically includes:
segmenting the title of the corpus sample;
determining word vectors corresponding to the divided words;
and synthesizing the determined word vectors into a title semantic vector.
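The semantic-feature steps above (word vectors → title semantic vector → hash vector → sparse code → top-n positions) might be sketched as follows. This is an illustrative assumption of one plausible realization, not the patent's implementation: word vectors are looked up in a hypothetical `word_vectors` table, the title vector is taken as the element-wise mean, the "hash vector" as a deterministic hash-derived projection, and the features as the positions of the largest-magnitude entries of a fixed-length code.

```python
import hashlib

def title_semantic_features(words, word_vectors, code_len=16, n=3):
    """Sketch of the semantic-feature pipeline (assumed details).

    words: the segmented title words.
    word_vectors: dict mapping word -> list of floats (assumed table).
    Returns the positions of the top-n elements of a fixed-length
    code, used as the features.
    """
    # 1. Synthesize word vectors into a title semantic vector (mean).
    vecs = [word_vectors[w] for w in words if w in word_vectors]
    title_vec = [sum(col) / len(vecs) for col in zip(*vecs)]

    # 2. "Hash vector": project each component through deterministic
    #    hash-derived weights (one plausible reading of the step).
    def weight(i, j):
        h = hashlib.md5(f"{i},{j}".encode()).digest()
        return (h[0] / 255.0) * 2 - 1  # weight in [-1, 1]

    hash_vec = [sum(weight(i, j) * v for j, v in enumerate(title_vec))
                for i in range(code_len)]

    # 3. Sparse code with a fixed element count: keep only the top-n
    #    magnitudes (the rest are implicitly zero).
    order = sorted(range(code_len), key=lambda i: abs(hash_vec[i]),
                   reverse=True)

    # 4. The retained element positions themselves are the features.
    return sorted(order[:n])
```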
In one embodiment, the features include syntactic features. The configuring the machine learning model to extract the features of the corpus samples in the corpus sample set specifically includes:
constructing a syntax tree of a title of the corpus sample;
combining all word nodes under a part-of-speech node of the syntax tree into an extracted phrase;
and taking the extracted phrase as the feature.
In one embodiment, the taking the extracted phrase as the feature specifically includes:
determining the number of occurrences of each extracted phrase in the corpus;
and taking phrases whose occurrence count exceeds a predetermined count threshold as the features.
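The phrase-frequency filter in this embodiment amounts to counting the extracted phrases over a corpus and keeping the frequent ones. A minimal sketch, with the syntax-tree extraction step assumed already done and the threshold value assumed:

```python
from collections import Counter

def frequent_phrase_features(extracted_phrases, count_threshold=2):
    """Keep only phrases whose occurrence count in the corpus exceeds
    a predetermined count threshold (assumed value here)."""
    counts = Counter(extracted_phrases)
    return {p for p, c in counts.items() if c > count_threshold}
```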
According to a second aspect of the embodiments of the present invention, a method for identifying a heading corpus of a click bait is disclosed, which includes:
training a machine learning model according to the method of the first aspect of embodiments of the present invention;
and inputting the corpus to be recognized into the machine learning model to obtain a clickbait title corpus recognition result.
According to a third aspect of the embodiments of the present invention, a training apparatus for a click-bait heading corpus recognition model is disclosed, which includes:
an input unit for inputting a corpus sample set into a machine learning model, wherein the corpus sample set comprises corpus samples identified as clickbait title corpus and random corpus samples, and the proportion of the samples identified as clickbait title corpus among all corpus samples in the set is smaller than a predetermined proportion threshold;
a proportion determining unit for determining whether the proportion of the corpus samples that the machine learning model recognizes as clickbait title corpus, relative to the number of corpus samples in the corpus sample set, satisfies a predetermined condition;
and a training-stop judging unit for deciding to stop training the machine learning model if the determined proportion satisfies the predetermined condition.
According to a fourth aspect of the embodiments of the present invention, there is disclosed a click bait title corpus identifying apparatus, comprising:
the training apparatus for a clickbait title corpus recognition model according to the third aspect of embodiments of the present invention;
and a clickbait title corpus recognition unit for inputting the corpus to be recognized into the machine learning model to obtain a clickbait title corpus recognition result.
According to a fifth aspect of the embodiments of the present invention, there is disclosed a training apparatus for a click bait title corpus recognition model, comprising:
a memory storing computer readable instructions;
a processor reading computer readable instructions stored by the memory to perform a method according to the first aspect of embodiments of the present invention.
According to a sixth aspect of the embodiments of the present invention, there is disclosed a click bait title corpus identifying apparatus, comprising:
a memory storing computer readable instructions;
a processor reading computer readable instructions stored by the memory to perform a method according to the second aspect of embodiments of the present invention.
According to a seventh aspect of embodiments of the present invention, a computer program medium is disclosed, having stored thereon computer readable instructions which, when executed by a processor of a computer, cause the computer to perform the method for training a clickbait title corpus recognition model according to the first aspect of the present invention.
According to an eighth aspect of embodiments of the present invention, a computer program medium is disclosed, having stored thereon computer readable instructions which, when executed by a processor of a computer, cause the computer to perform the clickbait title corpus recognition method according to the second aspect of the present invention.
Unlike the prior art, in which all samples are manually labeled when training the machine learning model, embodiments of the present invention label only the corpus samples verified to be clickbait title corpus; the remaining samples are random corpus samples, which greatly reduces labeling cost during model training. To guard against extreme cases (generally, clickbait title corpus samples are rare among random corpus samples, but in an extreme case a large number of them might happen to appear among the random samples), the proportion of samples identified as clickbait title corpus is controlled to satisfy a predetermined condition, preventing such extreme sample distributions. Labeling cost is thereby reduced while a normal model training effect is ensured.
Additional features and advantages of the invention will be set forth in the detailed description which follows, or may be learned by practice of the invention.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the invention, as claimed.
Drawings
The above and other objects, features and advantages of the present invention will become more apparent by describing in detail exemplary embodiments thereof with reference to the attached drawings.
FIG. 1 illustrates a flowchart of a clickbait title corpus recognition model training method according to an example embodiment of the invention.
FIG. 2 illustrates a detailed flowchart for deciding to stop training the machine learning model according to an example embodiment of the present invention.
FIG. 3 illustrates a flowchart of a clickbait title corpus recognition model training method according to an example embodiment of the invention.
FIG. 4 illustrates a detailed flowchart for configuring a machine learning model to extract features of the corpus samples in a corpus sample set according to an example embodiment of the present invention.
FIG. 5 illustrates a detailed flowchart for configuring a machine learning model to extract features of the corpus samples in a corpus sample set according to an example embodiment of the present invention.
FIG. 6 illustrates a flowchart of a clickbait title corpus recognition method according to an example embodiment of the present invention.
FIG. 7 illustrates a block diagram of a clickbait title corpus recognition model training apparatus according to an example embodiment of the invention.
FIG. 8 illustrates a block diagram of a clickbait title corpus recognition apparatus according to an example embodiment of the present invention.
Fig. 9 illustrates a syntax tree structure diagram according to an exemplary embodiment of the present invention.
Fig. 10 is a diagram illustrating a structure of a click bait heading corpus recognition model training apparatus or a click bait heading corpus recognition apparatus according to an exemplary embodiment of the present invention.
Detailed Description
Example embodiments will now be described more fully with reference to the accompanying drawings. Example embodiments may, however, be embodied in many different forms and should not be construed as limited to the examples set forth herein; rather, these example embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of example embodiments to those skilled in the art. The drawings are merely schematic illustrations of the invention and are not necessarily drawn to scale. The same reference numerals in the drawings denote the same or similar parts, and thus their repetitive description will be omitted.
Furthermore, the described features, structures, or characteristics may be combined in any suitable manner in one or more example embodiments. In the following description, numerous specific details are provided to give a thorough understanding of example embodiments of the invention. One skilled in the relevant art will recognize, however, that the invention may be practiced without one or more of the specific details, or with other methods, components, steps, and so forth. In other instances, well-known structures, methods, implementations, or operations are not shown or described in detail to avoid obscuring aspects of the invention.
Some of the block diagrams shown in the figures are functional entities and do not necessarily correspond to physically or logically separate entities. These functional entities may be implemented in the form of software, or in one or more hardware modules or integrated circuits, or in different networks and/or processor devices and/or microcontroller devices.
FIG. 1 illustrates a flowchart of a click-to-bait heading corpus recognition model training method according to an example embodiment of the invention.
A corpus here refers to language material on the internet, such as articles and comments. Generally, a corpus has a title and a body, and in an ordinary corpus the title reflects the content of the body. A clickbait title is a title, fabricated to increase click volume, that does not match the body content; it is commonly called a 'headline party' title. A clickbait title corpus is a corpus with a clickbait title.
To identify clickbait title corpus, a machine learning method may be employed: a machine learning model is trained using a corpus sample set containing clickbait title corpus samples and non-clickbait title corpus samples. Once the model is trained, inputting a titled corpus yields a recognition result indicating whether it is clickbait title corpus. The clickbait title corpus recognition model is this trained machine learning model, and the training method for the clickbait title corpus recognition model refers to training it with a corpus sample set.
When training a machine learning model, the prior art inputs a large number of corpus samples manually pre-labeled as having clickbait titles and a large number manually pre-labeled as not having clickbait titles. The model extracts features from the samples and computes over those features with a constructed objective function to produce a recognition result: clickbait title or not. From the extracted features and the known recognition results, the model learns the coefficients of the objective function, so that when a titled corpus is input into the trained model, the model outputs a recognition result. The drawbacks are high labeling cost, labeling accuracy that varies from person to person, low modeling efficiency, and low recognition accuracy.
Unlike the prior art, in which all samples are manually labeled when training the machine learning model, embodiments of the present invention label only the corpus samples verified to be clickbait title corpus; the remaining samples are random corpus samples, which greatly reduces labeling cost during model training. To guard against extreme cases (generally, clickbait title corpus samples are rare among random corpus samples, but in an extreme case a large number of them might happen to appear among the random samples), the proportion of samples identified as clickbait title corpus is controlled to satisfy a predetermined condition, preventing such extreme sample distributions. Labeling cost is thereby reduced while a normal model training effect is ensured.
As shown in fig. 1, the training method of the click-through bait title corpus recognition model according to an embodiment of the present invention includes:
step 110, inputting a corpus sample set into a machine learning model, wherein the corpus sample set comprises corpus samples identified as clickbait title corpus and random corpus samples, and the proportion of the samples identified as clickbait title corpus among all corpus samples in the set is smaller than a predetermined proportion threshold;
step 120, determining whether the proportion of the corpus samples that the machine learning model recognizes as clickbait title corpus, relative to the number of corpus samples in the corpus sample set, satisfies a predetermined condition;
and step 130, if the determined proportion satisfies the predetermined condition, deciding to stop training the machine learning model.
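Steps 110 through 130 (with the rebuild step 131 described later) can be sketched as a loop: build a sample set, let the model label it, and stop once the recognized-clickbait proportion lands in an interval around the expected proportion β. The sketch below is illustrative only; `train_and_predict` and `rebuild_random_samples` are hypothetical helpers standing in for the model and the recrawling step, which the patent leaves abstract.

```python
def train_until_proportion_ok(labeled_clickbait, random_samples,
                              train_and_predict, rebuild_random_samples,
                              alpha, delta=0.0005, max_rounds=10):
    """Sketch of the stop criterion in steps 110-130 (assumed helpers).

    train_and_predict(samples) -> list of booleans (clickbait or not).
    rebuild_random_samples(n)  -> n fresh random corpus samples.
    alpha: previously measured clickbait probability.
    delta: assumed half-width of the interval around beta.
    """
    m, n = len(labeled_clickbait), len(random_samples)
    beta = (m + n * alpha) / (m + n)  # predetermined proportion
    for _ in range(max_rounds):
        samples = labeled_clickbait + random_samples      # step 110
        results = train_and_predict(samples)
        ratio = sum(results) / len(results)               # step 120
        if beta - delta <= ratio <= beta + delta:         # step 130
            return True  # proportion satisfies the condition: stop
        random_samples = rebuild_random_samples(n)        # step 131
    return False
```

With M = 100 labeled samples, N = 9900 random samples, and α = 0.1% (the worked example later in the description), β ≈ 1.10%, so a model recognizing 107 samples as clickbait (1.07%) would satisfy the condition.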
These steps are described in detail below.
In step 110, a corpus sample set is input into the machine learning model, the corpus sample set includes corpus samples identified as the click bait heading corpus and random corpus samples, wherein a ratio of the corpus samples identified as the click bait heading corpus to all corpus samples in the corpus sample set is less than a predetermined ratio threshold.
A corpus sample set refers to a set of corpora used to train a machine learning model. The corpus sample refers to a corpus used as a sample to train a machine learning model.
In one embodiment, step 110 includes:
receiving a corpus sample identified as a click bait title corpus;
acquiring a random corpus sample;
and inputting a corpus sample set consisting of the corpus samples identified as the click bait title corpus and the acquired random corpus samples into the machine learning model.
In one embodiment, the corpus samples identified as clickbait title corpus are manually identified, labeled, and entered in advance; receiving the corpus samples identified as clickbait title corpus means receiving these input samples. Identification and labeling may proceed as follows: corpus samples are randomly captured from the internet and manually checked for clickbait titles; each positive check yields one clickbait title corpus sample, and capturing and checking continue until a predetermined number of clickbait title corpus samples have been found. For example, suppose the predetermined number is 100 and a counter starts at 0. Each time a corpus sample captured from the internet is identified as clickbait title corpus, the counter is incremented by 1; when the counter reaches 100, the 100 corresponding samples are the corpus samples identified as clickbait title corpus.
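The collection procedure above is a counted loop over crawled samples. A sketch, with a hypothetical `crawl_sample` callable and `is_clickbait_by_human` standing in for the manual check:

```python
def collect_clickbait_samples(crawl_sample, is_clickbait_by_human,
                              target=100):
    """Keep crawling and manually checking samples until `target`
    clickbait title corpus samples have been found (assumed helpers)."""
    found = []
    counter = 0  # counter starts at 0, as in the example above
    while counter < target:
        sample = crawl_sample()
        if is_clickbait_by_human(sample):
            found.append(sample)
            counter += 1
    return found
```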
A random corpus sample is a corpus sample crawled at random from the internet; it may or may not be clickbait title corpus. Clickbait title corpus accounts for only a small fraction of all corpora on the internet.
When the corpus sample identified as the click bait title corpus is received and the random corpus sample is obtained, the corpus sample set consisting of the corpus sample identified as the click bait title corpus and the obtained random corpus sample can be input into the machine learning model.
Requiring the proportion of corpus samples identified as clickbait title corpus among all corpus samples in the set to be smaller than a predetermined proportion threshold means that, when the threshold is set and controlled relatively low, the labeling-cost savings of embodiments of the present invention are all the more pronounced. For example, setting the predetermined proportion threshold to 10% means that at most 10% of the corpus sample set needs to be labeled, with the rest left unlabeled, greatly reducing labeling cost.
In one embodiment, the predetermined proportion threshold is controlled at 10%. The inventor's repeated experiments show that a threshold of 10% achieves a good balance between guaranteeing recognition quality and controlling labeling cost.
In step 120, it is determined whether the ratio of the number of corpus samples identified by the machine learning model as the click bait title corpus to the number of corpus samples in the corpus sample set satisfies a predetermined condition.
As described above, when the machine learning model has been trained and a titled corpus is input to it, the model outputs a recognition result: either the input corpus is clickbait title corpus, or it is not. Dividing the number of recognition results identified as clickbait title corpus by the total number of recognition results gives the proportion of the corpus samples the machine learning model recognizes as clickbait title corpus relative to the number of corpus samples in the corpus sample set.
For example, for a corpus sample set of 10000 samples, if 107 corpus samples are recognized as clickbait title corpus, the determined proportion is 1.07%.
In one embodiment, the predetermined condition includes: the determined proportion falls within an interval around the predetermined proportion. Specifically, the predetermined proportion β is determined according to the following formula:
β=(M+N·α)/(M+N),
where M is the number of corpus samples in the corpus sample set identified as clickbait title corpus, N is the number of random corpus samples, and α is the previously measured probability that a titled corpus is clickbait title corpus.
Since N is the number of random corpus samples and α is the previously measured probability that a titled corpus is clickbait title corpus, N·α is the expected number of clickbait title corpus samples among the random samples. With M samples already identified as clickbait title corpus, M + N·α is the expected number of clickbait title corpus samples in the corpus sample set, and dividing it by the total number of samples M + N gives the expected proportion of clickbait title corpus samples in the set.
In one embodiment, α may be determined as follows: a large number of titled corpora are crawled from the internet in advance and manually checked one by one for clickbait titles. Dividing the number of corpora identified as clickbait title corpus by the total number crawled gives α.
In another embodiment, α may be requested from other applications or devices that have obtained α.
For example, if α = 0.1%, then for M = 100 and N = 9900, β = (100 + 9900 × 0.1%) / (100 + 9900) ≈ 1.10%.
Therefore, as long as the determined proportion falls within an interval around the predetermined proportion, the machine learning model's recognition results can be considered close to the statistical expectation, which essentially rules out serious sample bias (generally, clickbait title corpus samples are rare among random corpus samples, but in an extreme case a large number of them might happen to appear among the random samples, i.e., serious sample bias).
In one embodiment, the interval around the predetermined proportion is an interval whose endpoints are the predetermined proportion minus a specific value and the predetermined proportion plus that specific value. The interval may be open or closed. For example, if the specific value is 0.05% and β = 1.10%, the interval around the predetermined proportion may be (1.05%, 1.15%) or [1.05%, 1.15%]. If the determined proportion is 1.07%, it falls within this interval, and training of the machine learning model may be stopped.
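The worked example can be checked in a few lines. A sketch of the β computation and the interval test, with symbols as defined above (the 0.05% half-width is the example's specific value):

```python
def predetermined_proportion(m, n, alpha):
    """beta = (M + N*alpha) / (M + N)."""
    return (m + n * alpha) / (m + n)

def satisfies_condition(ratio, beta, specific_value=0.0005):
    """True if the determined ratio falls within the closed interval
    [beta - specific_value, beta + specific_value]."""
    return beta - specific_value <= ratio <= beta + specific_value

# M = 100, N = 9900, alpha = 0.1% gives beta ~ 1.10%
beta = predetermined_proportion(100, 9900, 0.001)
```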
In another embodiment, the interval around the predetermined proportion may instead have endpoints equal to the predetermined proportion minus a first value and the predetermined proportion plus a second value, where the two values differ. For example, if the first value is 0.04% and the second value is 0.06%, then with β = 1.10% the interval around the predetermined proportion may be (1.06%, 1.16%) or [1.06%, 1.16%].
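A sketch of the interval test, covering both the symmetric and asymmetric variants (parameter names are ours):

```python
def ratio_near_predetermined(ratio, beta, first, second, closed=True):
    """True when the model's identified ratio falls within the interval
    around beta whose endpoints are beta - first and beta + second."""
    low, high = beta - first, beta + second
    return low <= ratio <= high if closed else low < ratio < high

# beta = 1.10%, first = 0.04%, second = 0.06%  ->  interval [1.06%, 1.16%]
print(ratio_near_predetermined(0.0107, 0.0110, 0.0004, 0.0006))  # True
```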
In step 130, if the determined ratio satisfies the predetermined condition, it is determined that training of the machine learning model should be stopped.
If the determined ratio satisfies the predetermined condition, the samples have no serious bias (generally, click-bait title corpus samples make up only a small share of the random corpus samples, but in extreme cases a large number of them may happen to appear, i.e., serious sample bias). By requiring the proportion of samples identified as click-bait title corpus to meet the predetermined condition, serious bias is prevented, and a normal model training effect is ensured while the labeling cost is reduced.
As shown in fig. 1, in one embodiment, the method further comprises: step 131, if the determined proportion does not meet the predetermined condition, reconstructing the corpus sample set input into the machine learning model until the determined proportion meets the predetermined condition.
Reconstructing the corpus sample set mainly means replacing the random corpus samples in the set. The corpus samples already identified as click-bait title corpus are well identified, and re-identifying and re-labeling them would increase the labeling burden. Moreover, the predetermined condition may fail to be met because the randomly crawled corpus samples are not representative (for example, they happen to contain too much click-bait title corpus); replacing the random corpus samples in the set therefore helps train the machine learning model more objectively.
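Step 131 can be sketched as a re-draw loop. The `StubModel` class and the condition below are stand-ins for the real machine learning model and predetermined condition, not part of the patent:

```python
import random

class StubModel:
    """Hypothetical stand-in for the machine learning model."""
    def fit(self, samples):
        self.samples = samples
    def identified_ratio(self, samples):
        # fraction of samples the model identifies as click-bait titles
        return sum(s == "bait" for s in samples) / len(samples)

def retrain_with_fresh_random_samples(model, identified, pool, n,
                                      condition, max_rounds=20):
    """Keep the already-identified click-bait samples; re-draw the N
    random samples and retrain until the determined ratio meets the
    predetermined condition."""
    for _ in range(max_rounds):
        sample_set = identified + random.sample(pool, n)
        model.fit(sample_set)
        if condition(model.identified_ratio(sample_set)):
            return True
    return False

pool = ["bait"] * 10 + ["normal"] * 990
print(retrain_with_fresh_random_samples(
    StubModel(), ["bait"] * 2, pool, 100, lambda r: r < 0.2))  # True
```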
In addition, as shown in fig. 2, in an embodiment, the step 130 specifically includes:
step 1301, if the determined proportion meets a preset condition, inputting a plurality of test corpora into the machine learning model, and outputting the recognition results of the plurality of test corpora by the machine learning model;
step 1302, receiving a determination of the correctness of each recognition result;
and step 1303, judging to stop training the machine learning model according to the judgment result of the correctness.
The test corpus is a corpus used for testing the training effect of the machine learning model, and is different from the corpus samples in the corpus sample set. In one embodiment, the test corpus may also be crawled randomly from the Internet.
After the plurality of test corpora are input into the machine learning model, the machine learning model outputs recognition results of the plurality of test corpora.
In step 1302, according to one embodiment, the machine learning model recognition result for each corpus may be displayed on a display interface, and the expert determines whether the machine learning model recognition result is correct. Then, an input of an expert on a display interface is received, the input indicating a determination of correctness of each recognition result.
In step 1303, according to one embodiment, an accuracy threshold may be predetermined. If the ratio of correct recognition results to the total number of recognition results exceeds the predetermined accuracy threshold, it is determined that training of the machine learning model should stop. For example, if the predetermined accuracy threshold is 90%, the number of test corpora is 50, and 46 of the 50 corresponding recognition results are judged correct and 4 incorrect, the ratio of correct results to the total is 92%, which is above the 90% threshold, so training of the machine learning model can be stopped.
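The stopping decision of step 1303 amounts to a one-line comparison; the figures are those of the example above:

```python
def should_stop_training(judgements, threshold=0.90):
    """Stop when the fraction of recognition results judged correct
    exceeds the predetermined accuracy threshold."""
    return sum(judgements) / len(judgements) > threshold

# 46 of 50 recognition results judged correct -> 92% > 90%
print(should_stop_training([True] * 46 + [False] * 4))  # True
```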
A benefit of this embodiment is that training of the machine learning model is not stopped immediately once the determined proportion meets the predetermined condition; instead testing continues, training stops only if the test passes, and otherwise training continues. This improves the quality and recognition accuracy of the trained machine learning model.
As shown in fig. 3, after step 110, the method may further include, according to one embodiment of the invention: step 115, configuring the machine learning model to extract features of the corpus samples in the corpus sample set, wherein the machine learning model identifies click-bait title corpus based on the features.
In the field of machine learning, features are elements extracted from an input sample that affect the output of the machine learning model, or content derived from such elements. The machine learning model extracts features from the sample and operates on them through a constructed objective function to obtain the recognition result. The essence of machine learning is that, based on the extracted features and the known recognition results, the coefficients of the objective function are determined by learning. Therefore, after a corpus with a title is input into the trained machine learning model, features are extracted from the input corpus, and the model outputs the recognition result through the objective function with the learned coefficients.
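As a rough sketch of this objective-function view (the patent does not fix a particular model form; the logistic objective and the toy coefficients below are assumptions for illustration):

```python
import math

def recognize(features, coefficients, bias):
    """Score the extracted feature vector with learned coefficients and
    map it to a click-bait probability (logistic form assumed here)."""
    score = sum(c * x for c, x in zip(coefficients, features)) + bias
    return 1 / (1 + math.exp(-score))

# toy feature vector and learned coefficients; > 0.5 would mean "click-bait"
p = recognize([1.0, 0.0, 1.0], [2.0, -1.0, 0.5], -1.0)
print(p > 0.5)  # True
```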
In general, the goals of feature selection are roughly: improving prediction accuracy; building prediction models that are faster to construct and cheaper to run; and making the model easier to understand and interpret. The embodiment of the invention extracts word features, semantic features, and syntactic features from the sample, judging the corpus sample from the aspects of words, semantics, and syntax, and mitigating the low recognition accuracy that can result from extracting only a single kind of feature.
In one embodiment, the features include word features. Word features are features that consist of words. Step 115 specifically includes: and taking words in the title of the corpus sample as features.
In one embodiment, each word in the title of the corpus sample is taken as a feature. For example, for the title "aluminum is an important metal", the segmented words "aluminum", "is", "a", "kind", "important", "of", "metal" (the segmentation follows the original Chinese example) are all features. The title can be segmented into words by an existing word segmentation technique or system.
In another embodiment, after the title of the corpus sample is segmented into words, some of the segmented words are selected as features according to a predetermined criterion. In one embodiment, the predetermined criterion is, for example, that nouns and verbs serve as the features. In the example of the title "aluminum is an important metal", the title is segmented into the words "aluminum", "is", "a", "kind", "important", "of", "metal", of which the nouns and verbs are "aluminum", "is", "metal"; these are taken as the features.
In one embodiment, the features include word features. Step 115 specifically includes: combining each word in the title of the corpus sample with its part of speech as a feature. In the example of the title "aluminum is an important metal", the title is segmented into the words "aluminum", "is", "a", "kind", "important", "of", "metal", and "aluminum + noun", "is + verb", "a + numeral", "kind + measure word", "important + adjective", "of + particle", "metal + noun" are used as the features. Because the part of speech of a word strongly influences how the meaning of the whole sentence is interpreted, word + part of speech defines the meaning of the word more precisely, which is more favorable for optimizing the recognition result of the machine learning model.
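Both word-feature variants can be sketched over a pre-segmented, pre-tagged title. A real segmenter and tagger would supply the words and tags; the English words and tags below are illustrative renderings of the Chinese example:

```python
def word_features(tagged_title, keep_pos=None):
    """All words as features, or only words whose part of speech
    is in keep_pos (e.g. nouns and verbs)."""
    return [w for w, pos in tagged_title
            if keep_pos is None or pos in keep_pos]

def word_pos_features(tagged_title):
    """Each word combined with its part of speech as a feature."""
    return [f"{w}+{pos}" for w, pos in tagged_title]

title = [("aluminum", "noun"), ("is", "verb"), ("a", "numeral"),
         ("kind", "measure word"), ("important", "adjective"),
         ("of", "particle"), ("metal", "noun")]
print(word_features(title, keep_pos={"noun", "verb"}))  # ['aluminum', 'is', 'metal']
```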
In one embodiment, the features include semantic features. As shown in fig. 4, step 115 specifically includes:
step 1151, synthesizing word vectors corresponding to the words into which the titles of the corpus samples are divided into title semantic vectors;
step 1152, calculating a hash vector of the title semantic vector;
step 1153, encoding the hash vector into a sparsely encoded vector with a fixed element number;
step 1154, determining the element position as the feature based on the sparsely encoded vector.
These steps are described in detail below.
In step 1151, word vectors corresponding to the words into which the title of the corpus sample is divided are synthesized into title semantic vectors.
In one embodiment, step 1151 includes:
segmenting the title of the corpus sample;
determining word vectors corresponding to the divided words;
and synthesizing the determined word vectors into a title semantic vector.
An existing word segmentation method or system can be used to segment the title of the corpus sample into words. In the example of the title "aluminum is an important metal", the title is segmented into the words "aluminum", "is", "a", "kind", "important", "of", "metal".
When natural language from a web page is handed to the algorithms in a machine learning model, the language usually needs to be mathematized first, and a word vector is a mathematical representation of a word in natural language. Each word in a natural language (e.g., Chinese) can be mapped to a fixed-length vector, and all such vectors together form a word vector space, in which every word has a corresponding word vector and the word vectors of different words differ. In the word vector space, the closer two words are, the closer their parts of speech and semantics: words of different parts of speech are far apart; words of the same part of speech but quite different meanings are still some distance apart; words of the same part of speech and similar meaning are very close. If the word vector space is compared to a plane rectangular coordinate system, word vectors are like coordinates in that system: the smaller the distance between coordinates, the closer the parts of speech and semantics of the corresponding words. In one embodiment, the word vectors corresponding to the segmented words may be determined by training a neural network and inputting the words into it.
A title semantic vector is a vector representing the semantics of the entire title, generated from the word vector of each word in the title.
In one embodiment, the determined word vectors are synthesized into the title semantic vector by concatenating them in the order the words appear in the title. For example, suppose the word vector of "aluminum" is (25,34,8,158,3), that of "is" is (34,101,89,2,121), and that of "metal" is (57,9,91,46,201). Concatenating them in the order "aluminum", "is", "metal" gives (25,34,8,158,3,34,101,89,2,121,57,9,91,46,201). If each word vector has a elements and there are b word vectors, the concatenated title semantic vector has a·b elements.
In another embodiment, the determined word vectors may be synthesized into the title semantic vector by interleaving them. That is, the first element of each word vector is taken out, in the order the word vectors appear in the title, to form a first series; the second element of each word vector is then taken out in the same order and placed after the first series; then the third elements, after the second series; and so on. For example, with the word vector of "aluminum" (25,34,8,158,3), of "is" (34,101,89,2,121), and of "metal" (57,9,91,46,201), the synthesized title semantic vector is (25,34,57,34,101,9,8,89,91,158,2,46,3,121,201).
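Both synthesis schemes, using the example word vectors from the text (interleaving assumes equal-length word vectors):

```python
def concatenate(word_vectors):
    """Append word vectors in the order the words appear in the title."""
    return [x for vec in word_vectors for x in vec]

def interleave(word_vectors):
    """First elements of every vector, then second elements, and so on."""
    return [x for group in zip(*word_vectors) for x in group]

aluminum = (25, 34, 8, 158, 3)
is_word = (34, 101, 89, 2, 121)
metal = (57, 9, 91, 46, 201)
print(interleave([aluminum, is_word, metal]))
```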
In step 1152, a hash vector of the title semantic vector is calculated.
The hash vector is obtained by applying a hash algorithm to the title semantic vector. A hash algorithm extracts, from a large string, features that can represent it, thereby generating a shorter string; because of the extracted features, the short string still distinguishes the original string to the greatest possible extent. Treating the title semantic vector as a large string and applying a hash algorithm to it yields the hash vector, which concentrates the features of the title semantic vector.
In step 1153, the hash vector is encoded into a fixed element number of sparsely encoded vectors.
Sparse coding is an artificial neural network method that simulates the perception of simple cells in the V1 region of the primary visual cortex of the mammalian visual system. It has spatial locality, orientation, and frequency-domain band-pass properties, and is an adaptive image statistics method. Sparse coding offers large storage capacity, simple computation, and no loss of the original features. Because different titles are segmented into different numbers of words, the synthesized title semantic vectors, and hence the hash vectors after the hash operation, have different numbers of elements, which makes comparison between titles inconvenient. Encoding the hash vector into a sparsely coded vector with a fixed number of elements, i.e., a fixed length, preserves the characteristics of the original hash vector while making horizontal comparison between titles convenient.
In step 1154, based on the sparsely encoded vector, an element position as the feature is determined.
In one embodiment, the positions of the n largest element values in the sparsely coded vector are determined as the features, where n is a predetermined positive integer. The special case n = 1 is that the position of the largest element value in the sparsely coded vector is determined as the feature.
Element position refers to the position of an element in the sparsely encoded vector, e.g., the second element of the vector.
For example, if the sparsely coded vector is (122,48,86,3,88,4) and n is 2, the two largest of the six elements are 122 and 88, so positions 1 and 5 (the first and fifth elements of the vector) are the element positions determined as the features.
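The top-n element positions can be read off as follows (positions are 1-based, as in the example above):

```python
def top_n_positions(vector, n):
    """1-based positions of the n largest element values in the
    sparsely coded vector; these positions serve as the feature."""
    by_value = sorted(range(len(vector)), key=lambda i: vector[i],
                      reverse=True)
    return sorted(i + 1 for i in by_value[:n])

print(top_n_positions([122, 48, 86, 3, 88, 4], 2))  # [1, 5]
```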
For the semantic features extracted by the machine learning model, the embodiment of the invention adopts a special kind of semantic feature: element positions in the sparsely coded vector. These are obtained as follows: the word vectors of the words into which the title of the corpus sample is segmented are synthesized into a title semantic vector, the hash vector of the title semantic vector is calculated, the hash vector is encoded into a sparsely coded vector with a fixed number of elements, and the element positions serving as the features are determined from the sparsely coded vector. Because the title semantic vector is synthesized from the word vectors of the words of the title, it concentrates the characteristics of the words in the sentence. Sparse coding preserves these characteristics while giving every encoded vector the same number of elements, so the coded vectors can be compared with one another, and the element positions extracted from them concentrate the overall characteristics of the sentence. Representing sentence semantics with these features can improve the recognition accuracy of the machine learning model.
In one embodiment, the features include syntactic features. As shown in fig. 5, step 115 may include:
step 1151', constructing a syntax tree of a title of the corpus sample;
step 1152', synthesizing all word nodes under a part-of-speech node or phrase-type node of the syntax tree into an extracted phrase;
step 1153', the extracted phrase is taken as the feature.
These steps are described in detail below.
In step 1151', a syntax tree for the title of the corpus sample is constructed.
A syntax tree is a tree diagram representing the syntax of a sentence, i.e., the relationships between sentence components. One method of generating a syntax tree is: decompose the sentence into words and take the segmented words as the bottommost nodes 501 of the syntax tree; take the part of speech of each word as the node 502 directly above that word; combine adjacent words with close syntactic relations into phrases, and construct phrase-type nodes 503 for those phrases as the layer above the part-of-speech nodes; then combine adjacent, syntactically close phrases and remaining words into higher-layer phrases with corresponding higher-layer nodes; and so on, as shown in fig. 9, until the whole sentence is generated as the topmost node.
Part of speech is the property of a word, such as noun or verb. Phrase type is the property of a phrase, such as noun phrase or verb phrase. In the syntax tree, S denotes a sentence, NP a noun phrase, VP a verb phrase, AP an adjective phrase, and NUMER a numeral phrase; these all denote phrase types. N denotes nouns, V verbs, CARD cardinal numerals, QTF quantifiers, ADJ adjectives, and PART particles; these all denote parts of speech.
As shown in fig. 9, the parts of speech of the adjacent words "important" and "of" are ADJ and PART, which serve as the nodes directly above "important" and "of" respectively. The two words are syntactically close, so they are synthesized into a phrase whose type is AP, with the AP node as the parent of ADJ and PART. This phrase and the adjacent word "metal" are also syntactically close, and are synthesized into the phrase "important metal" of type NP, and so on until the topmost node S of the whole sentence is synthesized.
In step 1152', all the word nodes below a part-of-speech node or phrase-type node of the syntax tree are synthesized into an extracted phrase.
The term "phrase" is used here in a broad sense, referring to words in a sentence or semantic units assembled from words in a sentence.
A part-of-speech node is a node in the syntax tree representing the part of speech of a word, such as ADJ in fig. 9. A phrase-type node is a node representing the phrase type of a phrase, such as AP in fig. 9. All the word nodes below any part-of-speech node or phrase-type node in the syntax tree may form an extracted phrase. Since part-of-speech nodes correspond one-to-one to the segmented words, each segmented word can itself be an extracted phrase. A phrase-type node may have part-of-speech nodes, phrase-type nodes, and word nodes below it, and the words of the word nodes below it are extracted and synthesized into an extracted phrase. As shown in fig. 9, for the part-of-speech node ADJ, the word node "important" below it is an extracted phrase; for the phrase-type node AP, the word nodes "important" and "of" below it are combined into one extracted phrase; for the phrase-type node NP above the AP, the word nodes "important", "of", and "metal" below it are combined into the extracted phrase "important metal".
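Phrase extraction can be sketched over a nested-tuple encoding of the syntax tree (the encoding is our assumption; real parsers produce richer structures). Every part-of-speech or phrase-type node contributes the words below it as one extracted phrase:

```python
def extracted_phrases(tree):
    """tree: (label, child, ...) for internal nodes, a plain string for
    a word node. Returns one extracted phrase per non-word node."""
    phrases = []

    def words_below(node):
        if isinstance(node, str):      # word node: just a leaf word
            return [node]
        _label, *children = node
        leaves = [w for child in children for w in words_below(child)]
        phrases.append(" ".join(leaves))
        return leaves

    words_below(tree)
    return phrases

# fragment of the fig. 9 tree: NP over (AP over ADJ "important", PART "of") and N "metal"
tree = ("NP", ("AP", ("ADJ", "important"), ("PART", "of")), ("N", "metal"))
print(extracted_phrases(tree))
```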
In step 1153', the extracted phrase is used as the feature.
In one embodiment, all the extracted phrases are used as features. In another embodiment, only frequently occurring phrases may be selected as features: the number of occurrences of each extracted phrase in a corpus is determined, and phrases whose occurrence count exceeds a predetermined count threshold are used as the features.
The corpus here may be different from the corpus sample set; instead, a separate database storing many corpora is established, for example by crawling a large amount of corpora from the Internet into it. The extracted phrases are searched in this corpus to obtain their occurrence counts; phrases whose count exceeds the predetermined count threshold are taken as features, and phrases whose count falls below the threshold are discarded.
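The frequency filter can be sketched with a counter (here the corpus is simplified to a list of already-extracted phrases; a real corpus lookup would search stored documents):

```python
from collections import Counter

def frequent_phrases(extracted, corpus_phrases, count_threshold):
    """Keep extracted phrases whose occurrence count in the corpus
    exceeds the predetermined count threshold; discard the rest."""
    counts = Counter(corpus_phrases)
    return [p for p in extracted if counts[p] > count_threshold]

corpus = ["important metal"] * 5 + ["rare phrase"]
print(frequent_phrases(["important metal", "rare phrase"], corpus, 2))
```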
The advantage of extracting syntactic features with a syntax tree is that the tree accurately reflects how closely adjacent words in a sentence are related syntactically. Compared with arbitrarily combining adjacent words in a sentence into features, selecting syntactic features via the syntax tree yields features that effectively represent the syntactic relations of the words in the sentence, improving recognition accuracy.
As shown in fig. 6, according to an embodiment of the present invention, there is further provided a click-to-bait title corpus identifying method, including:
step 210, training a machine learning model according to the method for training the click bait title corpus recognition model described above with reference to fig. 1-5;
and step 220, inputting the corpus to be recognized into the machine learning model to obtain a recognition result of whether it is click-bait title corpus.
The corpus to be identified is corpus for which it is to be determined whether it is click-bait title corpus. After the corpus to be identified is input into the machine learning model, the model outputs a recognition result of whether the corpus is click-bait title corpus.
As shown in fig. 7, the apparatus for training a click-through bait heading corpus recognition model according to an embodiment of the present invention includes:
the input unit is used for inputting a corpus sample set into the machine learning model, wherein the corpus sample set comprises corpus samples identified as click bait title corpus and random corpus samples, and the ratio of the corpus samples identified as click bait title corpus to all the corpus samples in the corpus sample set is smaller than a preset ratio threshold;
the proportion determining unit is used for determining whether the proportion of the number of the corpus samples of the heading corpus identified as the click bait by the machine learning model to the number of the corpus samples in the corpus sample set meets a preset condition or not;
and the training stopping judging unit is used for judging to stop training the machine learning model if the determined ratio meets a preset condition.
In one embodiment, the predetermined proportion threshold is 10%.
In one embodiment, the predetermined condition includes:
the determined ratio falls within an interval around the predetermined ratio.
In one embodiment, the interval around the predetermined ratio includes an interval whose endpoints are the predetermined ratio minus a specific value, and the predetermined ratio plus a specific value.
In one embodiment, the apparatus further comprises:
and a reconstructing unit (not shown) configured to, if the determined proportion does not satisfy the predetermined condition, reconstruct the corpus sample set input to the machine learning model until the determined proportion satisfies the predetermined condition.
In one embodiment, the predetermined duty ratio β is determined according to the following formula:
β=(M+N·α)/(M+N),
wherein, M is the number of corpus samples identified as the click-bait title corpus in the corpus sample set, N is the number of the random corpus samples, and α is the probability that the title corpus is the click-bait title corpus, which is counted in advance.
In one embodiment, the stop training determination unit 730 is further configured to:
if the determined ratio meets a preset condition, inputting a plurality of test corpora into the machine learning model, and outputting the recognition results of the plurality of test corpora by the machine learning model;
receiving a determination of the correctness of each recognition result;
and judging to stop training the machine learning model according to the judgment result of the correctness.
In an embodiment, the determining to stop training the machine learning model according to the determination result of correctness specifically includes:
and if the ratio of the correct recognition result to the total number of the recognition results exceeds a preset correct rate threshold value, judging to stop training the machine learning model.
In one embodiment, the apparatus further comprises:
a configuration unit (not shown) configured to configure the machine learning model to extract features of the corpus samples in the corpus sample set, wherein the machine learning model identifies click bait title corpus based on the features.
In one embodiment, the features include word features. The configuration unit is further configured to:
and taking words in the title of the corpus sample as features.
In one embodiment, the features include word features. The configuration unit is further configured to:
and combining the words in the title of the corpus sample with the part of speech of the words as features.
In one embodiment, the features include semantic features. The configuration unit is further configured to:
synthesizing word vectors corresponding to the words into which the titles of the corpus samples are divided into title semantic vectors;
calculating a hash vector of the title semantic vector;
encoding the hash vector into a sparsely encoded vector of a fixed element number;
determining element positions as the features based on the sparsely encoded vector.
In an embodiment, the determining the element position as the feature based on the sparsely encoded vector specifically includes:
and determining the positions of the n largest element values in the sparsely coded vector as the features, where n is a predetermined positive integer.
In an embodiment, the synthesizing of the word vector corresponding to the word into which the title of the corpus sample is divided into the title semantic vector specifically includes:
segmenting the title of the corpus sample;
determining word vectors corresponding to the divided words;
and synthesizing the determined word vectors into a title semantic vector.
In one embodiment, the features include syntactic features. The configuration unit is further configured to:
constructing a syntax tree of a title of the corpus sample;
synthesizing all word nodes under a part-of-speech node or phrase-type node of the syntax tree into an extracted phrase;
and taking the extracted phrase as the feature.
In one embodiment, the taking the extracted phrase as the feature specifically includes:
determining the occurrence times of the extracted phrases in the corpus;
and taking phrases with the occurrence times exceeding a preset time threshold as the characteristics.
As shown in fig. 8, according to an embodiment of the present invention, there is also provided a click-through bait title corpus identifying apparatus including:
a click bait heading corpus recognition model training device 810 for training a machine learning model according to the method described above;
and a click bait heading corpus identifying unit 820, configured to input the corpus to be identified into the machine learning model, so as to obtain a click bait heading corpus identifying result.
The click-through decoy heading corpus recognition model training apparatus or the click-through decoy heading corpus recognition apparatus 800 according to this embodiment of the present invention will be described with reference to fig. 10. The click-to-bait corpus recognition model training apparatus or the click-to-bait corpus recognition apparatus 800 shown in fig. 10 is only an example, and should not bring any limitations to the functions and the scope of the embodiments of the present invention.
As shown in fig. 10, the click bait heading corpus recognition model training apparatus or the click bait heading corpus recognition apparatus 800 is embodied in the form of a general-purpose computing device. Components of the click bait heading corpus recognition model training device or the click bait heading corpus recognition device 800 may include, but are not limited to: the at least one processing unit 810, the at least one memory unit 820, and a bus 830 that couples the various system components including the memory unit 820 and the processing unit 810.
Wherein the storage unit stores program code that can be executed by the processing unit 810, such that the processing unit 810 performs the steps according to various exemplary embodiments of the present invention described in the description part of the above exemplary methods of the present specification. For example, the processing unit 810 may perform various steps as shown in fig. 1.
The storage unit 820 may include readable media in the form of volatile memory units such as a random access memory unit (RAM)8201 and/or a cache memory unit 8202, and may further include a read only memory unit (ROM) 8203.
The storage unit 820 may also include a program/utility 8204 having a set (at least one) of program modules 8205, such program modules 8205 including, but not limited to: an operating system, one or more application programs, other program modules, and program data, each of which, or some combination thereof, may comprise an implementation of a network environment.
Bus 830 may be any of several types of bus structures including a memory unit bus or memory unit controller, a peripheral bus, an accelerated graphics port, a processing unit, or a local bus using any of a variety of bus architectures.
The click bait heading corpus recognition model training device or the click bait heading corpus recognition device 800 may also communicate with one or more external devices 700 (e.g., keyboard, pointing device, bluetooth device, etc.), with one or more devices that enable a user to interact with the click bait heading corpus recognition model training device or the click bait heading corpus recognition device 800, and/or with any devices (e.g., router, modem, etc.) that enable the click bait heading corpus recognition model training device or the click bait heading corpus recognition device 800 to communicate with one or more other computing devices. Such communication may occur via an input/output (I/O) interface 650. Also, the click decoy corpus recognition model training device or the click decoy corpus recognition device 800 may also communicate with one or more networks (e.g., a Local Area Network (LAN), a Wide Area Network (WAN), and/or a public network, such as the internet) via the network adapter 860. As shown, the network adapter 860 communicates with the click bait heading corpus recognition model training device or other modules of the click bait heading corpus recognition device 800 via a bus 830. It should be appreciated that, although not shown, other hardware and/or software modules may be used in conjunction with the click bait corpus recognition model training device or the click bait corpus recognition device 800, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage systems, among others.
Through the above description of the embodiments, those skilled in the art will readily understand that the exemplary embodiments described herein may be implemented by software, or by software in combination with necessary hardware. Therefore, the technical solution according to the embodiments of the present invention can be embodied in the form of a software product, which can be stored in a non-volatile storage medium (which may be a CD-ROM, a USB flash drive, a removable hard disk, etc.) or on a network, and which includes several instructions to cause a computing device (which may be a personal computer, a server, a terminal device, a network device, etc.) to execute the method according to the embodiments of the present invention.
In an exemplary embodiment of the present invention, there is also provided a computer program medium having stored thereon computer readable instructions which, when executed by a processor of a computer, cause the computer to perform the method described in the above method embodiment.
According to an embodiment of the present invention, there is also provided a program product for implementing the method in the above method embodiment, which may employ a portable compact disc read-only memory (CD-ROM), include program code, and be run on a terminal device such as a personal computer. However, the program product of the present invention is not limited in this regard; in the present document, a readable storage medium may be any tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device.
The program product may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. A readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium include: an electrical connection having one or more wires, a portable disk, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
A computer readable signal medium may include a propagated data signal with readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A readable signal medium may also be any readable medium that is not a readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Program code for carrying out operations of the present invention may be written in any combination of one or more programming languages, including object-oriented programming languages such as Java and C++, as well as conventional procedural programming languages such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on the remote computing device or server. Where a remote computing device is involved, the remote computing device may be connected to the user's computing device through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computing device (e.g., through the Internet using an Internet service provider).
It should be noted that although several modules or units of the device for action execution are mentioned in the above detailed description, such a division is not mandatory. Indeed, according to embodiments of the present invention, the features and functions of two or more modules or units described above may be embodied in one module or unit. Conversely, the features and functions of one module or unit described above may be further divided into and embodied by a plurality of modules or units.
Moreover, although the steps of the methods of the present invention are depicted in the drawings in a particular order, this does not require or imply that the steps must be performed in that particular order, or that all of the depicted steps must be performed, to achieve desirable results. Additionally or alternatively, certain steps may be omitted, multiple steps may be combined into a single step, and/or a single step may be decomposed into multiple steps.
Other embodiments of the invention will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This application is intended to cover any variations, uses, or adaptations of the invention following, in general, the principles of the invention and including such departures from the present disclosure as come within known or customary practice within the art to which the invention pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the invention being indicated by the following claims.

Claims (22)

1. A training method of a click bait title corpus recognition model is characterized by comprising the following steps:
inputting a corpus sample set into a machine learning model, wherein the corpus sample set comprises corpus samples identified as click bait title corpus and random corpus samples, and the ratio of the corpus samples identified as click bait title corpus to all corpus samples in the corpus sample set is smaller than a preset ratio threshold;
determining whether the ratio of the number of corpus samples identified by the machine learning model as click bait title corpus to the number of corpus samples in the corpus sample set meets a predetermined condition;
if the determined ratio meets the predetermined condition, inputting a plurality of test corpora into the machine learning model, wherein the machine learning model outputs recognition results of the plurality of test corpora;
receiving a determination of the correctness of each recognition result;
and determining, according to the results of the correctness determinations, whether to stop training the machine learning model.
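The training-control flow recited in claim 1, together with the stopping criterion elaborated in claim 7, can be sketched as follows. The helper names and toy numbers are illustrative assumptions, not part of the claims:

```python
# Illustrative sketch of the training-control flow of claim 1; the helper
# names and toy numbers are assumptions, not part of the claims.

def identified_ratio(predictions):
    """Fraction of corpus samples the model labels as click bait title corpus."""
    return sum(predictions) / len(predictions)

def should_stop_training(correctness_results, accuracy_threshold):
    """Stop training once the share of correct recognition results on the
    test corpora exceeds the accuracy threshold (the criterion of claim 7)."""
    correct = sum(1 for r in correctness_results if r)
    return correct / len(correctness_results) > accuracy_threshold

# Example: the model flags 3 of 20 samples in the corpus sample set.
predictions = [1, 1, 1] + [0] * 17
ratio = identified_ratio(predictions)  # 0.15

# Example: 3 of 4 test recognitions judged correct, against a 0.7 threshold.
stop = should_stop_training([True, True, True, False], 0.7)  # True
```

If `stop` is false, more test rounds (or further training with a reconstructed sample set, per claim 5) would follow before the model is accepted.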
2. The method of claim 1, wherein the predetermined proportion threshold is 10%.
3. The method of claim 1, wherein the predetermined condition comprises:
the determined ratio falls within an interval around the predetermined ratio.
4. The method according to claim 3, wherein the interval around the predetermined ratio includes an interval whose endpoints are the predetermined ratio minus a specific value and the predetermined ratio plus the specific value.
5. The method of claim 1, further comprising:
and if the determined proportion does not meet the preset condition, reconstructing the corpus sample set and inputting the corpus sample set into the machine learning model until the determined proportion meets the preset condition.
6. The method of claim 3, wherein the predetermined condition comprises:
the determined ratio falls within an interval around a predetermined ratio, said predetermined ratio β being determined according to the following formula:
β=(M+N·α)/(M+N),
wherein M is the number of corpus samples in the corpus sample set identified as click bait title corpus, N is the number of random corpus samples, and α is a pre-computed probability that a title corpus is click bait title corpus.
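For concreteness, the predetermined ratio β of claim 6 and the interval condition of claims 3-4 can be computed as below; the concrete values of M, N, α, and the interval half-width are made-up assumptions:

```python
# Sketch of the predetermined-ratio formula of claim 6 and the interval
# condition of claims 3-4; all concrete numbers are illustrative assumptions.

def predetermined_ratio(m, n, alpha):
    """beta = (M + N*alpha) / (M + N): M click-bait-labeled samples, N random
    samples, alpha the pre-computed probability a title is click bait."""
    return (m + n * alpha) / (m + n)

def meets_condition(observed_ratio, beta, delta):
    """Claims 3-4: the observed ratio falls in [beta - delta, beta + delta]."""
    return beta - delta <= observed_ratio <= beta + delta

# M = 100 identified click-bait samples, N = 900 random samples, alpha = 5%:
beta = predetermined_ratio(100, 900, 0.05)  # (100 + 45) / 1000 = 0.145
ok = meets_condition(0.15, beta, 0.01)      # 0.15 lies within [0.135, 0.155]
```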
7. The method according to claim 1, wherein determining to stop training the machine learning model according to the result of the correctness determination comprises:
and if the ratio of the number of correct recognition results to the total number of recognition results exceeds a predetermined accuracy threshold, determining to stop training the machine learning model.
8. The method of claim 1, wherein after inputting the corpus sample set into the machine learning model, the method further comprises:
and configuring the machine learning model to extract the characteristics of the corpus samples in the corpus sample set, wherein the machine learning model identifies click bait title corpus based on the characteristics.
9. The method of claim 8, wherein the feature comprises a word feature,
the configuring the machine learning model to extract the features of the corpus samples in the corpus sample set specifically includes:
and taking words in the title of the corpus sample as features.
10. The method of claim 8, wherein the feature comprises a word feature,
the configuring the machine learning model to extract the features of the corpus samples in the corpus sample set specifically includes:
and using, as features, combinations of the words in the title of the corpus sample with the parts of speech of those words.
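The word features of claim 9 and the word-plus-part-of-speech features of claim 10 amount to simple feature templates. A toy sketch, with the segmented and POS-tagged title assumed as given input:

```python
# Toy sketch of the word features (claim 9) and word/part-of-speech
# combination features (claim 10); the tagged title is an assumed input.

def word_features(tagged_title):
    """Claim 9: the words of the title themselves serve as features."""
    return [word for word, _ in tagged_title]

def word_pos_features(tagged_title):
    """Claim 10: each word combined with its part of speech is one feature."""
    return [f"{word}/{pos}" for word, pos in tagged_title]

tagged = [("shocking", "ADJ"), ("secret", "NOUN"), ("revealed", "VERB")]
wf = word_features(tagged)      # ['shocking', 'secret', 'revealed']
wp = word_pos_features(tagged)  # ['shocking/ADJ', 'secret/NOUN', 'revealed/VERB']
```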
11. The method of claim 8, wherein the features comprise semantic features,
the configuring the machine learning model to extract the features of the corpus samples in the corpus sample set specifically includes:
synthesizing word vectors corresponding to the words into which the titles of the corpus samples are divided into title semantic vectors;
calculating a hash vector of the title semantic vector;
encoding the hash vector into a sparsely encoded vector of a fixed element number;
determining element positions as the features based on the sparsely encoded vector.
12. The method according to claim 11, wherein the determining element positions as the features based on the sparsely encoded vectors comprises:
and determining, as the features, the element positions of the top n element values in the sparsely encoded vector, wherein n is a predetermined positive integer.
13. The method according to claim 11, wherein synthesizing the word vectors corresponding to the words into which the titles of the corpus samples are divided into the title semantic vectors specifically comprises:
segmenting the title of the corpus sample;
determining word vectors corresponding to the divided words;
and synthesizing the determined word vectors into a title semantic vector.
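Claims 11-13 describe a pipeline from word vectors to sparse-coded element positions. A sketch under stated assumptions: toy 8-dimensional word vectors, summation as the synthesis step, and a sign-based random-projection hash standing in for the unspecified hashing and sparse-coding scheme:

```python
import random

# Sketch of the semantic-feature pipeline of claims 11-13. The toy word
# vectors, summation synthesis, and sign-based random-projection hash are
# assumptions; the claims do not fix particular algorithms.
random.seed(0)
DIM = 8
WORD_VECTORS = {w: [random.gauss(0, 1) for _ in range(DIM)]
                for w in ("breaking", "news", "shock")}

def title_semantic_vector(words):
    """Claim 13: synthesize the word vectors of the segmented title (by sum)."""
    return [sum(WORD_VECTORS[w][i] for w in words) for i in range(DIM)]

def sparse_code(vec, planes):
    """Claim 11: hash the title vector into a fixed-length 0/1 vector via
    random projections (one element per hyperplane)."""
    return [1.0 if sum(p[i] * vec[i] for i in range(DIM)) > 0 else 0.0
            for p in planes]

def top_n_positions(coded, n):
    """Claim 12: the positions of the n largest elements are the features."""
    order = sorted(range(len(coded)), key=lambda i: coded[i], reverse=True)
    return sorted(order[:n])

planes = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(16)]
vec = title_semantic_vector(["breaking", "news"])
features = top_n_positions(sparse_code(vec, planes), n=4)  # 4 positions in 0..15
```

The fixed element number of claim 11 corresponds here to the 16 hyperplanes; a real system would choose the hash family and dimensionality to suit its corpus.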
14. The method of claim 8, wherein the features comprise syntactic features,
the configuring the machine learning model to extract the features of the corpus samples in the corpus sample set specifically includes:
constructing a syntax tree of a title of the corpus sample;
synthesizing all the word nodes under a part-of-speech node of the syntax tree into an extracted phrase;
and taking the extracted phrase as the feature.
15. The method according to claim 14, wherein the using the extracted phrase as the feature specifically includes:
determining the number of occurrences of the extracted phrases in the corpus;
and taking, as the features, phrases whose number of occurrences exceeds a predetermined count threshold.
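The syntactic-feature extraction of claims 14 and 15 can be sketched with a toy nested-tuple syntax tree; the tree format, labels, and count threshold are assumptions, and real use would rely on a syntactic parser:

```python
from collections import Counter

# Toy sketch of claims 14-15: collect, under every part-of-speech node of a
# syntax tree, the phrase formed by its word leaves, then keep phrases that
# occur often enough in the corpus. Tree format and labels are assumptions.

def phrases_under(node):
    """Return (all phrases in this subtree, text of this subtree)."""
    if isinstance(node, str):          # a word leaf
        return [], node
    phrases, words = [], []
    for child in node[1:]:             # node[0] is the POS/constituent label
        sub, text = phrases_under(child)
        phrases.extend(sub)
        words.append(text)
    phrase = " ".join(words)
    phrases.append(phrase)             # the phrase under this POS node
    return phrases, phrase

def frequent_phrases(per_title_phrases, count_threshold):
    """Claim 15: keep phrases whose occurrence count exceeds the threshold."""
    counts = Counter(p for ps in per_title_phrases for p in ps)
    return {p for p, c in counts.items() if c > count_threshold}

tree = ("NP", ("ADJ", "shocking"), ("N", "secret"))
phrases, _ = phrases_under(tree)   # ['shocking', 'secret', 'shocking secret']
kept = frequent_phrases([phrases, ["shocking"]], 1)  # {'shocking'}
```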
16. A method for identifying a click-to-bait heading corpus, comprising:
training a machine learning model according to the method of any one of claims 1-15;
and inputting the linguistic data to be recognized into the machine learning model to obtain a recognition result of the clicked decoy title linguistic data.
17. A training device for a click bait title corpus recognition model, comprising:
the input unit is used for inputting a corpus sample set into the machine learning model, wherein the corpus sample set comprises corpus samples identified as click bait title corpus and random corpus samples, and the ratio of the corpus samples identified as click bait title corpus to all the corpus samples in the corpus sample set is smaller than a preset ratio threshold;
the proportion determining unit is used for determining whether the ratio of the number of corpus samples identified by the machine learning model as click bait title corpus to the number of corpus samples in the corpus sample set meets a predetermined condition;
a training stopping judgment unit, configured to input a plurality of test corpora into the machine learning model if the determined proportion satisfies a predetermined condition, where the machine learning model outputs a recognition result of the plurality of test corpora; receiving a determination of the correctness of each recognition result; and judging to stop training the machine learning model according to the judgment result of the correctness.
18. A click-to-bait title corpus identification device, comprising:
the click bait title corpus recognition model training device of claim 17, which trains a machine learning model;
and the click bait title corpus identification unit is used for inputting the corpus to be identified into the machine learning model to obtain a click bait title corpus identification result.
19. A training device for a click bait title corpus recognition model, comprising:
a memory storing computer readable instructions;
a processor reading computer readable instructions stored by the memory to perform the method of any of claims 1-15.
20. A click-to-bait title corpus identification device, comprising:
a memory storing computer readable instructions;
a processor that reads computer readable instructions stored by the memory to perform the method of claim 16.
21. A computer program medium having stored thereon computer readable instructions which, when executed by a processor of a computer, cause the computer to perform the click bait heading corpus recognition model training method of any one of claims 1-15.
22. A computer program medium having stored thereon computer readable instructions which, when executed by a processor of a computer, cause the computer to perform the click bait title corpus identification method of claim 16.
CN201810246454.5A 2018-03-23 2018-03-23 Method and device for training click bait title corpus recognition model Active CN108491389B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810246454.5A CN108491389B (en) 2018-03-23 2018-03-23 Method and device for training click bait title corpus recognition model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810246454.5A CN108491389B (en) 2018-03-23 2018-03-23 Method and device for training click bait title corpus recognition model

Publications (2)

Publication Number Publication Date
CN108491389A CN108491389A (en) 2018-09-04
CN108491389B true CN108491389B (en) 2021-10-08

Family

ID=63319498

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810246454.5A Active CN108491389B (en) 2018-03-23 2018-03-23 Method and device for training click bait title corpus recognition model

Country Status (1)

Country Link
CN (1) CN108491389B (en)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111125355A (en) * 2018-10-31 2020-05-08 北京国双科技有限公司 Information processing method and related equipment
CN109635111A (en) * 2018-12-04 2019-04-16 国网江西省电力有限公司信息通信分公司 A kind of news click bait detection method based on network migration
CN109376229A (en) * 2018-12-04 2019-02-22 国网江西省电力有限公司信息通信分公司 A kind of click bait detection method based on convolutional neural networks
CN109740738B (en) * 2018-12-29 2022-12-16 腾讯科技(深圳)有限公司 Neural network model training method, device, equipment and medium
CN110210022B (en) * 2019-05-22 2022-12-27 北京百度网讯科技有限公司 Title identification method and device
US11409589B1 (en) 2019-10-23 2022-08-09 Relativity Oda Llc Methods and systems for determining stopping point
CN112329907A (en) * 2020-12-24 2021-02-05 北京百度网讯科技有限公司 Dialogue processing method and device, electronic equipment and storage medium
CN112966103B (en) * 2021-02-05 2022-04-19 成都信息工程大学 Mixed attention mechanism text title matching method based on multi-task learning

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107220355A (en) * 2017-06-02 2017-09-29 北京百度网讯科技有限公司 News Quality estimation method, equipment and storage medium based on artificial intelligence
CN107291708A (en) * 2016-03-30 2017-10-24 《中国学术期刊(光盘版)》电子杂志社有限公司 A kind of method of text based automatic identification literature research
CN107491436A (en) * 2017-08-21 2017-12-19 北京百度网讯科技有限公司 A kind of recognition methods of title party and device, server, storage medium
CN107741933A (en) * 2016-08-08 2018-02-27 北京京东尚科信息技术有限公司 Method and apparatus for detecting text
CN107797986A (en) * 2017-10-12 2018-03-13 北京知道未来信息技术有限公司 A kind of mixing language material segmenting method based on LSTM CNN

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9396723B2 (en) * 2013-02-01 2016-07-19 Tencent Technology (Shenzhen) Company Limited Method and device for acoustic language model training
CN104063427A (en) * 2014-06-06 2014-09-24 北京搜狗科技发展有限公司 Expression input method and device based on semantic understanding

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107291708A (en) * 2016-03-30 2017-10-24 《中国学术期刊(光盘版)》电子杂志社有限公司 A kind of method of text based automatic identification literature research
CN107741933A (en) * 2016-08-08 2018-02-27 北京京东尚科信息技术有限公司 Method and apparatus for detecting text
CN107220355A (en) * 2017-06-02 2017-09-29 北京百度网讯科技有限公司 News Quality estimation method, equipment and storage medium based on artificial intelligence
CN107491436A (en) * 2017-08-21 2017-12-19 北京百度网讯科技有限公司 A kind of recognition methods of title party and device, server, storage medium
CN107797986A (en) * 2017-10-12 2018-03-13 北京知道未来信息技术有限公司 A kind of mixing language material segmenting method based on LSTM CNN

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
"Everything"s Coming Up Sprouts";Kylee Mclntyre;《Beijing Review》;20151015;第58卷(第42期);第41-42页 *
"点击的诱惑与媒介素养——"标题党"现象再议";钟靖 等;《新闻记者》;20130205(第2期);第60-64页 *

Also Published As

Publication number Publication date
CN108491389A (en) 2018-09-04

Similar Documents

Publication Publication Date Title
CN108491389B (en) Method and device for training click bait title corpus recognition model
CN110489555B (en) Language model pre-training method combined with similar word information
US11574122B2 (en) Method and system for joint named entity recognition and relation extraction using convolutional neural network
US11941366B2 (en) Context-based multi-turn dialogue method and storage medium
CN111221939B (en) Scoring method and device and electronic equipment
CN112131350B (en) Text label determining method, device, terminal and readable storage medium
CN109992664B (en) Dispute focus label classification method and device, computer equipment and storage medium
CN108875059B (en) Method and device for generating document tag, electronic equipment and storage medium
WO2020133960A1 (en) Text quality inspection method, electronic apparatus, computer device and storage medium
CN109086265B (en) Semantic training method and multi-semantic word disambiguation method in short text
CN110704621A (en) Text processing method and device, storage medium and electronic equipment
CN106570180A (en) Artificial intelligence based voice searching method and device
CN111401079A (en) Training method and device of neural network machine translation model and storage medium
CN111783450B (en) Phrase extraction method and device in corpus text, storage medium and electronic equipment
CN110263304B (en) Statement encoding method, statement decoding method, device, storage medium and equipment
CN116628186B (en) Text abstract generation method and system
CN112364664A (en) Method and device for training intention recognition model and intention recognition and storage medium
US20220156489A1 (en) Machine learning techniques for identifying logical sections in unstructured data
CN111324831A (en) Method and device for detecting fraudulent website
CN113672731A (en) Emotion analysis method, device and equipment based on domain information and storage medium
Lagus et al. Topic identification in natural language dialogues using neural networks
CN113761875B (en) Event extraction method and device, electronic equipment and storage medium
CN111161730A (en) Voice instruction matching method, device, equipment and storage medium
CN110555212A (en) Document verification method and device based on natural language processing and electronic equipment
CN114676699A (en) Entity emotion analysis method and device, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CP01 Change in the name or title of a patent holder

Address after: 310052 Room 301, Building No. 599, Changhe Street Network Business Road, Binjiang District, Hangzhou City, Zhejiang Province

Patentee after: Hangzhou NetEase Shuzhifan Technology Co.,Ltd.

Address before: 310052 Room 301, Building No. 599, Changhe Street Network Business Road, Binjiang District, Hangzhou City, Zhejiang Province

Patentee before: HANGZHOU LANGHE TECHNOLOGY Ltd.