CN107609094B

CN107609094B - Data disambiguation method and device and computer equipment

Info

Publication number: CN107609094B
Application number: CN201710807103.2A
Authority: CN
Inventors: 刘琼琼
Original assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Current assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Priority date: 2017-09-08
Filing date: 2017-09-08
Publication date: 2020-12-04
Anticipated expiration: 2037-09-08
Also published as: CN107609094A

Abstract

The invention provides a data disambiguation method, a device, and computer equipment. The method comprises: labeling each piece of data in training data based on a category to be classified, to obtain a plurality of pieces of first data labeled as belonging to the category to be classified and a plurality of pieces of second data labeled as not belonging to the category to be classified; determining, based on a user click behavior log, a feature related to each piece of first data as a first feature and a feature related to each piece of second data as a second feature; and training labels corresponding to each piece of first data and each piece of second data according to the first feature and the second feature. The method and the device enable deep mining of the data in the user click behavior log, extraction of the referenceable data in it for analysis, and the combination of scenes from multiple directions, thereby greatly improving data disambiguation accuracy, reducing the time and cost of data disambiguation, and improving the effect of automatic disambiguation.

Description

Data disambiguation method and device and computer equipment

Technical Field

The present invention relates to the field of data processing technologies, and in particular, to a data disambiguation method, apparatus, and computer device.

Background

In the related art, generally, a machine learning method and a dictionary are adopted to disambiguate the categories of data, or a named entity recognition technology is adopted to recognize the categories such as names of people, places, and organizations.

Disclosure of Invention

The present invention is directed to solving, at least to some extent, one of the technical problems in the related art.

Therefore, one objective of the present invention is to provide a data disambiguation method that can deeply mine the data of a user click behavior log, extract the referenceable data in it for analysis, and combine scenes from multiple directions, thereby greatly improving data disambiguation accuracy, reducing the time and cost of data disambiguation, and improving the effect of automatic disambiguation.

Another object of the present invention is to provide a data disambiguation apparatus.

Another object of the invention is to propose a computer device.

It is another object of the invention to propose a non-transitory computer-readable storage medium.

It is a further object of the invention to propose a computer program product.

In order to achieve the above object, an embodiment of the first aspect of the present invention provides a data disambiguation method, including: constructing training data; labeling each piece of data in the training data based on a category to be classified to obtain a plurality of pieces of first data labeled as belonging to the category to be classified and a plurality of pieces of second data labeled as not belonging to the category to be classified; determining, based on a user click behavior log, a feature related to each piece of first data as a first feature and a feature related to each piece of second data as a second feature, the first feature and the second feature comprising literal features and user behavior features; and training labels corresponding to each piece of first data and each piece of second data according to the first feature and the second feature.

In the data disambiguation method provided in the embodiment of the first aspect of the present invention, training data is constructed; each piece of data in the training data is labeled based on a category to be classified, to obtain a plurality of pieces of first data labeled as belonging to the category to be classified and a plurality of pieces of second data labeled as not belonging to the category to be classified; based on a user click behavior log, a feature related to each piece of first data is determined as a first feature and a feature related to each piece of second data is determined as a second feature, the first feature and the second feature comprising literal features and user behavior features; and the labels corresponding to each piece of first data and each piece of second data are trained according to the first feature and the second feature. In this way, the data of the user click behavior log can be deeply mined, the referenceable data in it can be extracted for analysis, and scenes can be combined from multiple directions, which greatly improves data disambiguation accuracy while reducing the time and cost of data disambiguation and improving the effect of automatic disambiguation.

In order to achieve the above object, a data disambiguation apparatus according to an embodiment of the second aspect of the invention includes: a construction module for constructing training data; the marking module is used for marking each piece of data in the training data based on the category to be classified to obtain a plurality of pieces of first data marked as belonging to the category to be classified and a plurality of pieces of second data marked as not belonging to the category to be classified; a feature determination module, configured to determine, based on the user click behavior log, a feature related to each piece of first data as a first feature, and a feature related to each piece of second data as a second feature, where the first feature and the second feature include: literal features and user behavior features; and the training module is used for training the labels corresponding to each piece of first data and each piece of second data according to the first characteristics and the second characteristics.

In the data disambiguation apparatus according to the embodiment of the second aspect of the present invention, training data is constructed; each piece of data in the training data is labeled based on a category to be classified, to obtain a plurality of pieces of first data labeled as belonging to the category to be classified and a plurality of pieces of second data labeled as not belonging to the category to be classified; based on a user click behavior log, a feature related to each piece of first data is determined as a first feature and a feature related to each piece of second data is determined as a second feature, the first feature and the second feature comprising literal features and user behavior features; and the labels corresponding to each piece of first data and each piece of second data are trained according to the first feature and the second feature. In this way, the data of the user click behavior log can be deeply mined, the referenceable data in it can be extracted for analysis, and scenes can be combined from multiple directions, which greatly improves data disambiguation accuracy while reducing the time and cost of data disambiguation and improving the effect of automatic disambiguation.

To achieve the above object, a computer device according to an embodiment of the third aspect of the present invention includes: a processor, a memory, a power supply circuit, a multimedia component, an audio component, an input/output (I/O) interface, a sensor component, and a communication component; wherein the circuit board is arranged in the space enclosed by the shell, and the processor and the memory are arranged on the circuit board; the power supply circuit is used for supplying power to each circuit or device of the computer equipment; the memory is used for storing executable program code; and the processor runs a program corresponding to the executable program code by reading the executable program code stored in the memory, so as to perform: constructing training data; labeling each piece of data in the training data based on a category to be classified to obtain a plurality of pieces of first data labeled as belonging to the category to be classified and a plurality of pieces of second data labeled as not belonging to the category to be classified; determining, based on a user click behavior log, a feature related to each piece of first data as a first feature and a feature related to each piece of second data as a second feature, the first feature and the second feature comprising literal features and user behavior features; and training labels corresponding to each piece of first data and each piece of second data according to the first feature and the second feature.

In the computer device according to the embodiment of the third aspect of the present invention, training data is constructed; each piece of data in the training data is labeled based on a category to be classified, to obtain a plurality of pieces of first data labeled as belonging to the category to be classified and a plurality of pieces of second data labeled as not belonging to the category to be classified; based on a user click behavior log, a feature related to each piece of first data is determined as a first feature and a feature related to each piece of second data is determined as a second feature, the first feature and the second feature comprising literal features and user behavior features; and the labels corresponding to each piece of first data and each piece of second data are trained according to the first feature and the second feature. In this way, the data of the user click behavior log can be deeply mined, the referenceable data in it can be extracted for analysis, and scenes can be combined from multiple directions, which greatly improves data disambiguation accuracy while reducing the time and cost of data disambiguation and improving the effect of automatic disambiguation.

To achieve the above object, a non-transitory computer-readable storage medium is provided in a fourth aspect of the present invention, and when instructions in the storage medium are executed by a processor of a mobile terminal, the instructions enable the mobile terminal to execute a data disambiguation method, the method comprising: constructing training data; labeling each piece of data in the training data based on a category to be classified to obtain a plurality of pieces of first data labeled as belonging to the category to be classified and a plurality of pieces of second data labeled as not belonging to the category to be classified; determining a feature related to each piece of first data as a first feature and a feature related to each piece of second data as a second feature based on the user click behavior log, the first feature and the second feature comprising: literal features and user behavior features; and training labels corresponding to each piece of first data and each piece of second data according to the first characteristics and the second characteristics.

With the non-transitory computer-readable storage medium according to the embodiment of the fourth aspect of the present invention, training data is constructed; each piece of data in the training data is labeled based on a category to be classified, to obtain a plurality of pieces of first data labeled as belonging to the category to be classified and a plurality of pieces of second data labeled as not belonging to the category to be classified; based on a user click behavior log, a feature related to each piece of first data is determined as a first feature and a feature related to each piece of second data is determined as a second feature, the first feature and the second feature comprising literal features and user behavior features; and the labels corresponding to each piece of first data and each piece of second data are trained according to the first feature and the second feature. In this way, the data of the user click behavior log can be deeply mined, the referenceable data in it can be extracted for analysis, and scenes can be combined from multiple directions, which greatly improves data disambiguation accuracy while reducing the time and cost of data disambiguation and improving the effect of automatic disambiguation.

To achieve the above object, a computer program product according to a fifth embodiment of the present invention executes a data disambiguation method when instructions of the computer program product are executed by a processor, the method comprising: constructing training data; labeling each piece of data in the training data based on a category to be classified to obtain a plurality of pieces of first data labeled as belonging to the category to be classified and a plurality of pieces of second data labeled as not belonging to the category to be classified; determining a feature related to each piece of first data as a first feature and a feature related to each piece of second data as a second feature based on the user click behavior log, the first feature and the second feature comprising: literal features and user behavior features; and training labels corresponding to each piece of first data and each piece of second data according to the first characteristics and the second characteristics.

In the computer program product according to the embodiment of the fifth aspect of the present invention, training data is constructed; each piece of data in the training data is labeled based on a category to be classified, to obtain a plurality of pieces of first data labeled as belonging to the category to be classified and a plurality of pieces of second data labeled as not belonging to the category to be classified; based on a user click behavior log, a feature related to each piece of first data is determined as a first feature and a feature related to each piece of second data is determined as a second feature, the first feature and the second feature comprising literal features and user behavior features; and the labels corresponding to each piece of first data and each piece of second data are trained according to the first feature and the second feature. In this way, the data of the user click behavior log can be deeply mined, the referenceable data in it can be extracted for analysis, and scenes can be combined from multiple directions, which greatly improves data disambiguation accuracy while reducing the time and cost of data disambiguation and improving the effect of automatic disambiguation.

Additional aspects and advantages of the invention will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the invention.

Drawings

The above and additional aspects and advantages of the present invention will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings of which:

FIG. 1 is a flow chart illustrating a data disambiguation method according to an embodiment of the present invention;

FIG. 2 is a flow chart illustrating a data disambiguation method according to another embodiment of the present invention;

FIG. 3 is a flow chart illustrating a data disambiguation method according to another embodiment of the present invention;

FIG. 4 is a schematic structural diagram of a data disambiguation apparatus according to an embodiment of the present invention;

FIG. 5 is a schematic structural diagram of a data disambiguation apparatus according to another embodiment of the present invention;

FIG. 6 is a schematic structural diagram of a computer device according to an embodiment of the present invention.

Detailed Description

Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the accompanying drawings are illustrative only for the purpose of explaining the present invention, and are not to be construed as limiting the present invention. On the contrary, the embodiments of the invention include all changes, modifications and equivalents coming within the spirit and terms of the claims appended hereto.

FIG. 1 is a schematic flow chart of a data disambiguation method according to an embodiment of the present invention.

Referring to FIG. 1, the method includes:

S11: Constructing training data.

In the embodiment of the present invention, proper names are taken as an example of the training data, although the training data is not limited to this.

The proper name may be, for example, "jazz", "baby", etc.

Commonly used proper names (e.g., "jazz", "baby") can be collected to obtain the training data. The training data refers to proper names before training, each usually a word or a phrase, such as "jazz", "baby", "Beijing university", or "warm wind in summer".

S12: Labeling each piece of data in the training data based on the category to be classified, to obtain a plurality of pieces of first data labeled as belonging to the category to be classified and a plurality of pieces of second data labeled as not belonging to the category to be classified.

The category to be classified is a category to which a piece of training data may belong. For example, for the training data "jazz", the category to be classified may be "video"; this is not limited.

In the embodiment of the present invention, after the training data is constructed, each piece of data in the training data may be labeled based on the category to be classified. Specifically, the category of each piece of data may be initially labeled based on practical labeling experience or based on a named entity recognition technology; training of the label of each piece of data is then triggered to disambiguate the training data.

For example, if the category to be classified is "video", initially labeling "jazz" in the training data yields the label that "jazz" belongs to the category "video"; initially labeling "baby" yields the label that "baby" does not belong to the category "video"; initially labeling "Beijing university" yields the label that "Beijing university" does not belong to the category "video"; and initially labeling "warm wind in summer" yields the label that "warm wind in summer" belongs to the category "video".

Furthermore, after each piece of data in the training data is labeled based on the category to be classified, a plurality of pieces of training data belonging to the category to be classified are used as first data, and a plurality of pieces of training data not belonging to the category to be classified are used as second data.

For example, the training data "jazz" and "warm wind in summer" are taken as the first data, and the training data "baby" and "Beijing university" are taken as the second data.
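The split into first data and second data described in S12 can be sketched as follows. This is a minimal illustration only: the function name `split_by_label` and the dictionary representation of the initial labels are assumptions, and the labels themselves are the ones given in the example above for the category "video".

```python
from typing import Dict, List, Tuple


def split_by_label(labels: Dict[str, bool]) -> Tuple[List[str], List[str]]:
    """Partition training data into (first data, second data).

    `labels` maps each proper name to True if it was initially labeled as
    belonging to the category to be classified, and False otherwise.
    """
    first = [name for name, belongs in labels.items() if belongs]
    second = [name for name, belongs in labels.items() if not belongs]
    return first, second


# Initial labels for the category "video", taken from the example in the text.
initial_labels = {
    "jazz": True,
    "baby": False,
    "Beijing university": False,
    "warm wind in summer": True,
}

first_data, second_data = split_by_label(initial_labels)
```

Here `first_data` holds the names labeled as "video" and `second_data` holds the rest, in the order in which they were labeled.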

S13: determining a feature related to each piece of first data as a first feature and a feature related to each piece of second data as a second feature based on the user click behavior log, the first and second features comprising: literal features and user behavior features.

The user click behavior log is automatically generated by a background server system of the electronic equipment according to some actual application scenes.

Some internet-based clicking behaviors of the user can be recorded in the user clicking behavior log, for example, clicking behaviors of a website URL, clicking behaviors of a picture, clicking behaviors of a hyperlink, and the like.

In the embodiment of the invention, the user click behavior log records Internet-based click behaviors of users and is generated automatically from actual application scenes. Training the labels corresponding to the training data in combination with the user click behavior log therefore allows the category to which the training data belongs to be identified with reference to those scenes, which makes the identification more comprehensive.

The literal features may be, for example, the word segmentation result and the number of segments of a piece of first data; the literal features of different pieces of training data may be the same or different.

The user behavior feature may be, for example, a click feature, a search feature, a show feature, and the like, without limitation.

In the embodiment of the invention, a word segmenter based on a dictionary matching algorithm, a word segmenter based on a learning algorithm, or the like can be used to determine the literal features corresponding to each piece of training data.

In the embodiment of the present invention, in order to train the label corresponding to the training data in combination with the user click behavior log, a feature related to each piece of first data may be determined based on the user click behavior log as a first feature, and a feature related to each piece of second data may be determined as a second feature. That is, a plurality of universal website URLs matching each piece of training data may first be obtained from a search engine: specifically, the piece of training data is input as a search word into the search box of the search engine, a search is triggered, and the plurality of universal website URLs are obtained from the search results. Then, the number of times users click each universal website URL is counted from the user click behavior log and used as the click feature corresponding to the piece of training data, that is, as the user behavior feature.

Alternatively, the number of times that the user searches each universal website URL may be counted from the user behavior log as the search feature corresponding to the piece of training data.

Or, the number of times that each universal website URL is presented by the application in the internet may be counted from the user behavior log, and the counted number is used as the presentation feature corresponding to the piece of training data, which is not limited.
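The three counting variants above (click, search, and show) can be sketched together. The log format is not specified in the text, so the `(action, url)` record shape, the action names, and the example URLs below are all hypothetical; only the idea of counting per-URL occurrences for the URLs matched to a piece of training data comes from the description.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

# Hypothetical log record: (action, url), with action in {"click", "search", "show"}.
LogRecord = Tuple[str, str]


def behavior_features(
    log: Iterable[LogRecord], matched_urls: Iterable[str]
) -> Dict[str, Dict[str, int]]:
    """Count clicks, searches, and shows per URL, restricted to the
    universal website URLs matched to one piece of training data."""
    wanted = set(matched_urls)
    counts = {"click": Counter(), "search": Counter(), "show": Counter()}
    for action, url in log:
        if action in counts and url in wanted:
            counts[action][url] += 1
    return {action: dict(counter) for action, counter in counts.items()}


sample_log = [
    ("click", "https://video.example/jazz"),
    ("click", "https://video.example/jazz"),
    ("search", "https://video.example/jazz"),
    ("show", "https://music.example/jazz"),
    ("click", "https://other.example/"),  # not matched to "jazz", so ignored
]
features = behavior_features(
    sample_log,
    ["https://video.example/jazz", "https://music.example/jazz"],
)
```

A single pass over the log fills all three counters, so the click, search, and show features stay consistent with one another.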

Optionally, in some embodiments, referring to FIG. 2, S13 may include:

S201: Determining the length feature of each piece of first data and each piece of second data, respectively.

The length feature may be, for example, the number of characters in each piece of first data and each piece of second data, which is not limited.

S202: Performing word segmentation on each piece of first data and each piece of second data, respectively, to obtain word segmentation results, and taking the length features and the word segmentation results as the literal features.
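As a rough sketch of S201 and S202, the literal features can be computed with a dictionary-based segmenter. Forward maximum matching is used here as one common dictionary matching algorithm; the text does not say which algorithm is actually used, and the function names, the toy dictionary, and the unspaced input string are assumptions for illustration.

```python
from typing import Dict, List, Set, Union


def forward_max_match(text: str, dictionary: Set[str]) -> List[str]:
    """Greedy dictionary segmentation: at each position take the longest
    dictionary word, falling back to a single character."""
    max_len = max((len(word) for word in dictionary), default=1)
    tokens: List[str] = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in dictionary:
                tokens.append(piece)
                i += size
                break
    return tokens


def literal_features(text: str, dictionary: Set[str]) -> Dict[str, Union[int, List[str]]]:
    """Length feature (S201) plus word segmentation result (S202)."""
    tokens = forward_max_match(text, dictionary)
    return {"length": len(text), "tokens": tokens, "token_count": len(tokens)}


# Toy dictionary and an unspaced string standing in for a proper name.
toy_dictionary = {"beijing", "university"}
feats = literal_features("beijinguniversity", toy_dictionary)
```

A learned segmenter could replace `forward_max_match` without changing the shape of the returned features.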

S203: determining category keywords belonging to a category to be classified from a preset category keyword library, and generating a first keyword set according to the category keywords belonging to the category to be classified.

The preset category keyword library may be established by big data statistics. For example, back-end staff may collect statistics on users' search behavior on a search engine and store, in the preset category keyword library, frequently searched keywords that may belong to the category to be classified and frequently searched keywords that may not belong to it. Alternatively, the preset category keyword library may be established by machine learning: for example, keywords that are frequently searched by users are obtained from web pages using web technologies such as crawlers, and those that may belong to the category to be classified and those that may not are stored in the preset category keyword library. This is not limited herein.

S204: Determining category keywords that do not belong to the category to be classified from the preset category keyword library, and generating a second keyword set according to the category keywords that do not belong to the category to be classified.
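Steps S203 and S204 amount to partitioning the keyword library by category. A minimal sketch follows, assuming the library can be represented as a mapping from each keyword to the categories it may belong to; that representation, the function name, and the sample keywords are hypothetical.

```python
from typing import Dict, Set, Tuple


def keyword_sets(
    library: Dict[str, Set[str]], category: str
) -> Tuple[Set[str], Set[str]]:
    """Return (first keyword set, second keyword set) for `category`.

    The first set holds keywords that may belong to the category to be
    classified (S203); the second holds all remaining keywords (S204).
    """
    first = {keyword for keyword, cats in library.items() if category in cats}
    second = set(library) - first
    return first, second


# Hypothetical contents of a preset category keyword library.
library = {
    "trailer": {"video"},
    "episode": {"video"},
    "admissions": {"education"},
    "stroller": {"shopping"},
}
first_keywords, second_keywords = keyword_sets(library, "video")
```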

S205: Determining category URLs belonging to the category to be classified from the user click behavior log, and generating a first URL set according to the category URLs belonging to the category to be classified.

For example, the category to be classified may be input as a search word into the search box of a search engine and a search triggered, and a plurality of universal website URLs obtained from the search results. Then, according to the user click behavior log, the first URL set is generated from those of the universal website URLs that actually link to the category to be classified. That is, assuming the category to be classified is "video", a plurality of category URLs that actually link to "video" are determined from the universal website URLs corresponding to "video", and the first URL set is generated from these category URLs.

S206: Determining category URLs that do not belong to the category to be classified according to the universal URL negative example set, and generating a second URL set according to the category URLs that do not belong to the category to be classified.

Optionally, the universal URL negative example set may be generated in advance. Specifically, it may be generated from the universal website URLs, where a universal URL negative example is a URL, among the universal website URLs clicked by users, other than the universal website URLs matching the category to be classified, each piece of first data, and each piece of second data.

For example, the second URL set may be generated from the universal URL negative example set by determining, among the universal URL negative examples, the category URLs that do not actually link to the category to be classified. That is, assuming the category to be classified is "video", a plurality of category URLs that do not actually link to "video" are determined from the universal URL negative example set, and the second URL set is generated from these category URLs.

S207: Taking the first keyword set and the first URL set as a first related recommendation corresponding to the first data, and taking the second keyword set and the second URL set as a second related recommendation corresponding to the second data.

In the embodiment of the invention, the first keyword set and the first URL set are together used as the first related recommendation corresponding to the first data, and the second keyword set and the second URL set are together used as the second related recommendation corresponding to the second data. The category to which the training data belongs can thus be identified by combining scenes with the literal features of the training data, which makes the identification more comprehensive and improves its accuracy.

S208: Determining a first number of times according to the user click behavior log, and taking the first number of times as the click feature corresponding to the first data, where the first number of times covers: the number of times users click the first related recommendation, the number of times users search the first related recommendation, and the number of times the category keywords in the first keyword set are contained in the titles corresponding to the website URLs in the first URL set.

S209: Determining a second number of times according to the user click behavior log, and taking the second number of times as the click feature corresponding to the second data, where the second number of times covers: the number of times users click the second related recommendation, the number of times users search the second related recommendation, and the number of times the category keywords in the second keyword set are contained in the titles corresponding to the website URLs in the second URL set.

S210: Taking the literal feature and the click feature of each piece of first data as the corresponding first feature, and taking the literal feature and the click feature of each piece of second data as the corresponding second feature.

The embodiment of the invention thus provides a method for determining, based on the user click behavior log, the literal features and the user behavior features related to each piece of training data. The method considers the reference categories more completely: it combines the click features determined from the user click behavior log with the keywords in the preset category keyword library. According to the user click behavior log, it determines the number of times users click the related recommendations, the number of times users search the related recommendations, and the number of times the category keywords in the first keyword set and the second keyword set are contained in the titles corresponding to the website URLs in the first URL set and the second URL set, and takes these counts as the click features corresponding to the training data. By deeply mining the data of the user click behavior log and extracting the referenceable data in it for analysis, scenes can be combined from multiple directions, which greatly improves proper name recognition accuracy.
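The three counts that make up the click feature in S208 and S209 can be sketched as below. The text does not say how the three counts are combined or how the log is structured, so this sketch simply returns them side by side; the `(action, target)` record shape, the `RelatedRecommendation` class, the `titles` mapping, and the sample values are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, Sequence, Tuple


@dataclass(frozen=True)
class RelatedRecommendation:
    """A keyword set plus a URL set, as paired in S207."""
    keywords: FrozenSet[str]
    urls: FrozenSet[str]


def click_feature(
    log: Sequence[Tuple[str, str]],
    rec: RelatedRecommendation,
    titles: Dict[str, str],
) -> Dict[str, int]:
    """Count clicks and searches on the recommendation, plus keyword
    occurrences in the titles of the recommendation's URLs."""
    items = rec.keywords | rec.urls
    clicks = sum(1 for action, target in log if action == "click" and target in items)
    searches = sum(1 for action, target in log if action == "search" and target in items)
    title_hits = sum(
        titles.get(url, "").count(keyword)
        for url in rec.urls
        for keyword in rec.keywords
    )
    return {"clicks": clicks, "searches": searches, "title_keyword_hits": title_hits}


rec = RelatedRecommendation(
    keywords=frozenset({"trailer", "episode"}),
    urls=frozenset({"https://video.example/jazz"}),
)
log = [
    ("click", "https://video.example/jazz"),
    ("search", "trailer"),
    ("search", "trailer"),
    ("click", "https://other.example/"),
]
titles = {"https://video.example/jazz": "Jazz official trailer and episode guide"}
feature = click_feature(log, rec, titles)
```

Calling the same function with the second related recommendation would give the click feature for the second data (S209).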

S14: Training the labels corresponding to each piece of first data and each piece of second data according to the first feature and the second feature.

Optionally, in some embodiments, referring to FIG. 3, S14 may include:

S301: Generating a first candidate URL set according to the click feature of the first data, and generating a second candidate URL set according to the click feature of the second data.

The first candidate URL set may include those of the plurality of universal website URLs corresponding to the first data that users actually clicked. The second candidate URL set may include those of the plurality of universal website URLs corresponding to the second data that users actually clicked.

S302: Filtering the first candidate URL set and the second candidate URL set, respectively, according to the universal URL negative example set, to obtain a first current URL set and a second current URL set.

In the embodiment of the invention, the website URLs in the first candidate URL set that also appear in the universal URL negative example set can be deleted, and the second candidate URL set is processed in the same way, which can effectively improve the matching accuracy when the user click behavior log is used as a reference.

The universal URL negative example set may be obtained from the user click behavior log by statistics, which is not limited.

S303: Screen out, from the first current URL set and the second current URL set respectively, the URLs whose number of clicks is greater than or equal to a first preset value, as a first target URL set and a second target URL set.

The first preset value is set in advance. It may be set by a user according to requirements, or may be preset by a factory program of the device executing the data disambiguation method, which is not limited herein.

In the embodiment of the invention, screening the URLs in the first current URL set and the second current URL set further improves the matching accuracy when the user click behavior log is used as a reference, and thus improves the reference value of the user click behavior log.
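The filtering of S302 and the screening of S303 can be sketched as a set difference followed by a threshold on click counts. The function names, the sample URLs, and the threshold value of 2 are assumptions made for illustration.

```python
from collections import Counter


def current_url_set(candidates, negative_set):
    """S302: drop every candidate URL that is a universal negative example."""
    return set(candidates) - set(negative_set)


def target_url_set(current_set, click_counts, first_preset_value):
    """S303: keep URLs whose click count reaches the first preset value."""
    return {url for url in current_set
            if click_counts.get(url, 0) >= first_preset_value}


clicked = ["a.example/v1", "a.example/v1", "a.example/v1",
           "b.example/x", "portal.example", "portal.example"]
click_counts = Counter(clicked)
current = current_url_set(click_counts, negative_set={"portal.example"})
print(sorted(target_url_set(current, click_counts, first_preset_value=2)))
# ['a.example/v1']
```

In this example `portal.example` is removed as a negative example even though it was clicked twice, and `b.example/x` is removed because it was clicked only once.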

S304: Judge whether the similarity between each of the first target URL set and the second target URL set and the historically mined candidate URL set meets a preset condition.

The preset condition is set in advance. It may be set by a user according to requirements, or may be preset by a factory program of the device executing the data disambiguation method, which is not limited herein.

The preset condition is as follows: the proportion of the first target URL set and the second target URL set occupied by the URLs that they share with the historically mined candidate URL set is greater than or equal to a preset threshold.
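One way to read the preset condition is as an overlap ratio checked separately for each target URL set. The sketch below follows that reading, which is an assumption; the function name, the treatment of an empty set, and the threshold values are likewise illustrative.

```python
def meets_preset_condition(target_set, historical_set, threshold):
    """Return True if the share of target URLs that also occur in the
    historically mined candidate URL set is at least `threshold`."""
    if not target_set:
        return False  # assumption: an empty target set never qualifies
    overlap = len(target_set & historical_set) / len(target_set)
    return overlap >= threshold


target = {"a", "b", "c", "d"}
historical = {"a", "b", "c", "x"}
print(meets_preset_condition(target, historical, 0.7))  # True  (3/4 = 0.75)
print(meets_preset_condition(target, historical, 0.8))  # False
```

Dividing by the size of the target set, rather than by the size of the union, matches the wording that the intersection is measured as a proportion of the target URL sets.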

S305: Take the URLs meeting the preset condition as a first final URL set and a second final URL set.

In the embodiment of the invention, taking the URLs that meet the preset condition as the first final URL set and the second final URL set further improves the matching accuracy when the user click behavior log is used as a reference, and thus improves the reference value of the user click behavior log.

S306: Take the first final URL set, the second final URL set, the first features, and the second features as the input of a GBDT (gradient boosting decision tree) algorithm, and take the output of the algorithm as the classification model corresponding to the category to be classified.

Each URL in the first final URL set and the second final URL set may be used as input of the GBDT decision tree algorithm, and the first features and the second features are likewise used as input of the algorithm. Because the first final URL set and the second final URL set are associated with each other (that is, both are divided with respect to the same category to be classified, for example "video"), the actual category to which each piece of training data belongs can be learned from these inputs. The output of the GBDT decision tree algorithm is taken as the classification model corresponding to the category to be classified, and this model is used to correct the label corresponding to each piece of training data.
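The training step can be sketched with a deliberately simplified gradient boosting model: depth-1 regression trees (stumps) fitted to the negative gradient of the log loss. This is a pure-Python stand-in rather than the embodiment's actual GBDT configuration, and the numeric encoding of each training row (length, click count, share of clicked URLs found in the final URL set) is a hypothetical choice.

```python
import math


def fit_stump(X, residuals):
    """Fit a depth-1 regression tree to the residuals by least squares."""
    best = None
    for j in range(len(X[0])):
        for t in sorted({row[j] for row in X}):
            left = [r for row, r in zip(X, residuals) if row[j] <= t]
            right = [r for row, r in zip(X, residuals) if row[j] > t]
            if not left or not right:
                continue
            lv, rv = sum(left) / len(left), sum(right) / len(right)
            err = (sum((r - lv) ** 2 for r in left)
                   + sum((r - rv) ** 2 for r in right))
            if best is None or err < best[0]:
                best = (err, j, t, lv, rv)
    if best is None:  # no valid split: fall back to a constant stump
        mean = sum(residuals) / len(residuals)
        return (0, float("inf"), mean, mean)
    return best[1:]


def stump_predict(stump, row):
    j, t, lv, rv = stump
    return lv if row[j] <= t else rv


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_gbdt(X, y, n_trees=20, learning_rate=0.5):
    """Boosting loop: each stump fits the negative log-loss gradient y - p."""
    scores = [0.0] * len(X)
    trees = []
    for _ in range(n_trees):
        residuals = [yi - sigmoid(s) for yi, s in zip(y, scores)]
        stump = fit_stump(X, residuals)
        trees.append(stump)
        scores = [s + learning_rate * stump_predict(stump, row)
                  for s, row in zip(scores, X)]
    return learning_rate, trees


def predict(model, row):
    """Return 1 (belongs to the category to be classified) or 0."""
    learning_rate, trees = model
    score = sum(learning_rate * stump_predict(t, row) for t in trees)
    return 1 if sigmoid(score) >= 0.5 else 0


# Hypothetical rows: [length, click count, share of URLs in final URL set]
X = [[4, 12, 0.9], [6, 9, 0.8], [5, 15, 1.0],   # first data (label 1)
     [4, 1, 0.1], [7, 0, 0.0], [5, 2, 0.2]]     # second data (label 0)
y = [1, 1, 1, 0, 0, 0]
model = train_gbdt(X, y)
print(predict(model, [5, 11, 0.85]), predict(model, [6, 1, 0.05]))
```

A production system would more likely rely on an established gradient boosting library with deeper trees; the stumps here only keep the example short and self-contained.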

S307: Train the labels corresponding to each piece of first data and each piece of second data based on the classification model corresponding to the category to be classified.

Optionally, the first features of the plurality of pieces of first data labeled as belonging to the category to be classified and the second features of the plurality of pieces of second data labeled as not belonging to the category to be classified are respectively used as inputs of the classification model, to obtain the classification labels output by the classification model for the first features and the second features; the labels corresponding to each piece of first data and each piece of second data are then trained according to each first feature and each second feature.
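The label correction in S307 can be sketched as running every labeled piece of data through the classification model and recording where the model's output differs from the original label. The `classify` callable below is a toy stand-in for the trained model, and the sample data and feature layout are hypothetical.

```python
def correct_labels(samples, classify):
    """Return (data, old_label, new_label) triples.

    samples  -- list of (data, features, label); label 1 marks first data,
                label 0 marks second data
    classify -- callable mapping a feature vector to 0 or 1; stands in for
                the trained classification model (assumption)
    """
    return [(data, label, classify(features))
            for data, features, label in samples]


def toy_classify(features):
    # Toy rule used only for this example: many clicks means "in category".
    return 1 if features[1] >= 5 else 0


samples = [
    ("the martian", [11, 12], 1),
    ("weather today", [13, 0], 0),
    ("old cartoon", [11, 9], 0),   # labeled as second data, model disagrees
]
corrected = correct_labels(samples, toy_classify)
changed = [data for data, old, new in corrected if old != new]
print(changed)  # ['old cartoon']
```

Keeping both the old and the new label makes it easy to review only the pieces of data whose labels were changed.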

In this embodiment, training data is constructed, and each piece of data in the training data is labeled based on the category to be classified, to obtain a plurality of pieces of first data labeled as belonging to the category to be classified and a plurality of pieces of second data labeled as not belonging to the category to be classified. Based on the user click behavior log, a feature related to each piece of first data is determined as a first feature, and a feature related to each piece of second data is determined as a second feature, where the first feature and the second feature include literal features and user behavior features. The labels corresponding to each piece of first data and each piece of second data are then trained according to the first features and the second features. In this way, the data of the user click behavior log can be deeply mined, the referenceable data in it can be extracted for analysis, and scenes can be combined in multiple directions, which greatly improves data disambiguation accuracy, reduces the time and cost of data disambiguation, and improves the effect of automatic disambiguation.

Fig. 4 is a schematic structural diagram of a data disambiguation apparatus according to an embodiment of the present invention.

Referring to fig. 4, the apparatus 400 includes: a construction module 401, an annotation module 402, a feature determination module 403, and a training module 404, wherein,

a construction module 401 for constructing training data.

And the labeling module 402 is configured to label each piece of data in the training data based on the category to be classified, so as to obtain a plurality of pieces of first data labeled as belonging to the category to be classified and a plurality of pieces of second data labeled as not belonging to the category to be classified.

A feature determining module 403, configured to determine, based on the user click behavior log, a feature related to each piece of the first data as a first feature, and a feature related to each piece of the second data as a second feature, where the first feature and the second feature include: literal features and user behavior features.

A training module 404, configured to train labels corresponding to each piece of the first data and each piece of the second data according to the first feature and the second feature.

Optionally, in some embodiments, the user behavior feature is a click feature, and the feature determining module 403 is specifically configured to:

respectively determining the length characteristics of each piece of first data and each piece of second data;

performing word segmentation on each piece of first data and each piece of second data respectively to obtain word segmentation results, and taking the length characteristics and the word segmentation results as literal characteristics;

determining category keywords belonging to a category to be classified from a preset category keyword library, and generating a first keyword set according to the category keywords belonging to the category to be classified;

determining category keywords which do not belong to the category to be classified from a preset category keyword library, and generating a second keyword set according to the keywords which do not belong to the category to be classified;

determining a category URL belonging to the category to be classified from the user click behavior log, and generating a first URL set according to the category URL belonging to the category to be classified;

determining a category URL which does not belong to the category to be classified according to the universal URL negative example set, and generating a second URL set according to the category URL which does not belong to the category to be classified;

taking the first keyword set and the first URL set as first related recommendations corresponding to the first data, and taking the second keyword set and the second URL set as second related recommendations corresponding to the second data;

determining, according to the user click behavior log, a first number of times the user clicks the first related recommendations, a first number of times the user searches for the first related recommendations, and a first number of times the titles corresponding to the website URLs in the first URL set contain category keywords in the first keyword set, and taking these first numbers of times as the click feature corresponding to the first data;

determining, according to the user click behavior log, a second number of times the user clicks the second related recommendations, a second number of times the user searches for the second related recommendations, and a second number of times the titles corresponding to the website URLs in the second URL set contain category keywords in the second keyword set, and taking these second numbers of times as the click feature corresponding to the second data;

and taking the literal feature and the click feature of each piece of first data as the corresponding first features, and taking the literal feature and the click feature of each piece of second data as the corresponding second features.
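The way the feature determining module combines the literal feature with the click feature can be sketched as follows. Whitespace splitting is used as a placeholder for word segmentation, which is an assumption: Chinese text would need a real word segmenter. The function names and the dictionary layout are also hypothetical.

```python
def literal_feature(text, segment=str.split):
    """Length feature plus word segmentation result.

    `segment` is a stand-in tokenizer (whitespace split); a real system
    would plug in a proper word segmenter here.
    """
    return {"length": len(text), "tokens": segment(text)}


def build_feature(text, click_feature):
    """Combine the literal feature and the click feature of one piece of data."""
    feature = literal_feature(text)
    feature["click"] = click_feature
    return feature


# Click feature given as (click count, search count, title keyword count).
first_feature = build_feature("the martian", (12, 30, 5))
print(first_feature)
# {'length': 11, 'tokens': ['the', 'martian'], 'click': (12, 30, 5)}
```

The same function would produce a second feature when applied to a piece of second data and its click feature.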

The training module 404 is specifically configured to:

generating a first candidate URL set according to the click characteristics of the first data, and generating a second candidate URL set according to the click characteristics of the second data;

respectively filtering the first candidate URL set and the second candidate URL set according to the universal URL negative example set to obtain a first current URL set and a second current URL set;

respectively screening out URLs of which the click times are greater than or equal to a first preset value from a first current URL set and a second current URL set as a first target URL set and a second target URL set;

judging whether the similarity between the first target URL set and the second target URL set and the candidate URL set mined in the history meets a preset condition or not;

taking URLs meeting preset conditions as a first final URL set and a second final URL set;

taking the first final URL set, the second final URL set and the first characteristic and the second characteristic as the input of a GBDT decision tree algorithm, and taking the output of the algorithm as a classification model corresponding to the category to be classified;

and training labels corresponding to each piece of first data and each piece of second data based on a classification model corresponding to the category to be classified.

The training module 404 is further specifically configured to:

respectively taking first features of a plurality of pieces of first data marked as belonging to a category to be classified and second features of a plurality of pieces of second data marked as not belonging to the category to be classified as input of a classification model to obtain classification labels which are output by the classification model and correspond to the first features and the second features;

and training labels corresponding to each piece of first data and each piece of second data according to each first feature and each second feature.

Optionally, in some embodiments, referring to fig. 5, the apparatus 400 further comprises:

a generating module 405, configured to generate a universal URL negative example set according to the universal website URLs, where a universal URL negative example is: a URL, among the universal website URLs clicked by the user, other than the universal website URLs that match the category to be classified and each piece of the first data and each piece of the second data.
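The generation of the universal URL negative example set can be sketched with set operations: start from the universal website URLs that the user clicked, then remove those that match the category and the data. How "matching" is decided is not specified here, so the `matching_urls` argument is simply assumed to be given; the function name and sample URLs are illustrative.

```python
def universal_negative_set(clicked_urls, universal_urls, matching_urls):
    """Clicked universal website URLs that do not match the category/data.

    clicked_urls   -- URLs the user clicked (from the click behavior log)
    universal_urls -- the universal website URLs
    matching_urls  -- universal website URLs matching the category to be
                      classified and the first/second data (assumed given)
    """
    return (set(clicked_urls) & set(universal_urls)) - set(matching_urls)


clicked_urls = ["video.example/m", "portal.example", "blog.example/p"]
universal_urls = {"video.example/m", "portal.example", "map.example"}
matching_urls = {"video.example/m"}
print(universal_negative_set(clicked_urls, universal_urls, matching_urls))
# {'portal.example'}
```

In this example `blog.example/p` is excluded because it is not a universal website URL, and `map.example` is excluded because the user never clicked it.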

It should be noted that the foregoing explanation of the data disambiguation method in the embodiments of fig. 1 to fig. 3 also applies to the data disambiguation apparatus 400 in this embodiment, and the implementation principle thereof is similar and will not be described herein again.

Embodiments of the present invention also provide a computer device. Referring to fig. 6, the computer device 700 may include one or more of the following components: a processor 701, a memory 702, a power circuit 703, a multimedia component 704, an audio component 705, an input/output (I/O) interface 706, a sensor component 707, and a communication component 708.

A power supply circuit 703 for supplying power to each circuit or device of the computer apparatus; the memory 702 is used to store executable program code; the processor 701 executes a program corresponding to the executable program code by reading the executable program code stored in the memory 702 for performing the steps of:

constructing training data;

labeling each piece of data in the training data based on the category to be classified to obtain a plurality of pieces of first data labeled as belonging to the category to be classified and a plurality of pieces of second data labeled as not belonging to the category to be classified;

determining a feature related to each piece of first data as a first feature and a feature related to each piece of second data as a second feature based on the user click behavior log, the first and second features comprising: literal features and user behavior features;

and training labels corresponding to each piece of first data and each piece of second data according to the first characteristics and the second characteristics.

It should be noted that the foregoing explanations of the data disambiguation method embodiment in fig. 1 to fig. 3 also apply to the computer device 700 of this embodiment, and the implementation principles thereof are similar and will not be described herein again.

To achieve the above embodiments, the present invention also proposes a non-transitory computer-readable storage medium, in which instructions, when executed by a processor of a terminal, enable the terminal to perform a data disambiguation method comprising:

constructing training data;

With the non-transitory computer-readable storage medium in this embodiment, training data is constructed, and each piece of data in the training data is labeled based on the category to be classified, to obtain a plurality of pieces of first data labeled as belonging to the category to be classified and a plurality of pieces of second data labeled as not belonging to the category to be classified. Based on the user click behavior log, a feature related to each piece of first data is determined as a first feature, and a feature related to each piece of second data is determined as a second feature, where the first feature and the second feature include literal features and user behavior features. The labels corresponding to each piece of first data and each piece of second data are then trained according to the first features and the second features. In this way, the data of the user click behavior log can be deeply mined, the referenceable data in it can be extracted for analysis, and scenes can be combined in multiple directions, which greatly improves data disambiguation accuracy, reduces the time and cost of data disambiguation, and improves the effect of automatic disambiguation.

To achieve the above embodiments, the present invention further provides a computer program product, wherein when instructions in the computer program product are executed by a processor, a data disambiguation method is performed, the method comprising:

constructing training data;

With the computer program product in this embodiment, training data is constructed, and each piece of data in the training data is labeled based on the category to be classified, to obtain a plurality of pieces of first data labeled as belonging to the category to be classified and a plurality of pieces of second data labeled as not belonging to the category to be classified. Based on the user click behavior log, a feature related to each piece of first data is determined as a first feature, and a feature related to each piece of second data is determined as a second feature, where the first feature and the second feature include literal features and user behavior features. The labels corresponding to each piece of first data and each piece of second data are then trained according to the first features and the second features. In this way, the data of the user click behavior log can be deeply mined, the referenceable data in it can be extracted for analysis, and scenes can be combined in multiple directions, which greatly improves data disambiguation accuracy, reduces the time and cost of data disambiguation, and improves the effect of automatic disambiguation.

It should be noted that the terms "first," "second," and the like in the description of the present invention are used for descriptive purposes only and are not to be construed as indicating or implying relative importance. In addition, in the description of the present invention, "a plurality" means two or more unless otherwise specified.

Any process or method descriptions in flow charts or otherwise described herein may be understood as representing modules, segments, or portions of code which include one or more executable instructions for implementing specific logical functions or steps of the process, and alternate implementations are included within the scope of the preferred embodiment of the present invention in which functions may be executed out of order from that shown or discussed, including substantially concurrently or in reverse order, depending on the functionality involved, as would be understood by those reasonably skilled in the art of the present invention.

It should be understood that portions of the present invention may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, the various steps or methods may be implemented in software or firmware stored in memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, any one or combination of the following techniques, which are known in the art, may be used: a discrete logic circuit having a logic gate circuit for implementing a logic function on a data signal, an application specific integrated circuit having an appropriate combinational logic gate circuit, a Programmable Gate Array (PGA), a Field Programmable Gate Array (FPGA), or the like.

It will be understood by those skilled in the art that all or part of the steps carried by the method for implementing the above embodiments may be implemented by hardware related to instructions of a program, which may be stored in a computer readable storage medium, and when the program is executed, the program includes one or a combination of the steps of the method embodiments.

In addition, functional units in the embodiments of the present invention may be integrated into one processing module, or each unit may exist alone physically, or two or more units are integrated into one module. The integrated module can be realized in a hardware mode, and can also be realized in a software functional module mode. The integrated module, if implemented in the form of a software functional module and sold or used as a stand-alone product, may also be stored in a computer readable storage medium.

The storage medium mentioned above may be a read-only memory, a magnetic or optical disk, etc.

In the description herein, references to the description of the term "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.

Although embodiments of the present invention have been shown and described above, it is understood that the above embodiments are exemplary and should not be construed as limiting the present invention, and that variations, modifications, substitutions and alterations can be made to the above embodiments by those of ordinary skill in the art within the scope of the present invention.

Claims

1. A method of data disambiguation comprising the steps of:

constructing training data;

labeling each piece of data in the training data based on a category to be classified to obtain a plurality of pieces of first data labeled as belonging to the category to be classified and a plurality of pieces of second data labeled as not belonging to the category to be classified;

determining a feature related to each piece of first data as a first feature and a feature related to each piece of second data as a second feature based on the user click behavior log, the first feature and the second feature comprising: literal features and user behavior features;

training labels corresponding to each piece of first data and each piece of second data according to the first characteristics and the second characteristics;

the training labels corresponding to each piece of first data and each piece of second data according to the first feature and the second feature includes:

respectively filtering the first candidate URL set and the second candidate URL set according to a universal URL negative example set to obtain a first current URL set and a second current URL set;

respectively screening out URLs of which the click times are greater than or equal to a first preset value from the first current URL set and the second current URL set as a first target URL set and a second target URL set;

judging whether the similarity between the first target URL set and the second target URL set and a candidate URL set mined in the history meets a preset condition or not;

taking URLs meeting the preset conditions as a first final URL set and a second final URL set;

taking the first final URL set, the second final URL set and the first characteristic and the second characteristic as input of a GBDT decision tree algorithm, and taking output of the algorithm as a classification model corresponding to the category to be classified;

and training labels corresponding to each piece of first data and each piece of second data based on the classification model corresponding to the category to be classified.

2. The data disambiguation method according to claim 1, wherein said user behavior feature is a click feature, said determining a feature related to each piece of first data as a first feature and a feature related to each piece of second data as a second feature based on a user click behavior log, said first feature and said second feature comprising: literal and user behavioral characteristics, including:

performing word segmentation on each piece of first data and each piece of second data respectively to obtain word segmentation results, and taking the length characteristics and the word segmentation results as the literal characteristics;

determining category keywords belonging to the category to be classified from a preset category keyword library, and generating a first keyword set according to the category keywords belonging to the category to be classified;

determining category keywords which do not belong to the category to be classified from the preset category keyword library, and generating a second keyword set according to the keywords which do not belong to the category to be classified;

determining a category URL belonging to the category to be classified from the user click behavior log, and generating a first URL set according to the category URL belonging to the category to be classified;

determining a category URL which does not belong to the category to be classified according to a universal URL negative example set, and generating a second URL set according to the category URL which does not belong to the category to be classified;

taking the first keyword set and the first URL set as first related recommendations corresponding to the first data, and taking the second keyword set and the second URL set as second related recommendations corresponding to the second data;

determining, according to the user click behavior log, a first number of times the user clicks the first related recommendations, a first number of times the user searches for the first related recommendations, and a first number of times the titles corresponding to the website URLs in the first URL set contain category keywords in the first keyword set, and taking the first numbers of times as the click feature corresponding to the first data;

determining, according to the user click behavior log, a second number of times the user clicks the second related recommendations, a second number of times the user searches for the second related recommendations, and a second number of times the titles corresponding to the website URLs in the second URL set contain category keywords in the second keyword set, and taking the second numbers of times as the click feature corresponding to the second data;

and taking the literal feature and the click feature of each piece of first data as corresponding first features, and taking the literal feature and the click feature of each piece of second data as corresponding second features.

3. The data disambiguation method of claim 1, wherein said training labels corresponding to said each piece of first data and said each piece of second data based on said classification model corresponding to said category to be classified comprises:

respectively taking first features of a plurality of pieces of first data marked as belonging to the category to be classified and second features of a plurality of pieces of second data marked as not belonging to the category to be classified as inputs of the classification model to obtain classification labels which are output by the classification model and correspond to the first features and the second features;

and training labels corresponding to each piece of first data and each piece of second data according to each first characteristic and each second characteristic.

4. The data disambiguation method according to claim 1 or 2, further comprising:

generating the universal URL negative example set according to universal website URLs, wherein a universal URL negative example is: a URL, among the universal website URLs clicked by a user, other than the universal website URLs that match the category to be classified and the first data and the second data.

5. A data disambiguation apparatus, comprising:

a construction module for constructing training data;

the marking module is used for marking each piece of data in the training data based on the category to be classified to obtain a plurality of pieces of first data marked as belonging to the category to be classified and a plurality of pieces of second data marked as not belonging to the category to be classified;

a feature determination module, configured to determine, based on the user click behavior log, a feature related to each piece of first data as a first feature, and a feature related to each piece of second data as a second feature, where the first feature and the second feature include: literal features and user behavior features;

the training module is used for training the labels corresponding to each piece of first data and each piece of second data according to the first characteristics and the second characteristics;

the training module is specifically configured to:

6. The data disambiguation apparatus of claim 5, wherein said user behavior characteristic is a click characteristic, said characteristic determining module being configured to:

7. The data disambiguation apparatus of claim 5, wherein said training module is further specifically configured to:

8. The data disambiguation apparatus of claim 5 or 6, further comprising:

a generating module, configured to generate the universal URL negative example set according to universal website URLs, where a universal URL negative example is: a URL, among the universal website URLs clicked by a user, other than the universal website URLs that match the category to be classified and the first data and the second data.

9. A computer device, comprising one or more of the following components: a processor, a memory, a power circuit, a multimedia component, an audio component, an interface for input and output, a sensor component, and a communication component; wherein, the circuit board is arranged in the space enclosed by the shell, and the processor and the memory are arranged on the circuit board; the power supply circuit is used for supplying power to each circuit or device of the computer equipment; the memory is used for storing executable program codes; the processor executes a program corresponding to the executable program code by reading the executable program code stored in the memory, for performing:

constructing training data;

10. A non-transitory computer readable storage medium having stored thereon a computer program, characterized in that the program, when executed by a processor, implements a data disambiguation method as claimed in any one of claims 1-4.