CN111831949A - Rapid vertical category identification classification method, classification system and classification device - Google Patents

Rapid vertical category identification classification method, classification system and classification device

Info

Publication number
CN111831949A
CN111831949A
Authority
CN
China
Prior art keywords
identifier
class
digital content
target
current
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910322042.XA
Other languages
Chinese (zh)
Other versions
CN111831949B (en)
Inventor
邢智慧
胡元元
王海威
张博
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN201910322042.XA priority Critical patent/CN111831949B/en
Publication of CN111831949A publication Critical patent/CN111831949A/en
Application granted granted Critical
Publication of CN111831949B publication Critical patent/CN111831949B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/958 Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches

Abstract

The invention provides an identifier classification method, belonging to the technical field of internet content identification and classification. The method comprises the following steps: obtaining an annotation sample set, wherein the annotation sample set comprises a vector of known-class digital content of a known-class object and vectors of known-class identifiers of different levels of the known-class digital content; and acquiring a target-class sample set and a target set with identifiers to be classified, and identifying the identifiers to be classified in the target set as belonging to the same target class as the target-class sample set by a transfer learning method in combination with the annotation sample set.

Description

Rapid vertical category identification classification method, classification system and classification device
Technical Field
The present invention relates to the field of internet content identification and classification technology, and in particular, to an identifier classification method, an identifier application method, an identifier classification system, a classification apparatus, and a computer-readable storage medium.
Background
With the popularization and rapid development of the internet, users continuously search online for timely, practical and useful information; their demands on internet information keep broadening, and they expect ever richer content. Traditional web page search is being upgraded to content search. The first generation of large, all-encompassing horizontal websites (also called comprehensive websites) can no longer fully meet user needs, and vertical resources focused on specific fields or providing specific services have become new highlights of the internet, attracting more and more attention.
Vertical resources: information provided on the internet that belongs to a certain specific field, or related service resources that meet a certain specific need, such as commodity content, video content, picture content, medical content, etc.
The significance of vertical resources to the consumer is obvious. With the dramatic growth of internet users and content on the web, the transition from general to specialized sources is natural. The vertical resources are the aggregation of professional resources in a certain industry, and are returned to the user in a certain form after being processed.
At present, vertical resources are mainly indexed (recorded) in the following two ways:
(1) manually designating a number of major large sites and corresponding rules: resources meeting the requirements are first screened out of these sites, and the parts of each site that mainly involve the vertical resources are selected manually; after sampling Uniform Resource Locators (URLs) of different patterns, whether the selected URLs belong to the vertical is labeled manually, and the URLs labeled as belonging to the vertical are generalized into patterns according to the labeling results; the resources corresponding to these patterns are the vertical resources;
(2) building an identification model from scratch and identifying resources of the target type from the resources of the whole network.
In the prior art, CN101520798A discloses a web page classification technique based on vertical search and focused crawling, which mainly extracts URL features and analyzes the web page content of the URL, and determines classification strategies and operations by judging whether the web page satisfies the structural features obtained by periodic learner-based pattern learning of regular expressions; it belongs to whole-network link identification and classified collection based on URL and pattern features.
According to the above, the main disadvantages of the prior art are the high labor cost, the slow speed and the narrow coverage of the vertical resource identification:
(1) high labor cost: sites are screened first, and then the links are labeled by a labeling team; labeling individual links alone requires a large amount of manpower over multiple days; continuously expanding the set of designated sites afterwards still consumes a large amount of labor; and building an identification model from scratch requires a large number of labeled samples, about one hundred thousand samples for a deep learning model with good performance;
(2) slow speed: in the labeling stage, generally only about one thousand links can be labeled per day, so weeks or even months are needed to label and identify the vertical resources even within a small number of designated sites; each time the set of sites is expanded, a long period is needed to finish identifying the vertical resources inside the newly added sites; and when the model is trained from scratch, the large data volume and the large number of model parameters make the labeling and training process time-consuming;
(3) narrow coverage: first, manually designating sites only yields a small number of well-known large sites, and manual work cannot achieve all-round coverage of such rich internet resources; second, on many sites only part of the resources belong to the target vertical, and these resources are often missed by manual screening and sampling. The scheme of training a model from scratch can in theory improve coverage, but it requires many samples and a long period, the initial model has low precision and recall, the coverage of the vertical is not ideal, and the time cost directly prevents such a deep learning scheme from achieving classified collection of different types of verticals.
Therefore, in view of the increasingly strong demand of each product side for vertical resources and in order to further improve the search effect and the user experience, it is necessary to provide a scheme capable of accurately classifying and listing vertical resources, rapidly migrating and identifying classified resources, and accurately covering vertical resources.
Disclosure of Invention
The invention aims to provide a rapid vertical category identification and classification method, classification system and classification device, which can quickly complete transfer learning with only a small amount of vertical resources through a transfer learning model under the currently selected vertical category, so that the content of the currently selected vertical category is indexed and covered more accurately and a search engine can achieve a high recall rate.
In order to achieve the above object, an embodiment of the present invention provides an identifier classification method, including the following steps:
s1), obtaining a set of annotation samples, wherein the set of annotation samples has a vector of known class digital content of a known class object and a vector of known class identifiers of different levels of the known class digital content;
s2) acquiring a target class sample set and an object set with identifiers to be classified, and identifying the identifiers to be classified in the object set as the same target class as the target class sample set by combining the labeled sample set according to a transfer learning method.
Specifically, step S1) includes:
s101) acquiring the digital content of the known class object and the identifier of the digital content;
s102) selecting current digital content and a current identifier of the current digital content, analyzing the current identifier to obtain a hierarchy feature and a structural attribute feature, generating a classifier of the current identifier according to the hierarchy feature and the structural attribute feature, selectively identifying the current digital content as known-class digital content according to the classifier and the class of the known-class object corresponding to the current identifier, and selectively identifying the current identifier as a known-class identifier according to the classifier and the known-class digital content;
s103) after step S102) traversing each identifier, forming a vector of the known digital content and a vector of the known identifier according to the distribution characteristics determined by the category identifier and the hierarchy characteristics of the identifier, and forming a labeling sample set by the vector of the known digital content and the vector of the known identifier.
Specifically, the step S102) further includes, after generating the classifier of the current identifier according to the hierarchy feature and the structural attribute feature and before selectively identifying the current digital content as the known class digital content according to the classifier and the class of the known class object:
when the classifier belongs to the index class, analyzing the text of the current digital content to obtain the characteristic information of the object corresponding to the text, and adding a descriptor into the classifier according to the characteristic information.
Specifically, step S2) includes:
s201) selecting a basic machine learning model, and updating neurons of the basic machine learning model by using the labeled sample set to obtain a current machine learning model;
s202) obtaining a target type sample set, selecting a migration rule according to the distinguishing characteristics of the target type sample set and the labeling sample set, and updating part of neurons of the current machine learning model by using the target type sample set and the labeling sample set according to the migration rule to obtain a migration machine learning model, wherein the target type sample set is provided with a vector of target type digital content of a target classification object and a vector of a target type identifier of the target type digital content, and the distinguishing characteristics comprise the distinguishing characteristics of the vector of the target type digital content and the vector of the known type digital content, and the vector of the target type identifier and the distinguishing characteristics of the vector of the known type identifier;
s203) acquiring a target set with identifiers to be classified, inputting the target set into the migration machine learning model to obtain a prediction result, and identifying the identifiers to be classified corresponding to the prediction result meeting preset conditions in the prediction result as target classes with the same identifiers as the target classes.
Specifically, step S201) further includes:
s211) generating an attribute vector set corresponding to the distribution characteristics of the labeled sample set according to the attributes of the known digital content;
s212) selecting a deep neural network learning model, and updating neurons of the deep neural network learning model by using the attribute vector set;
s213) adding the updated deep neural network learning model neurons with the classifier function into the neurons of the current machine learning model with the classifier function.
Specifically, the step S202) of obtaining the target class sample set includes:
s221) acquiring the digital content of the target class object and the identifier of the digital content;
s222) selecting current digital content and a current identifier of the current digital content, analyzing the current identifier to obtain a hierarchy feature and a structural attribute feature, generating a class identifier of the current identifier according to the hierarchy feature and the structural attribute feature, selectively identifying the current digital content as target class digital content according to the class identifier and the class of the target class object corresponding to the current identifier, and selectively identifying the current identifier as a target class identifier according to the class identifier and the known class digital content;
s223) after step S222) of traversing each identifier, forming a vector of the target class digital content and a vector of the target class identifier according to the distribution characteristics determined by the category identifier and the hierarchy characteristics of the identifier, and forming a target class sample set by the vector of the target class digital content and the vector of the target class identifier.
Specifically, the step S202) of selecting a migration rule according to the distinguishing characteristics of the target type sample set and the labeled sample set includes:
s224) obtaining a first proportion of the vector quantity of the target digital content to the vector quantity of the known digital content, obtaining a first similarity of the vector of the target digital content and the vector of the known digital content, and mapping the first proportion and the first similarity into a first proximity degree, wherein the first proximity degree is a distinguishing characteristic of the vector of the target digital content and the vector of the known digital content;
s225) obtaining a second proportion of the vector quantity of the target class identifier and the vector quantity of the known class identifier, obtaining a second similarity of the vector of the target class identifier and the vector of the known class identifier, and mapping the second proportion and the second similarity into a second proximity, wherein the second proximity is a distinguishing characteristic of the vector of the target class identifier and the vector of the known class identifier;
s226) mapping the first proximity degree and the second proximity degree into a third proximity degree, using the third proximity degree as a distinguishing characteristic of the target class sample set and the labeling sample set, selecting a migration action range and a migration rule with the class characteristic of the target class object according to the third proximity degree, and selecting part of neurons in the current machine learning model neurons according to the migration action range.
Specifically, in step S202), updating a part of neurons of the current machine learning model by using the target class sample set and the labeled sample set according to the migration rule to obtain a migrated machine learning model, including:
s227) when the third closeness degree is larger than a preset threshold value, selecting the migration action range as a single-layer range where a last output layer of the current machine learning model is located, replacing an activated neuron of the last output layer with two classification neurons according to the migration rule, and selecting a random weight value as a weight value of the two classification neurons;
s228) setting the learning rate of neurons in the current machine learning model relative to the remaining layers of the last output layer to be much smaller than the learning rate in the process of obtaining the current machine learning model by using the labeled sample set to update the neurons of the base machine learning model.
Specifically, step S228) includes:
firstly, setting the learning rates of input layer neurons and neurons in the input layer neighborhood in the current machine learning model to be zero;
secondly, the learning rate of the neurons in the current machine learning model relative to the rest of the last output layer except for the neurons in the input layer and the neurons in the neighborhood of the input layer is set to be much smaller than the learning rate in the process of obtaining the current machine learning model by updating the neurons of the basic machine learning model by using the labeled sample set.
The embodiment of the invention provides an identifier application method, which comprises the following steps:
the method of any one of claims 1 to 9 is utilized to obtain the identifier belonging to the target class, and the identifier and the digital content corresponding to the identifier are mapped to generate an index relationship.
The embodiment of the invention provides an identifier classification system, which comprises:
the migration recognition engine is used for receiving the labeling sample data set, receiving a target class sample data set and receiving a target data set with identifiers to be classified, and is also used for recognizing the identifiers to be classified in the target data set into a target class the same as the target class sample data set according to a migration learning method by combining the labeling sample data set;
wherein the annotation sample data set has a vector of known class digital content data of a known class object and a vector of known class identifiers of different levels of the known class digital content data.
Optionally, the migration recognition engine includes:
an identification engine having a base machine learning model for receiving a set of labeled sample data, updating the base machine learning model from the set of labeled sample data to a current machine learning model, and receiving a set of target type sample data;
the control engine is used for receiving an identification request signal sent by the identification engine, calculating the distinguishing characteristics of the labeling sample data set and the target sample data set according to the identification request signal, generating a migration signal according to the distinguishing characteristics, and returning the migration signal to the identification engine;
the recognition engine is further used for calculating the labeling sample data set and the target sample data set according to the migration signal and updating a current machine learning model into a migration machine learning model at the same time;
the recognition engine is further used for receiving a target data set with identifiers to be classified, calculating the target data set through the migration machine learning model, and determining identifiers belonging to a target class in the target data set through preset conditions;
wherein the annotation sample dataset has a vector of known class digital content data of a known class object and a vector of known class identifiers of different levels of the known class digital content data;
wherein the target class sample data set has a vector of target class digital content data of a target classification object and a vector of a target class identifier of the target class digital content data;
wherein the identification request signal has sample data information of the annotation sample data set and the target sample data set;
wherein the distinguishing features comprise distinguishing features of the vector of the target class digital content data and the vector of the known class digital content data, and distinguishing features of the vector of the target class identifier and the vector of the known class identifier;
wherein the migration signal has migration rule information.
Optionally, the migration recognition engine further includes:
a parsing engine;
an annotation system, which is provided with an annotation cycle and is used for receiving the digital content data of the current class object and the identifier of the digital content data at the beginning of the annotation cycle;
the annotation system is further configured to select and send current digital content data and a current identifier of the current digital content data to the parsing engine;
the analysis engine is also used for analyzing the current identifier to generate and send a characteristic signal with the hierarchy characteristic information and the structure attribute characteristic information of the current identifier to the labeling system;
the annotation system is further configured to generate a category identifier of the current identifier from the characteristic information, selectively identify the current digital content data as a current category digital content data from the category identifier and a category of the current category object corresponding to the current identifier, and selectively identify the current identifier as a current category identifier from the category identifier and the current category digital content data;
wherein, after traversing each identifier, the annotation system is further configured to generate a vector of the current-class digital content data and a vector of the current-class identifier from the distribution features determined by the category identifier and the hierarchy features of the identifier, constitute the vector of the current-class digital content data and the vector of the current-class identifier into a current-class sample set, and end the annotation cycle;
the identification engine is also used for generating and sending a request signal of the labeling sample data set to the labeling system;
the marking system is also used for calling and sending a marking sample data set to the identification engine by the marking sample data set request signal;
the identification engine is also used for generating and sending a target sample data set request signal to the labeling system;
the marking system is also used for updating the current class object to be the target class object by the target class sample data set request signal, generating a target class sample data set through the marking cycle, and sending the target class sample data set to the identification engine.
Optionally, the tagging system is further configured to, when it determines that the category identifier belongs to the index class, generate and send a content parsing request to the parsing engine;
the analysis engine is also used for calculating the current digital content data according to the content analysis request, generating and sending a compensation characteristic signal with object characteristic information corresponding to texts in the current digital content data to the labeling system;
the labeling system is also configured to generate a descriptor from the compensated feature signal and add the descriptor to the classifier in the labeling loop.
Optionally, the migration recognition engine further includes:
a compensation identification engine having a deep neural network learning model for receiving a set of attribute vectors generated by the control engine by the attributes of the known class digital content data corresponding to the distribution characteristics of the set of labeled samples, the deep neural network learning model being updated by the set of attribute vectors;
wherein the control engine transmits the updated deep neural network learning model neurons with classifier functions and the weight parameter sets of the neurons to the neurons with classifier functions in the current machine learning model.
Optionally, the migration recognition engine further includes:
the window engine is used for setting learning rate and a migration action range, receiving the migration signal and generating a neuron replacement signal and a weight updating signal by the migration signal;
a weight generator for generating weight values for updating neurons;
wherein the window engine is further configured to replace an activated neuron of a last output layer of the current machine learning model with a binary neuron by a neuron replacement signal when the distinguishing feature calculated by the control engine is greater than a preset threshold, and to take a random weight value for a weight value of the binary neuron by a weight generator by a weight update signal;
wherein the control engine is further configured to set, via the window engine, a learning rate of neurons in a remaining layer in the current machine learning model relative to the last output layer that is substantially less than a learning rate in updating the base machine learning model to the current machine learning model from the set of annotation sample data.
Optionally, the control engine is further configured to set, by the window engine, learning rates of input layer neurons and neurons in the input layer neighborhood in the current machine learning model to be zero;
the control engine is further configured to set, by the window engine, a learning rate of neurons in a remaining layer in the current machine learning model, other than the input layer neurons and neurons in a neighborhood of the input layer, relative to the last output layer, that is much smaller than a learning rate in updating the base machine learning model to the current machine learning model from the set of labeled sample data.
In another aspect, an embodiment of the present invention provides a classification apparatus, including:
at least one processor;
a memory coupled to the at least one processor;
wherein the memory stores instructions executable by the at least one processor, the at least one processor implements the aforementioned method by executing the instructions stored by the memory.
In yet another aspect, an embodiment of the present invention provides a computer-readable storage medium storing computer instructions, which, when executed on a computer, cause the computer to perform the foregoing method.
Compared with the prior art, the invention has the beneficial effects that:
transferring the learning model through hierarchically classified known-class samples and hierarchically classified target-class samples, thereby realizing identification and classification of vertical resources of the target class;
the method realizes that the known resources are labeled as known samples classified hierarchically;
distinguishing different index categories in the index categories, such as a user index category or a company index category;
attributes of the digital content, namely short attributes such as the page value and the page length, are added through the deep neural network learning model, which can remarkably improve the discrimination of the identification process;
the target resources are labeled as target samples classified in a layered manner;
the difference between the known class sample and the target class sample is quantified, so that the migration degree of the current machine learning model can be determined;
when the known class sample and the target class sample are similar, the rapid identification classification is realized by replacing the neuron of the final output layer with the neuron which is more related to the target vertical class, such as two classification neurons;
when the known class sample is similar to the target class sample, setting the learning rate of an input layer for feature extraction to zero and taking a relatively smaller learning rate for neurons of a hidden layer close to the middle to fully utilize the weight value determined by the known class sample in the machine learning model;
the method has the unique function of rapidly migrating to the target vertical category for identification and classification, and the classification results are indexed in sequence, so that the indexed results have high precision and recall.
Additional features and advantages of embodiments of the invention will be set forth in the detailed description which follows.
Drawings
The accompanying drawings, which are included to provide a further understanding of the embodiments of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the embodiments of the invention without limiting the embodiments of the invention. In the drawings:
FIG. 1 is a system architecture diagram according to an embodiment of the present invention;
FIG. 2 is a block diagram of a system architecture with main data processing according to an embodiment of the present invention.
Detailed Description
The following detailed description of embodiments of the invention refers to the accompanying drawings. It should be understood that the detailed description and specific examples, while indicating embodiments of the invention, are given by way of illustration and explanation only, not limitation.
Example 1
The embodiment of the invention provides an identifier classification method, which comprises the following steps:
firstly, obtaining an annotation sample set, wherein the annotation sample set comprises a vector of known class digital content of a known class object and vectors of known class identifiers of different levels of the known class digital content; secondly, acquiring a target type sample set and a target set with identifiers to be classified, and identifying the identifiers to be classified in the target set as target types the same as the target type sample set by combining the labeled sample set according to a transfer learning method.
The method further specifically comprises the following steps:
s1), obtaining a set of annotation samples, wherein the set of annotation samples has a vector of known class digital content of a known class object and a vector of known class identifiers of different levels of the known class digital content;
one object may have many digital contents, the digital contents may have many identifiers, and different objects may have the same digital contents (cloned contents, such as a website realized by nginx reverse proxy or a website realized by code cloning), but different digital contents cannot be realized by the same identifier at the same time;
the digital content represents all digital information of an object published on the internet and various carriers of the digital information, including carriers of various digital information and digital information such as network addresses, domain names, routing services, physical servers, websites and contents provided by the websites;
the identifiers include links, pages, network addresses, and other characteristic marks that can be used to locate a certain digital content; identifiers usually have a hierarchical relationship. For example, the link of the official website of the China Meteorological Administration is http://www.cma.gov.cn, and the page pointed to by this link is the home page, whose link is at the first level (i.e., the first layer); the link of the weather forecast page lies two levels deeper under the same domain and is therefore at the third level, i.e., the digital content includes a weather forecast page whose identifier is a third-level link. Generally, each additional level of the hierarchy is reflected by one more "/" in the link.
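As a minimal sketch of this hierarchy rule (assuming Python; the weather-page path below is illustrative only and not taken from the patent, and the helper name link_level is ours):

```python
from urllib.parse import urlparse

def link_level(url: str) -> int:
    """Hierarchy level of a link: the home page counts as level 1,
    and each non-empty path segment (one more "/") adds one level."""
    path = urlparse(url).path
    segments = [s for s in path.split("/") if s]
    return 1 + len(segments)

print(link_level("http://www.cma.gov.cn"))                     # 1 -> home page (first level)
print(link_level("http://www.cma.gov.cn/forecast/city.html"))  # 3 -> third-level link
```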
An object is not inherently a known-class (original classification) object or a target-class (vertical class) object; in machine learning the known samples must be rich in classes so that a classifier can be trained and implemented, so the known-class objects are the objects corresponding to a large data set of samples collected over a long period, while the target-class objects are objects of certain classes temporarily collected within the selected target-class range, and the objects, the hierarchically classified digital content and the hierarchically classified identifiers in the large data set samples belong to certain known classes, or at least in part to certain known classes;
in the example of indexing sites and site content, the digital content may be a server and the page content provided on the server, the attributes of the digital content may be the page value and the page length (also referred to as short attributes), and the identifier may be a link, the link form, the link hierarchy, link attributes and page structure attributes;
the link form of the identifier carries a lot of information, such as a site to which the identifier belongs, link depth, and a page type (content page or index page), and the identifier also often has strong features in the identification of many vertical resources, for example, a video page link has a play or video field, a commodity page has fields such as item, buy, and the like, and a url is cut into words according to a separator "/", and can be used as a first dimension feature of a vector;
the link content of the identifier can also assist page classification; three secondary dimensions of information can be extracted from a link: whether the extracted link and the original link belong to the same site, the suffix of the link (such as html, jpg, etc.), which reflects the link form, and the source (anchor) text of the link, which reflects the link content information;
the page structure attribute information of the identifier, i.e., the structural characteristics of the web page, can be used as auxiliary content information for more detailed classification of the page; it makes the form of the page clear, for example whether it is an index page or a content page, and which different sections the page contains; the tag and class information in the html source code is extracted as in-page features;
the page content is the most distinctive feature for discriminating the field of a web page; after extraction it is subjected to word segmentation and can be used as the second-dimension feature of the vector.
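A sketch of how these two feature dimensions could be assembled, assuming Python with only the standard library; the crude word tokenization below stands in for real word segmentation, and all function names are illustrative:

```python
import re
from html.parser import HTMLParser
from urllib.parse import urlparse

def url_features(url: str) -> dict:
    """First-dimension features: site, link depth, path tokens and suffix."""
    parsed = urlparse(url)
    tokens = [t for t in parsed.path.split("/") if t]
    suffix = tokens[-1].rsplit(".", 1)[-1] if tokens and "." in tokens[-1] else ""
    return {"site": parsed.netloc, "depth": len(tokens), "tokens": tokens, "suffix": suffix}

class TagClassCollector(HTMLParser):
    """Collects tag names and class attributes as page-structure features."""
    def __init__(self):
        super().__init__()
        self.tags, self.classes = [], []
    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)
        for name, value in attrs:
            if name == "class" and value:
                self.classes.extend(value.split())

def page_features(html: str) -> dict:
    """Second-dimension features: structure (tags/classes) plus segmented text."""
    collector = TagClassCollector()
    collector.feed(html)
    text = re.sub(r"<[^>]+>", " ", html)
    words = re.findall(r"\w+", text)  # crude stand-in for real word segmentation
    return {"tags": collector.tags, "classes": collector.classes, "words": words}
```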
S2) selecting a basic machine learning model, and then updating neurons of the basic machine learning model by using the labeled sample set to obtain a current machine learning model;
the base machine learning model may be a text convolutional neural network learning model with random weights (textcnn);
the current machine learning model can be a text convolution neural network learning model trained by a labeled sample set, the process can be called pre-training of original classified samples based on hierarchical classification, and the pre-training can adopt a large number of labeled samples of page values or types (content pages, user pages, platform pages, home pages, common index pages and the like), for example, the pre-training is carried out by adopting the samples of the page types;
the input information comprises the information of the two dimensions, the text convolutional neural network learning model is used for effectively extracting the information of the two dimensions, partial layers of the convolutional layer are mainly used for extracting general information, and after the information related to the structure and the content is extracted, the information is combined by the full connection layer to output a result more related to a specific classification task;
vertical data is input into the pre-trained model with the model parameters unchanged, and a page type label is finally obtained;
the text convolutional neural network learning model trains quickly and can effectively extract basic features related to page structure and content from a text sequence; the fully connected layer and the normalized exponential function (softmax) network layer in the later layers are aimed at a specific scenario. It is found that the feature extraction layers are largely independent of the data set while the classifier layers depend on the data set, so the selected model has the capability of extracting both shallow basic features and deep abstract features.
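A minimal sketch of such a base model, written in PyTorch as an assumption (the patent does not fix a framework, vocabulary size, layer widths or the exact class set; the five page types listed above are used only as an example):

```python
import torch
import torch.nn as nn

class TextCNN(nn.Module):
    """Text convolutional network: embedding -> parallel convolutions -> fully connected classifier."""
    def __init__(self, vocab_size=50000, embed_dim=128,
                 kernel_sizes=(2, 3, 4), channels=64, num_classes=5):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.convs = nn.ModuleList(
            [nn.Conv1d(embed_dim, channels, k) for k in kernel_sizes])
        self.fc = nn.Linear(channels * len(kernel_sizes), num_classes)

    def forward(self, token_ids):                        # token_ids: (batch, seq_len)
        x = self.embedding(token_ids).transpose(1, 2)    # (batch, embed_dim, seq_len)
        pooled = [torch.relu(conv(x)).max(dim=2).values for conv in self.convs]
        features = torch.cat(pooled, dim=1)              # shallow structure/content features
        return self.fc(features)                         # softmax is applied inside the loss

# Pre-training on page-type labels (content, user, platform, home, index pages -> 5 classes).
model = TextCNN(num_classes=5)
loss_fn = nn.CrossEntropyLoss()                          # applies log-softmax internally
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
```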
S3) acquiring a target sample set, selecting a migration rule according to the distinguishing characteristics of the target sample set and the labeling sample set, and updating part of neurons of the current machine learning model by using the target sample set and the labeling sample set according to the migration rule to obtain a migration machine learning model, wherein the target sample set has a vector of target digital content of a target classification object and a vector of a target identifier of the target digital content, and the distinguishing characteristics comprise the distinguishing characteristics of the vector of the target digital content and the vector of the known digital content, and the distinguishing characteristics of the vector of the target identifier and the vector of the known identifier;
migration rules may include sample cross-migration rules, mapping-space migration rules, network attribute inheritance migration rules and anti-migration rules, and may be selected according to the categories of the objects, digital contents and identifiers involved and the distinguishing features described here; for example, the mapping-space migration rules and the attribute-inheritance migration rules may be combined as the migration rules used here. Links and contents can each be represented by an ordered string; converting the links and the digital contents into strings with hash-code features can be realized by an encoding or hash algorithm (for example, BASE64, SHA1, MD5, etc.), and classification descriptors and hierarchy descriptors may further be set; after sorting, the character expressions of the hierarchically classified links and contents can again be obtained with such an algorithm. For example, with $A=(1,0,0)^{\mathsf T}$, $B=(0,1,0)^{\mathsf T}$ and $C=(0,1,1)^{\mathsf T}$, the vector of the string ABC is

$$V_{ABC}=\begin{pmatrix}1&0&0\\0&1&0\\0&1&1\end{pmatrix},$$

whose rows are the character vectors. The rows of this matrix are then mapped into a reference vector space by a word-embedding method, which realizes dimension reduction; for example, embedding into the XOY plane maps each three-dimensional row to a two-dimensional point, so the representation of the ABC string is updated to a 3×2 form.
S4) acquiring a target set with identifiers to be classified, inputting the target set into the migration machine learning model to obtain a prediction result, and identifying the identifiers to be classified corresponding to the prediction result meeting preset conditions in the prediction result as target classes the same as the identifiers of the target classes;
the output of the normalized exponential function (softmax) is the probability used to identify the class to which the identifier belongs; this probability can be discriminated to a certain extent through a preset condition, a certain constraint relation and a threshold value, and the identifiers to be classified that meet the requirement are then classified into the target class.
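A sketch of one such preset condition, assuming the prediction for each identifier is a softmax probability vector and a simple threshold on the target-class probability is used (the threshold value is illustrative):

```python
import numpy as np

def select_target_identifiers(identifiers, probabilities, target_index, threshold=0.8):
    """Keep the identifiers whose predicted probability for the target class
    exceeds the preset threshold."""
    selected = []
    for ident, probs in zip(identifiers, probabilities):
        if probs[target_index] >= threshold:
            selected.append(ident)
    return selected

# probabilities: softmax outputs of the migrated model, one row per identifier to classify.
probs = np.array([[0.05, 0.95], [0.60, 0.40]])
print(select_target_identifiers(["/video/123", "/news/456"], probs, target_index=1))
```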
Specifically, step S1) includes:
s101) acquiring the digital content of the known class object and the identifier of the digital content;
s102) selecting current digital content and a current identifier of the current digital content, analyzing the current identifier to obtain a hierarchy feature and a structural attribute feature, generating a classifier of the current identifier according to the hierarchy feature and the structural attribute feature, selectively identifying the current digital content as known-class digital content according to the classifier and the class of the known-class object corresponding to the current identifier, and selectively identifying the current identifier as a known-class identifier according to the classifier and the known-class digital content;
s103) after each identifier is traversed in the step S102), forming a vector of the known digital content and a vector of the known identifier according to the distribution characteristics determined by the category identifier and the hierarchy characteristics of the identifier, and enabling the vector of the known digital content and the vector of the known identifier to form a labeling sample set;
in the process of labeling a link, the content corresponding to the link can be determined to be the home page or another page through the hierarchy features, and among the other pages the content corresponding to the link can be further determined to be an index page (a user index page or a platform index page) or a content page, etc., according to the structural attribute features.
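A sketch of this two-stage labeling decision (assumptions: the hierarchy feature is reduced to the link depth and the structural attribute feature to two simple counts; the thresholds and feature names are illustrative and not fixed by the patent):

```python
def label_link(depth: int, num_list_blocks: int, text_length: int) -> str:
    """First decide home page vs. other page from the hierarchy feature,
    then refine other pages into index page vs. content page from structure."""
    if depth <= 1:
        return "home page"
    # Structural attribute heuristic: many repeated list blocks and little
    # running text suggest an index page, otherwise a content page.
    if num_list_blocks >= 10 and text_length < 500:
        return "index page"
    return "content page"

print(label_link(depth=1, num_list_blocks=0, text_length=0))     # home page
print(label_link(depth=2, num_list_blocks=25, text_length=200))  # index page
print(label_link(depth=3, num_list_blocks=2, text_length=3000))  # content page
```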
Specifically, the step S102) further includes, after generating the classifier of the current identifier according to the hierarchy feature and the structural attribute feature and before selectively identifying the current digital content as the known class digital content according to the classifier and the class of the known class object:
when the classifier belongs to an index class, analyzing the text of the current digital content to obtain the characteristic information of an object corresponding to the text, and adding a descriptor into the classifier according to the characteristic information;
in the process of further labeling the link, it can be determined that the content corresponding to the link is a user index page or a platform index page in the index page through text recognition of the current digital content, for example, the user page may have information such as a name, and the platform page may have information such as a company name.
Specifically, step S2) further includes:
s201) generating an attribute vector set corresponding to the distribution characteristics of the labeled sample set according to the attributes of the known digital content;
s202) selecting a deep neural network learning model (dnn), and updating neurons of the deep neural network learning model by using the attribute vector set;
s203) adding the updated deep neural network learning model neurons with the classifier function into the neurons of the current machine learning model with the classifier function;
the attributes here are short attributes: during web page crawling, short attributes are calculated for the web pages corresponding to different links through a basic scoring model, yielding for example the page length and the page value of the web page corresponding to a link; the short attributes assist in identifying the page type, can be used as compensation features, and remarkably improve the discrimination;
generally, the $i$-th neuron of the $l$-th fully connected layer, $h_i^l$, has the following form:

$$h_i^l = a^l\!\left(W^l h^{l-1} + b^l\right)_i$$

where $a^l$ is the activation function of layer $l$, $W^l$ is the weight matrix of layer $l$, and $b^l$ is the bias parameter of layer $l$.

Since the word vector of the character string corresponding to a link or digital content (the mapping vector obtained by embedding the original vector into a reference plane or a space of relatively lower dimension) is a binary expression, a binary classification label $\mathrm{Cat}_{ij}$ can be defined, with $\mathrm{Cat}_{ij} \in \{0, 1\}$ and $i$ or $j$ being the number of the $i$-th or $j$-th neuron in the current layer. The activation function is chosen as a two-class activation function such as the Sigmoid function ($\mathrm{Sig}$); the inner product of the vectors of the $i$-th and $j$-th neurons, $\Theta_{ij} = h_i^{\mathsf T} h_j$, is obtained and mapped to a characteristic Hamming distance $H_{ij}$. Then, through the binary classification label $\mathrm{Cat}_{ij}$, a prediction result is obtained with respect to the conditional probability of the characteristic Hamming distance between the current neuron and the other selected neurons; this conditional probability, written as $P(\mathrm{Cat}_{ij} \mid H_{ij})$, i.e., the probability expressing similarity, is as follows:

$$P(\mathrm{Cat}_{ij} \mid H_{ij}) = \begin{cases}\mathrm{Sig}(\Theta_{ij}), & \mathrm{Cat}_{ij} = 1\\ 1 - \mathrm{Sig}(\Theta_{ij}), & \mathrm{Cat}_{ij} = 0\end{cases}$$

A similarity threshold can be set via $\mathrm{Cat}_{ij} = 1$, or dissimilarity can be set via $\mathrm{Cat}_{ij} = 0$; when the conditional probability is higher than a preset threshold, the probability value of the short attribute can be used, or each conditional probability can be combined with the short-attribute probability value into a joint probability or a new conditional probability to realize further discrimination.
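A numeric sketch of this similarity probability and its combination with a short-attribute probability (assumptions: the codes are {0,1} vectors, the sigmoid of the inner product stands for the conditional probability, and the simple product used for the joint score is only one possible combination):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def similarity_probability(h_i, h_j):
    """P(Cat_ij = 1) expressed through the sigmoid of the inner product."""
    theta = float(np.dot(h_i, h_j))
    return sigmoid(theta)

def hamming_distance(h_i, h_j):
    return int(np.sum(h_i != h_j))

h_i = np.array([1, 0, 1, 1, 0])
h_j = np.array([1, 0, 1, 0, 0])

p_sim = similarity_probability(h_i, h_j)
print("Hamming distance:", hamming_distance(h_i, h_j))
print("similarity probability:", round(p_sim, 3))

# Combine with a short-attribute probability (e.g., from page value / page length)
# when the conditional probability exceeds a preset threshold.
p_short = 0.7
threshold = 0.5
if p_sim > threshold:
    joint = p_sim * p_short          # one simple way to form a joint score
    print("joint score:", round(joint, 3))
```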
Specifically, the step S3) of obtaining the target class sample set includes:
s301) acquiring the digital content of the target class object and the identifier of the digital content;
s302) selecting current digital content and a current identifier of the current digital content, analyzing the current identifier to obtain a hierarchy feature and a structural attribute feature, generating a category identifier of the current identifier according to the hierarchy feature and the structural attribute feature, selectively identifying the current digital content as target category digital content according to the category identifier and the category of the target category object corresponding to the current identifier, and selectively identifying the current identifier as a target category identifier according to the category identifier and the known category digital content;
s303) after step S302) of traversing each identifier, forming a vector of the target class digital content and a vector of the target class identifier according to the distribution characteristics determined by the category identifier and the hierarchy characteristics of the identifier, and forming a target class sample set by the vector of the target class digital content and the vector of the target class identifier.
Specifically, the step S3) of selecting a migration rule according to the distinguishing characteristics of the target class sample set and the labeled sample set includes:
s304) obtaining a first proportion of the vector quantity of the target digital content to the vector quantity of the known digital content, obtaining a first similarity of the vector of the target digital content and the vector of the known digital content, and mapping the first proportion and the first similarity into a first proximity degree, wherein the first proximity degree is a distinguishing characteristic of the vector of the target digital content and the vector of the known digital content;
s305) obtaining a second proportion of the vector quantity of the target class identifier and the vector quantity of the known class identifier, obtaining a second similarity of the vector of the target class identifier and the vector of the known class identifier, and mapping the second proportion and the second similarity into a second proximity, wherein the second proximity is a distinguishing characteristic of the vector of the target class identifier and the vector of the known class identifier;
s306) mapping the first proximity and the second proximity to a third proximity, using the third proximity as a distinguishing feature of the target class sample set and the labeling sample set, selecting a migration action range and a migration rule with the class feature of the target class object according to the third proximity, and selecting a part of neurons in the current machine learning model neurons according to the migration action range.
Specifically, the step S3) of updating a part of neurons of the current machine learning model by using the target class sample set and the labeled sample set according to the migration rule to obtain a migrated machine learning model includes:
s307) when the third closeness degree is larger than a preset threshold value, selecting the migration action range as a single-layer range where a last output layer of the current machine learning model is located, replacing an activated neuron of the last output layer with two classification neurons according to the migration rule, and selecting a random weight value as a weight value of the two classification neurons;
s308) setting the learning rate of the neurons in the current machine learning model relative to the rest layer of the last output layer to be far less than the learning rate in the process of obtaining the current machine learning model by utilizing the labeled sample set to update the neurons of the basic machine learning model;
the migration process is as follows: the model structure is unchanged; the trained parameters of the last softmax layer are removed and replaced with randomly generated parameters (without pre-training, all parameters would be randomly generated once the model structure is determined); the parameters of the embedding layer, the textcnn layers and the dnn layers are frozen, so that they do not change during the post-migration training; and the parameters of the fully connected layer take the pre-trained parameters as initial values but are not frozen, so their values change during training. After the model structure has been migrated, migration training is carried out; during migration training the learning rate takes effect through the loss function in back-propagation, the unfrozen parameters are trained, and the model is trained into a recognition and classification model for the target class (vertical class).
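A PyTorch sketch of this migration step, applied to the TextCNN sketch above (assumptions: the model exposes embedding, convs and fc attributes; in this simplified model the fully connected layer and the softmax-facing head coincide, and the frozen dnn compensation layers of the patent are omitted):

```python
import torch.nn as nn

def migrate_for_vertical(pretrained_model, num_target_classes=2):
    """Freeze the feature-extraction layers and re-initialize the output head."""
    # Freeze embedding and convolutional (textcnn) parameters.
    for module in (pretrained_model.embedding, pretrained_model.convs):
        for param in module.parameters():
            param.requires_grad = False

    # Replace the final (softmax-facing) layer with a randomly initialized
    # binary classification head for the target vertical class.
    in_features = pretrained_model.fc.in_features
    pretrained_model.fc = nn.Linear(in_features, num_target_classes)
    return pretrained_model

# model = migrate_for_vertical(pretrained_textcnn)   # then train only the unfrozen layers
```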
Depending on the degree of similarity between the selected vertical class and the original classification data set and on the amount of samples available, transfer learning may cut off only the last softmax layer or may cut off more layers: the more similar the data sets, the less needs to be cut off, and the more samples there are, the more can be cut off; in the extreme case, with enough samples the model can be trained from scratch, whereas when the data sets are similar a large amount of data is not needed, so the first proportion and the second proportion are smaller. The similarity of the data sets therefore needs to be judged, and it can be embodied through the vector similarity features of the data sets, such as the Euclidean distance, Manhattan distance, Chebyshev distance, Minkowski distance, normalized Euclidean distance, Mahalanobis distance, included-angle cosine distance, Jaccard similarity coefficient, information entropy, and so on.
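One possible way to quantify such a proximity, sketched under the assumption that each sample set is summarized by its mean vector and that the count ratio and the cosine similarity are simply averaged into a single score (the mapping is illustrative, not prescribed by the patent):

```python
import numpy as np

def proximity(target_vectors: np.ndarray, known_vectors: np.ndarray) -> float:
    """Map the count ratio and the cosine similarity of the mean vectors
    to a single proximity score in [0, 1]."""
    ratio = min(len(target_vectors), len(known_vectors)) / max(len(target_vectors), len(known_vectors))
    mu_t, mu_k = target_vectors.mean(axis=0), known_vectors.mean(axis=0)
    cosine = float(np.dot(mu_t, mu_k) / (np.linalg.norm(mu_t) * np.linalg.norm(mu_k) + 1e-12))
    similarity = 0.5 * (cosine + 1.0)          # rescale cosine to [0, 1]
    return 0.5 * ratio + 0.5 * similarity      # simple average as the proximity score

rng = np.random.default_rng(0)
known = rng.normal(size=(1000, 16))            # vectors of the annotation sample set
target = rng.normal(loc=0.1, size=(80, 16))    # vectors of the target-class sample set
print(round(proximity(target, known), 3))      # compare against a preset threshold
```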
Specifically, step S308) includes:
s381) setting learning rates of input layer neurons and neurons in the input layer neighborhood in the current machine learning model to zero;
s382) setting a learning rate in the current machine learning model for neurons in a remaining layer other than the input layer neurons and neurons in the input layer neighborhood relative to the last output layer to be much smaller than a learning rate in obtaining the current machine learning model by updating neurons of the base machine learning model with the set of labeled samples;
a smaller learning rate is used to train the network: because the pre-trained weights are much better than randomly initialized weights, it is undesirable to distort them too quickly or too much, so the initial learning rate here is made 10 times smaller than the initial learning rate used when training from scratch;
the weights of the first few layers of the pre-trained network are frozen because they capture generic features of the page type and it is desirable to keep them intact; the later layers instead concentrate on learning the special features of the data set, a process that can be referred to as fine-tuning.
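A PyTorch sketch of these learning-rate settings on the same TextCNN-style model (assumptions: 1e-3 stands in for the from-scratch learning rate, the 10-fold reduction above is applied to the middle layers, and a zero learning rate on the embedding/input side keeps those weights intact):

```python
import torch

def make_finetune_optimizer(model, base_lr: float = 1e-3) -> torch.optim.Optimizer:
    """Per-layer learning rates for fine-tuning after migration."""
    return torch.optim.Adam([
        # Input/embedding layer and its neighborhood: learning rate zero (weights kept intact).
        {"params": model.embedding.parameters(), "lr": 0.0},
        # Middle convolutional layers: much smaller than the from-scratch rate.
        {"params": model.convs.parameters(), "lr": base_lr / 10},
        # Replaced binary output head: the full learning rate.
        {"params": model.fc.parameters(), "lr": base_lr},
    ])

# optimizer = make_finetune_optimizer(migrated_model)
```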
Completing transfer learning in this way means that few parameters need to be learned, few vertical samples are required, and training is greatly accelerated; through pre-training, the model already has the capability of extracting the basic features of a page, and by fine-tuning on a small number of samples for the specific vertical requirement the model can achieve a good effect on the new classification problem.
The invention can identify vertical resources efficiently, quickly and comprehensively; the samples used when pre-training the model can inherit the labeled samples of the basic scoring model, so no large amount of extra labeling manpower is needed; and the pre-trained model has good generality and can be loaded as the original classification model when indexing different vertical types.
Although the embodiments of the present invention have been described in detail with reference to the accompanying drawings, the embodiments of the present invention are not limited to the details of the above embodiments, and various simple modifications can be made to the technical solutions of the embodiments of the present invention within the technical idea of the embodiments of the present invention, and the simple modifications all belong to the protection scope of the embodiments of the present invention.
It should be noted that the various features described in the above embodiments may be combined in any suitable manner without departing from the scope of the invention. In order to avoid unnecessary repetition, the embodiments of the present invention do not describe every possible combination.
Those skilled in the art will understand that all or part of the steps in the method according to the above embodiments may be implemented by a program, which is stored in a storage medium and includes several instructions to enable a single-chip microcomputer, a chip, or a processor to execute all or part of the steps in the method according to the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk.
In addition, any combination of various different implementation manners of the embodiments of the present invention is also possible, and the embodiments of the present invention should be considered as disclosed in the embodiments of the present invention as long as the combination does not depart from the spirit of the embodiments of the present invention.

Claims (19)

1. A method of identifier classification, characterized in that the method comprises the steps of:
S1) obtaining a labeled sample set, wherein the labeled sample set has a vector of known class digital content of a known class object and a vector of known class identifiers of different levels of the known class digital content;
S2) acquiring a target class sample set and an object set with identifiers to be classified, and identifying the identifiers to be classified in the object set as the same target class as the target class sample set by combining the labeled sample set according to a transfer learning method.
2. The identifier classification method according to claim 1, characterized in that step S1) includes:
S101) acquiring the digital content of the known class object and the identifier of the digital content;
S102) selecting current digital content and a current identifier of the current digital content, analyzing the current identifier to obtain a hierarchy feature and a structural attribute feature, generating a category identifier of the current identifier according to the hierarchy feature and the structural attribute feature, selectively identifying the current digital content as known class digital content according to the category identifier and the class of the known class object corresponding to the current identifier, and selectively identifying the current identifier as a known class identifier according to the category identifier and the known class digital content;
S103) after traversing each identifier through step S102), forming a vector of the known class digital content and a vector of the known class identifier according to the distribution features determined by the category identifier and the hierarchy features of the identifier, and forming the vector of the known class digital content and the vector of the known class identifier into a labeled sample set.
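For illustration only, the hierarchy feature and structural attribute feature of claim 2 might be extracted from a URL-like identifier as follows; the concrete feature set shown here is an assumption, not the patented definition:

from urllib.parse import urlparse

def identifier_features(identifier: str) -> dict:
    # Hierarchy and structural-attribute features for a URL-like identifier.
    # This feature set is an illustrative assumption, not the patented definition.
    parsed = urlparse(identifier)
    segments = [s for s in parsed.path.split("/") if s]
    return {
        # hierarchy features: where the identifier sits in the site hierarchy
        "depth": len(segments),
        "top_segment": segments[0] if segments else "",
        # structural attribute features: how the identifier string is built
        "has_query": bool(parsed.query),
        "num_digits": sum(c.isdigit() for c in identifier),
        "length": len(identifier),
    }

# e.g. {'depth': 3, 'top_segment': 'forum', 'has_query': True, ...}
print(identifier_features("https://example.com/forum/thread/12345?page=2"))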
3. The identifier classification method according to claim 2, wherein step S102) further comprises, after generating the category identifier of the current identifier according to the hierarchy feature and the structural attribute feature and before selectively identifying the current digital content as the known class digital content according to the category identifier and the class of the known class object:
when the category identifier belongs to the index class, analyzing the text of the current digital content to obtain feature information of the object corresponding to the text, and adding a descriptor to the category identifier according to the feature information.
4. The identifier classification method according to claim 1, characterized in that step S2) includes:
S201) selecting a base machine learning model, and updating neurons of the base machine learning model by using the labeled sample set to obtain a current machine learning model;
S202) obtaining a target class sample set, selecting a migration rule according to the distinguishing features of the target class sample set and the labeled sample set, and updating part of the neurons of the current machine learning model by using the target class sample set and the labeled sample set according to the migration rule to obtain a migration machine learning model, wherein the target class sample set has a vector of target class digital content of a target classification object and a vector of a target class identifier of the target class digital content, and the distinguishing features comprise the distinguishing features of the vector of the target class digital content and the vector of the known class digital content, and the distinguishing features of the vector of the target class identifier and the vector of the known class identifier;
S203) acquiring a target set with identifiers to be classified, inputting the target set into the migration machine learning model to obtain prediction results, and identifying the identifiers to be classified whose prediction results meet a preset condition as belonging to the same target class as the target class sample set.
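A hedged end-to-end sketch of steps S201)-S203) of claim 4, using placeholder tensors and a toy network (the real models, sample sets and preset condition are not specified here), might look as follows:

import torch
import torch.nn as nn

def fit(model, X, y, lr, steps=100, trainable=None):
    # Update (part of) a model's neurons on (X, y); `trainable` optionally
    # restricts training to a subset of parameters (the migration action range).
    params = trainable if trainable is not None else list(model.parameters())
    optimizer = torch.optim.SGD(params, lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(steps):
        optimizer.zero_grad()
        loss_fn(model(X), y).backward()
        optimizer.step()
    return model

# S201): update the base machine learning model with the labeled sample set
# to obtain the current machine learning model (random tensors as placeholders).
base_model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 2))
X_labeled, y_labeled = torch.randn(64, 8), torch.randint(0, 2, (64,))
current_model = fit(base_model, X_labeled, y_labeled, lr=1e-2)

# S202): update only part of the neurons (here: the last layer) with the target
# class samples, per a migration rule chosen from the distinguishing features.
X_target, y_target = torch.randn(16, 8), torch.randint(0, 2, (16,))
migration_model = fit(current_model, X_target, y_target, lr=1e-3,
                      trainable=list(current_model[-1].parameters()))

# S203): predict on identifiers to be classified and keep those whose
# predicted probability meets a preset condition (the 0.5 threshold is assumed).
X_unclassified = torch.randn(10, 8)
probabilities = torch.softmax(migration_model(X_unclassified), dim=1)[:, 1]
target_class_indices = (probabilities > 0.5).nonzero().flatten()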
5. The identifier classification method according to claim 4, characterized in that step S201) further comprises:
S211) generating an attribute vector set corresponding to the distribution features of the labeled sample set according to the attributes of the known class digital content;
S212) selecting a deep neural network learning model, and updating neurons of the deep neural network learning model by using the attribute vector set;
S213) adding the neurons with a classifier function of the updated deep neural network learning model to the neurons with a classifier function of the current machine learning model.
6. The identifier classification method according to claim 4, wherein the step S202) of obtaining the target class sample set comprises:
S221) acquiring the digital content of the target class object and the identifier of the digital content;
S222) selecting current digital content and a current identifier of the current digital content, analyzing the current identifier to obtain a hierarchy feature and a structural attribute feature, generating a category identifier of the current identifier according to the hierarchy feature and the structural attribute feature, selectively identifying the current digital content as target class digital content according to the category identifier and the class of the target class object corresponding to the current identifier, and selectively identifying the current identifier as a target class identifier according to the category identifier and the target class digital content;
S223) after traversing each identifier through step S222), forming a vector of the target class digital content and a vector of the target class identifier according to the distribution features determined by the category identifier and the hierarchy features of the identifier, and forming the vector of the target class digital content and the vector of the target class identifier into a target class sample set.
7. The identifier classification method according to claim 6, wherein the step S202) of selecting the migration rule according to the distinguishing features of the target class sample set and the labeled sample set includes:
S224) obtaining a first proportion between the number of vectors of the target class digital content and the number of vectors of the known class digital content, obtaining a first similarity between the vector of the target class digital content and the vector of the known class digital content, and mapping the first proportion and the first similarity into a first proximity, wherein the first proximity is the distinguishing feature of the vector of the target class digital content and the vector of the known class digital content;
S225) obtaining a second proportion between the number of vectors of the target class identifier and the number of vectors of the known class identifier, obtaining a second similarity between the vector of the target class identifier and the vector of the known class identifier, and mapping the second proportion and the second similarity into a second proximity, wherein the second proximity is the distinguishing feature of the vector of the target class identifier and the vector of the known class identifier;
S226) mapping the first proximity and the second proximity into a third proximity, using the third proximity as the distinguishing feature of the target class sample set and the labeled sample set, selecting a migration action range and a migration rule having the class feature of the target class object according to the third proximity, and selecting part of the neurons of the current machine learning model according to the migration action range.
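The proportion-plus-similarity mapping of claim 7 could be sketched as below; the cosine similarity of mean vectors and the product/average combinations are illustrative assumptions rather than the patented formulas:

import numpy as np

def proximity(target_vecs: np.ndarray, known_vecs: np.ndarray) -> float:
    # Map a count proportion and a similarity to a single proximity value.
    # The concrete mapping (mean-vector cosine similarity weighted by the
    # count ratio) is an illustrative assumption, not the patented formula.
    proportion = min(len(target_vecs), len(known_vecs)) / max(len(target_vecs), len(known_vecs))
    t_mean, k_mean = target_vecs.mean(axis=0), known_vecs.mean(axis=0)
    similarity = float(np.dot(t_mean, k_mean) /
                       (np.linalg.norm(t_mean) * np.linalg.norm(k_mean) + 1e-12))
    return proportion * similarity

# first proximity: target class vs known class digital-content vectors
first = proximity(np.random.rand(20, 32), np.random.rand(200, 32))
# second proximity: target class vs known class identifier vectors
second = proximity(np.random.rand(20, 16), np.random.rand(200, 16))
# third proximity: an assumed combination, here simply the average
third = 0.5 * (first + second)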
8. The identifier classification method according to claim 7, wherein the updating, in step S202), of part of the neurons of the current machine learning model by using the target class sample set and the labeled sample set according to the migration rule to obtain the migration machine learning model comprises:
S227) when the third proximity is greater than a preset threshold, selecting the migration action range as the single-layer range where the last output layer of the current machine learning model is located, replacing the activated neurons of the last output layer with two-class neurons according to the migration rule, and selecting random weight values as the weight values of the two-class neurons;
S228) setting the learning rate of the neurons in the layers of the current machine learning model other than the last output layer to be much smaller than the learning rate used in the process of updating the neurons of the base machine learning model with the labeled sample set to obtain the current machine learning model.
9. The identifier classification method according to claim 8, wherein step S228) includes:
firstly, setting the learning rates of the input layer neurons and the neurons in the neighborhood of the input layer in the current machine learning model to zero;
secondly, setting the learning rate of the neurons in the layers of the current machine learning model other than the last output layer, excluding the input layer neurons and the neurons in the neighborhood of the input layer, to be much smaller than the learning rate used in the process of updating the neurons of the base machine learning model with the labeled sample set to obtain the current machine learning model.
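Claims 8 and 9 together amount to replacing the output head with a randomly initialised two-class layer and assigning per-layer learning rates (zero for the input layer and its neighbourhood, a much smaller rate for the other hidden layers). A sketch under assumed layer sizes and learning-rate values:

import torch
import torch.nn as nn

# Hypothetical current machine learning model; layer sizes are illustrative.
current_model = nn.Sequential(
    nn.Linear(32, 64), nn.ReLU(),   # input layer and its neighbourhood
    nn.Linear(64, 64), nn.ReLU(),   # remaining hidden layers
    nn.Linear(64, 10),              # last output layer
)

# S227): replace the activated output neurons with a two-class layer; nn.Linear
# initialises its weights randomly, which supplies the random weight values.
current_model[-1] = nn.Linear(64, 2)

# S228) / claim 9: zero learning rate for the input layer and its neighbourhood,
# a much smaller rate for the remaining layers, and the base rate only for the
# new two-class head. The concrete rate values are assumptions.
base_lr = 1e-2
optimizer = torch.optim.SGD(
    [
        {"params": current_model[0].parameters(), "lr": 0.0},
        {"params": current_model[2].parameters(), "lr": base_lr / 10},
        {"params": current_model[4].parameters(), "lr": base_lr},
    ],
    lr=base_lr,
)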
10. An identifier application method, characterized in that the method comprises:
obtaining an identifier belonging to the target class by using the method of any one of claims 1 to 9, and mapping the identifier and the digital content corresponding to the identifier to generate an index relationship.
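A minimal sketch of the index relationship of claim 10, assuming plain in-memory storage for illustration:

# Once identifiers have been classified into the target class, each identifier is
# mapped to its corresponding digital content so the content can later be looked
# up by identifier. A plain dictionary is assumed here purely for illustration.
index_relationship = {}

def add_index(identifier: str, digital_content: str) -> None:
    index_relationship[identifier] = digital_content

add_index("https://example.com/forum/thread/12345", "<html>thread content</html>")
content = index_relationship["https://example.com/forum/thread/12345"]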
11. An identifier classification system, comprising:
a migration recognition engine used for receiving a labeled sample data set, receiving a target class sample data set and receiving a target data set with identifiers to be classified, the migration recognition engine being further used for identifying the identifiers to be classified in the target data set as the same target class as the target class sample data set according to a transfer learning method in combination with the labeled sample data set;
wherein the labeled sample data set has a vector of known class digital content data of a known class object and a vector of known class identifiers of different levels of the known class digital content data.
12. The identifier classification system of claim 11, wherein the migration recognition engine comprises:
an identification engine having a base machine learning model, used for receiving the labeled sample data set, updating the base machine learning model into a current machine learning model according to the labeled sample data set, and receiving the target class sample data set;
a control engine used for receiving an identification request signal sent by the identification engine, calculating the distinguishing features of the labeled sample data set and the target class sample data set according to the identification request signal, generating a migration signal according to the distinguishing features, and returning the migration signal to the identification engine;
the identification engine is further used for performing calculation on the labeled sample data set and the target class sample data set according to the migration signal and simultaneously updating the current machine learning model into a migration machine learning model;
the identification engine is further used for receiving the target data set with the identifiers to be classified, performing calculation on the target data set through the migration machine learning model, and determining the identifiers belonging to the target class in the target data set according to a preset condition;
wherein the labeled sample data set has a vector of known class digital content data of a known class object and a vector of known class identifiers of different levels of the known class digital content data;
wherein the target class sample data set has a vector of target class digital content data of a target classification object and a vector of a target class identifier of the target class digital content data;
wherein the identification request signal has sample data information of the labeled sample data set and the target class sample data set;
wherein the distinguishing features comprise distinguishing features of the vector of the target class digital content data and the vector of the known class digital content data, and distinguishing features of the vector of the target class identifier and the vector of the known class identifier;
wherein the migration signal has migration rule information.
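Purely as an illustration of the engine interaction described in claim 12 (the signal fields, the proximity computation and the threshold are all assumptions), the control engine and identification engine could exchange a migration signal roughly as follows:

from dataclasses import dataclass

@dataclass
class MigrationSignal:
    # Carries the migration rule information of claim 12 (fields are assumed).
    action_range: str            # e.g. "last_output_layer" or "all_layers"
    learning_rate_scale: float   # e.g. 0.1 of the original learning rate

class ControlEngine:
    def handle_request(self, labeled_info: dict, target_info: dict) -> MigrationSignal:
        # Compute a distinguishing feature from the request's sample data
        # information and turn it into a migration signal (logic assumed).
        proximity = min(labeled_info["count"], target_info["count"]) / \
                    max(labeled_info["count"], target_info["count"])
        scope = "last_output_layer" if proximity > 0.5 else "all_layers"
        return MigrationSignal(action_range=scope, learning_rate_scale=0.1)

class IdentificationEngine:
    def __init__(self, control: ControlEngine):
        self.control = control
        self.model = "current machine learning model"  # placeholder

    def migrate(self, labeled_set, target_set) -> MigrationSignal:
        # Send an identification request signal, receive the migration signal,
        # and (in a real system) update self.model into the migration model.
        signal = self.control.handle_request({"count": len(labeled_set)},
                                             {"count": len(target_set)})
        return signal

signal = IdentificationEngine(ControlEngine()).migrate(range(200), range(20))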
13. The identifier classification system of claim 12, wherein the migration recognition engine further comprises:
a parsing engine; and
a labeling system having a labeling cycle, the labeling system being used for receiving, at the beginning of the labeling cycle, the digital content data of a current class object and the identifier of the digital content data;
the labeling system is further used for selecting current digital content data and a current identifier of the current digital content data and sending them to the parsing engine;
the parsing engine is further used for parsing the current identifier to generate and send, to the labeling system, a feature signal having the hierarchy feature information and the structural attribute feature information of the current identifier;
the labeling system is further used for generating a category identifier of the current identifier according to the feature information, selectively identifying the current digital content data as current class digital content data according to the category identifier and the class of the current class object corresponding to the current identifier, and selectively identifying the current identifier as a current class identifier according to the category identifier and the current class digital content data;
wherein, after traversing each identifier, the labeling system is further used for generating a vector of the current class digital content data and a vector of the current class identifier according to the distribution features determined by the category identifier and the hierarchy features of the identifier, forming the vector of the current class digital content data and the vector of the current class identifier into a current class sample set, and ending the labeling cycle;
the identification engine is further used for generating and sending a labeled sample data set request signal to the labeling system;
the labeling system is further used for retrieving the labeled sample data set according to the labeled sample data set request signal and sending the labeled sample data set to the identification engine;
the identification engine is further used for generating and sending a target class sample data set request signal to the labeling system;
the labeling system is further used for updating the current class object to the target class object according to the target class sample data set request signal, generating the target class sample data set through the labeling cycle, and sending the target class sample data set to the identification engine.
14. The identifier classification system according to claim 13,
the labeling system is further used for generating and sending a content parsing request to the parsing engine when the category identifier is judged to belong to the index class;
the parsing engine is further used for performing calculation on the current digital content data according to the content parsing request, and generating and sending, to the labeling system, a compensation feature signal having feature information of the object corresponding to the text in the current digital content data;
the labeling system is further used for generating a descriptor according to the compensation feature signal and adding the descriptor to the category identifier in the labeling cycle.
15. The identifier classification system of claim 12, wherein the migration recognition engine further comprises:
a compensation identification engine having a deep neural network learning model, used for receiving an attribute vector set that the control engine generates from the attributes of the known class digital content data and that corresponds to the distribution features of the labeled sample data set, the deep neural network learning model being updated with the attribute vector set;
wherein the control engine transmits the neurons with a classifier function of the updated deep neural network learning model, together with the weight parameter sets of those neurons, to the neurons with a classifier function in the current machine learning model.
16. The identifier classification system of claim 12, wherein the migration recognition engine further comprises:
a window engine used for setting a learning rate and a migration action range, receiving the migration signal, and generating a neuron replacement signal and a weight update signal according to the migration signal;
a weight generator for generating weight values for updating neurons;
wherein the window engine is further used for replacing the activated neurons of the last output layer of the current machine learning model with two-class neurons according to the neuron replacement signal when the distinguishing feature calculated by the control engine is greater than a preset threshold, and for taking random weight values, generated by the weight generator according to the weight update signal, as the weight values of the two-class neurons;
wherein the control engine is further used for setting, through the window engine, the learning rate of the neurons in the layers of the current machine learning model other than the last output layer to be much smaller than the learning rate used in updating the base machine learning model into the current machine learning model according to the labeled sample data set.
17. The identifier classification system according to claim 16,
the control engine is further used for setting, through the window engine, the learning rates of the input layer neurons and the neurons in the neighborhood of the input layer in the current machine learning model to zero;
the control engine is further used for setting, through the window engine, the learning rate of the neurons in the layers of the current machine learning model other than the last output layer, excluding the input layer neurons and the neurons in the neighborhood of the input layer, to be much smaller than the learning rate used in updating the base machine learning model into the current machine learning model according to the labeled sample data set.
18. A sorting apparatus, comprising:
at least one processor;
a memory coupled to the at least one processor;
wherein the memory stores instructions executable by the at least one processor, the at least one processor implementing the method of any one of claims 1 to 9 by executing the instructions stored by the memory.
19. A computer readable storage medium storing computer instructions which, when executed on a computer, cause the computer to perform the method of any one of claims 1 to 9.
CN201910322042.XA 2019-04-22 2019-04-22 Quick classification method, classification system and classification device for class identification Active CN111831949B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910322042.XA CN111831949B (en) 2019-04-22 2019-04-22 Quick classification method, classification system and classification device for class identification

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910322042.XA CN111831949B (en) 2019-04-22 2019-04-22 Quick classification method, classification system and classification device for class identification

Publications (2)

Publication Number Publication Date
CN111831949A true CN111831949A (en) 2020-10-27
CN111831949B CN111831949B (en) 2023-09-15

Family

ID=72912214

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910322042.XA Active CN111831949B (en) 2019-04-22 2019-04-22 Quick classification method, classification system and classification device for class identification

Country Status (1)

Country Link
CN (1) CN111831949B (en)

Patent Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101506767A (en) * 2005-04-22 2009-08-12 谷歌公司 Categorizing objects, such as documents and/or clusters, with respect to a taxonomy and data structures derived from such categorization
US20120047153A1 (en) * 2010-04-20 2012-02-23 Verisign, Inc. Method of and Apparatus for Identifying Machine-Generated Textual Identifiers
GB201408662D0 (en) * 2014-05-15 2014-07-02 Affectv Ltd Internet Domain categorization
US20160078359A1 (en) * 2014-09-12 2016-03-17 Xerox Corporation System for domain adaptation with a domain-specific class means classifier
US20170147944A1 (en) * 2015-11-24 2017-05-25 Xerox Corporation Adapted domain specific class means classifier
US20170161633A1 (en) * 2015-12-07 2017-06-08 Xerox Corporation Transductive adaptation of classifiers without source data
US20170308790A1 (en) * 2016-04-21 2017-10-26 International Business Machines Corporation Text classification by ranking with convolutional neural networks
CN107909101A (en) * 2017-11-10 2018-04-13 清华大学 Semi-supervised transfer learning character identifying method and system based on convolutional neural networks
US10223616B1 (en) * 2018-06-30 2019-03-05 Figleaf Limited System and method identification and classification of internet advertising
CN109327887A (en) * 2018-10-24 2019-02-12 百度在线网络技术(北京)有限公司 Method and apparatus for generating information
CN109523018A (en) * 2019-01-08 2019-03-26 重庆邮电大学 A kind of picture classification method based on depth migration study

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
TANG Shiqi; WEN Yimin; QIN Yixiu: "A Multi-Source Online Transfer Learning Algorithm Based on Local Classification Accuracy", Journal of Software, no. 11, pages 134-154 *
PENG Jie; ZHAO Hui; WANG Yunhong: "Construction of a Multi-Type Information Resource Sharing System Based on Unique Identifiers: Reflections on the Construction Plan of the National Natural Science and Technology Resources e-Platform", Digital Library Forum, no. 07, pages 32-37 *

Also Published As

Publication number Publication date
CN111831949B (en) 2023-09-15

Similar Documents

Publication Publication Date Title
CN105210064B (en) Classifying resources using deep networks
CN111966914B (en) Content recommendation method and device based on artificial intelligence and computer equipment
CN113392209B (en) Text clustering method based on artificial intelligence, related equipment and storage medium
CN112541476B (en) Malicious webpage identification method based on semantic feature extraction
CN108959522B (en) Migration retrieval method based on semi-supervised countermeasure generation network
CN111475603A (en) Enterprise identifier identification method and device, computer equipment and storage medium
CN111832290A (en) Model training method and device for determining text relevancy, electronic equipment and readable storage medium
CN112199600A (en) Target object identification method and device
CN112052451A (en) Webshell detection method and device
CN113051914A (en) Enterprise hidden label extraction method and device based on multi-feature dynamic portrait
CN113761219A (en) Knowledge graph-based retrieval method and device, electronic equipment and storage medium
WO2021210992A9 (en) Systems and methods for determining entity attribute representations
CN116049397A (en) Sensitive information discovery and automatic classification method based on multi-mode fusion
CN115659008A (en) Information pushing system and method for big data information feedback, electronic device and medium
CN113515632A (en) Text classification method based on graph path knowledge extraction
CN112287199A (en) Big data center processing system based on cloud server
Zhu et al. CCBLA: a lightweight phishing detection model based on CNN, BiLSTM, and attention mechanism
CN114090769A (en) Entity mining method, entity mining device, computer equipment and storage medium
CN113297525A (en) Webpage classification method and device, electronic equipment and storage medium
CN111506761A (en) Similar picture query method, device, system and storage medium
CN115344563A (en) Data deduplication method and device, storage medium and electronic equipment
CN111831949B (en) Quick classification method, classification system and classification device for class identification
CN116756306A (en) Object classification method, device, computer equipment and computer readable storage medium
CN114817516A (en) Sketch mapping method, device and medium based on reverse matching under zero sample condition
CN114064905A (en) Network attack detection method, device, terminal equipment, chip and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant