CN110162628A

CN110162628A - Content identification method and device

Info

Publication number: CN110162628A
Application number: CN201910369604.6A
Authority: CN
Inventors: 邓强; 钟滨
Original assignee: Tencent Technology Shenzhen Co Ltd
Current assignee: Tencent Technology Shenzhen Co Ltd
Priority date: 2019-05-06
Filing date: 2019-05-06
Publication date: 2019-08-23
Anticipated expiration: 2039-05-06
Also published as: CN110162628B

Abstract

Embodiments of the invention disclose a content identification method and device, applied to the technical field of information processing. In the method of this embodiment, a content identification device obtains multiple feature subvectors corresponding to a text to be identified according to a preset feature extraction strategy and the text feature vector of the text, and then determines, according to the multiple feature subvectors and a preset machine learning model, region information of where the text to be identified contains specific content. The method requires no manual work: it automatically identifies whether the text to be identified contains specific content and, if so, precisely locates the region containing it, so that filtering of the specific content is more accurate.

Description

Content identification method and device

Technical field

The present invention relates to the technical field of information processing, and in particular to a content identification method and device.

Background art

At present, text pushed by a background server to an application terminal may contain certain specific content, such as advertising text or vulgar text. The specific content therefore needs to be filtered out so that the text displayed on the application terminal does not include it. In this process, the key is to identify the specific content in the text; only then can it be filtered accurately.

One traditional method of identifying specific content is to manually mine keywords of the specific content (such as a registration hotline in advertising text) and set related rules to identify text. However, this method is labor-intensive, prone to omissions, very inefficient, and prone to misjudgment.

Another method of identifying specific content is to input the text into a machine classification model, which extracts text features and identifies the text according to them. This avoids manual labor and improves recognition efficiency and accuracy. In some texts, however, the specific content does not fill the whole text; instead, a short piece of specific content, such as advertising copy, is inserted at the top, bottom, or middle. When the entire text is input into the machine classification model, the amount of information in this specific content is much smaller than that of the entire text, so in most cases the specific content contained in the text cannot be identified effectively.

Summary of the invention

Embodiments of the present invention provide a content identification method and device that automatically identify the region of a text to be identified that contains specific content.

An embodiment of the present invention provides a content identification method, comprising:

obtaining the text feature vector of a text to be identified;

obtaining, according to a preset feature extraction strategy and the text feature vector, multiple feature subvectors corresponding to the text to be identified;

determining, according to the multiple feature subvectors and a preset machine learning model, region information of where the text to be identified contains specific content, wherein the preset machine learning model is used to output, according to multiple feature subvectors of any text, region information of where that text contains specific content, and the preset machine learning model includes a classification model or a regression model.

An embodiment of the present invention provides a content identification device, comprising:

a vector acquiring unit, configured to obtain the text feature vector of a text to be identified;

a subvector acquiring unit, configured to obtain, according to a preset feature extraction strategy and the text feature vector, multiple feature subvectors corresponding to the text to be identified;

a region determining unit, configured to determine, according to the multiple feature subvectors and a preset machine learning model, region information of where the text to be identified contains specific content, wherein the preset machine learning model is used to output, according to multiple feature subvectors of any text, region information of where that text contains specific content, and the preset machine learning model includes a classification model or a regression model.

A third aspect of the embodiments of the present invention provides a storage medium storing a plurality of instructions, the instructions being suitable for being loaded by a processor to execute the content identification method described in the first aspect of the embodiments of the present invention.

A fourth aspect of the embodiments of the present invention provides a terminal device comprising a processor and a storage medium, the processor being configured to implement each instruction;

the storage medium being configured to store a plurality of instructions, the instructions being loaded by the processor to execute the content identification method described in the first aspect of the embodiments of the present invention.

As can be seen, in the method of this embodiment, the content identification device obtains multiple feature subvectors corresponding to the text to be identified according to a preset feature extraction strategy and the text feature vector of the text, and then determines, according to the multiple feature subvectors and a preset machine learning model, region information of where the text to be identified contains specific content. The method requires no manual work: it automatically identifies whether the text to be identified contains specific content and, if so, precisely locates the region containing it, so that filtering of the specific content is more accurate.

Brief description of the drawings

To explain the technical solutions in the embodiments of the present invention or in the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention; those of ordinary skill in the art can obtain other drawings from them without creative effort.

Fig. 1 is a schematic diagram of a content identification method provided by an embodiment of the present invention;

Fig. 2 is a flowchart of a content identification method provided by an embodiment of the present invention;

Fig. 3 is a schematic diagram of performing feature extraction on the text feature vector with context information in an embodiment of the present invention;

Fig. 4 is a flowchart of the method for training the specific content identification model in an embodiment of the present invention;

Fig. 5 is a schematic diagram of article content displayed by a WeChat terminal in an application embodiment of the present invention;

Fig. 6 is a schematic diagram of training the advertising copy identification model in an application embodiment of the present invention;

Fig. 7 is a schematic structural diagram of the initial model for advertising copy identification determined in an application embodiment of the present invention;

Fig. 8 is a flowchart of the advertising copy identification method provided in an application embodiment of the present invention;

Fig. 9 is a schematic structural diagram of a content identification device provided by an embodiment of the present invention;

Fig. 10 is a schematic structural diagram of a terminal device provided by an embodiment of the present invention.

Specific embodiments

The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art based on these embodiments without creative effort fall within the protection scope of the present invention.

The terms "first", "second", "third", "fourth" and the like (if present) in the description, the claims, and the above drawings are used to distinguish similar objects, not to describe a particular order or sequence. It should be understood that data so used are interchangeable where appropriate, so that the embodiments described herein can, for example, be implemented in an order other than those illustrated or described herein. In addition, the terms "include" and "have" and any variations of them are intended to cover non-exclusive inclusion: for example, a process, method, system, product, or device that contains a series of steps or units is not necessarily limited to the steps or units clearly listed, but may include other steps or units that are not clearly listed or that are inherent to the process, method, product, or device.

An embodiment of the present invention provides a content identification method. As shown in Fig. 1, a content identification device can identify specific content through the following steps:

obtain the text feature vector of the text to be identified; obtain, according to a preset feature extraction strategy and the text feature vector, multiple feature subvectors corresponding to the text to be identified (n subvectors are used as an illustration in Fig. 1); and determine, according to the multiple feature subvectors and a preset machine learning model, region information of where the text to be identified contains specific content.

Specific content in this embodiment refers to one or more characters with particular features, such as advertising copy, vulgar or disgusting text, or specific characters.

In this way, no manual work is needed: whether the text to be identified contains specific content is identified automatically and, if it does, the region containing the specific content is precisely located, so that filtering of the specific content is more accurate.

An embodiment of the present invention provides a content identification method, mainly performed by a content identification device. Its flowchart is shown in Fig. 2 and includes:

Step 101: obtain the text feature vector of the text to be identified.

It can be understood that the method of this embodiment can be applied in various scenarios. For example, when a background server pushes text to an application terminal, the background server can initiate the process of this embodiment with the pushed text as the text to be identified; in this case the background server is the content identification device. Alternatively, after the background server pushes text to the application terminal and the application terminal receives it, the application terminal can initiate the process with the pushed text as the text to be identified; in this case the application terminal is the content identification device. As can be seen, different application scenarios trigger the process of this embodiment in different ways; the application scenario of the method is not limited here.

Specifically, the content identification device can obtain the text feature vector of the text to be identified according to a certain strategy. In one case, the content identification device first performs word segmentation on the text to be identified to obtain multiple word segments, determines the vector corresponding to each word segment according to a preset correspondence between word segments and vectors, and finally combines the vectors of the word segments into the text feature vector of the text to be identified.

The preset correspondence between word segments and vectors is set in the content identification device in advance and indicates the vector corresponding to each word segment. It can be obtained from a large number of text samples: each text sample is segmented, the word segments that occur frequently across the samples (for example, with a frequency above a certain threshold) are counted, a vector is set for each of these word segments, and the correspondence between each word segment and its vector is stored in the content identification device.

In this case, while determining the text feature vector, the content identification device compares each word segment of the text to be identified with the word segments in the preset correspondence to determine its vector. If a word segment of the text to be identified matches none of the word segments in the correspondence, its vector is determined to be a specific vector, which can be stored in the content identification device in advance.
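The lookup described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the names `VOCAB`, `UNKNOWN_VECTOR`, `segment`, and `text_feature_vector`, the two-dimensional vectors, and the regex-based segmentation are all assumptions made for the example.

```python
import re

# Hypothetical preset correspondence between word segments and vectors.
VOCAB = {
    "free": [0.9, 0.1],
    "trial": [0.8, 0.3],
    "article": [0.1, 0.7],
}
# Specific vector assigned to any segment absent from the correspondence.
UNKNOWN_VECTOR = [0.0, 0.0]


def segment(text):
    """Toy word segmentation (an assumption): lowercase, split on non-letters."""
    return [w for w in re.split(r"[^a-z]+", text.lower()) if w]


def text_feature_vector(text):
    """Map each segment to its vector, falling back to UNKNOWN_VECTOR."""
    return [VOCAB.get(seg, UNKNOWN_VECTOR) for seg in segment(text)]


features = text_feature_vector("Free trial, limited article!")
```

Here "limited" is not in the hypothetical correspondence, so the third entry of `features` is the fallback vector.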

It should be noted that, in other cases, the content identification device can also use other encoding methods to obtain the text feature vector of the text to be identified, which are not described here.

Step 102: obtain, according to a preset feature extraction strategy and the text feature vector, multiple feature subvectors corresponding to the text to be identified. Taken together, the multiple feature subvectors can represent the entire text to be identified; that is, each feature subvector can represent the text of a certain region of the text to be identified, and adjacent feature subvectors can carry overlapping information.

Specifically, in one case, the content identification device can perform feature extraction directly on the text feature vector according to a sliding-window feature extraction strategy to obtain multiple feature subvectors. If the text feature vector includes the vectors of multiple word segments, then during feature extraction one vector is extracted from the vectors of all word segments within each sliding window, yielding one feature subvector. The window lengths and sliding strides may be the same or different.

In another case, the content identification device first determines a text feature vector with context information according to the text feature vector, and then performs feature extraction on the text feature vector with context information according to the preset feature extraction strategy to obtain multiple feature subvectors.

If the text feature vector includes the vectors of multiple word segments, the content identification device can determine the text feature vector with context information directly from the vectors of the word segments. Alternatively, it can first determine the position vector corresponding to each word segment in the text to be identified, add the vector of each word segment to its corresponding position vector to obtain a summed vector for each word segment, and then determine the text feature vector with context information from the summed vectors. In this way, the content identification device first introduces partial context information (the position vectors) into the vector of each word segment, and then obtains, from the summed vectors and a machine learning model (such as a transform layer), a text feature vector that carries complete context information. Because the word segments of the text to be identified form a sequence in which each word segment has a position, the content identification device can determine the position vector of each word segment according to its position in the sequence and a preset function for calculating position vectors.
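The addition of position vectors can be sketched as follows. The text does not specify the preset function for calculating position vectors, so this example assumes a sinusoidal encoding purely for illustration; `position_vector` and `add_position` are hypothetical names.

```python
import math


def position_vector(pos, dim):
    """Assumed position function: sinusoidal encoding (not stated in the text)."""
    vec = []
    for i in range(dim):
        angle = pos / (10000 ** (2 * (i // 2) / dim))
        # Even dimensions use sine, odd dimensions use cosine.
        vec.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return vec


def add_position(segment_vectors):
    """Add each segment's vector to the position vector of its sequence index."""
    summed = []
    for pos, vec in enumerate(segment_vectors):
        p = position_vector(pos, len(vec))
        summed.append([v + q for v, q in zip(vec, p)])
    return summed


summed = add_position([[0.5, 0.5], [0.5, 0.5]])
```

Two identical segment vectors at different positions yield different summed vectors, which is how position information enters before the context model is applied.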

Further, when performing feature extraction on the text feature vector with context information, the content identification device can do so according to the feature extraction strategy of one sliding window.

For example, as shown in Fig. 3, the text feature vector with context information includes the vectors of 20 word segments, namely vectors 1 to 20. Under a feature extraction strategy with a sliding window of length 3, a sliding stride of 3, and a start position of 0 (i.e. △=0), feature extraction is performed on the text feature vector with context information: one vector is extracted from vectors 1 to 3 as feature subvector 1, one from vectors 4 to 6 as feature subvector 2, and so on, up to one from vectors 16 to 18 as feature subvector 6. This yields 6 feature subvectors, namely feature subvectors 1 to 6.

It should be noted that the content identification device can perform feature extraction on the text feature vector with context information separately under the feature extraction strategies of one or more sliding windows, obtaining multiple groups of feature subvectors. Each group includes multiple feature subvectors and corresponds to the feature extraction strategy of one sliding window.

For example, as shown in Fig. 3, under a feature extraction strategy with a sliding window of length 3, a sliding stride of 3, and a start position of 1 (i.e. △=1), the content identification device extracts one vector from vectors 2 to 4 as feature subvector 1, one from vectors 5 to 7 as feature subvector 2, and so on, up to one from vectors 17 to 19 as feature subvector 6. This yields 6 feature subvectors.

Under a feature extraction strategy with a sliding window of length 3, a sliding stride of 3, and a start position of 2 (i.e. △=2), one vector is extracted from vectors 3 to 5 as feature subvector 1, one from vectors 6 to 8 as feature subvector 2, and so on, up to one from vectors 18 to 20 as feature subvector 6. This also yields 6 feature subvectors.
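The three sliding-window strategies above can be reproduced with a short sketch. The text does not say how one vector is extracted from each window, so element-wise max pooling is assumed here; `window_spans` and `extract_subvectors` are hypothetical names, and `offset` plays the role of △.

```python
def window_spans(num_vectors, length=3, stride=3, offset=0):
    """Return 1-based inclusive (first, last) spans of every complete window."""
    spans = []
    start = offset
    while start + length <= num_vectors:
        spans.append((start + 1, start + length))
        start += stride
    return spans


def extract_subvectors(vectors, length=3, stride=3, offset=0):
    """Pool each window into one feature subvector (element-wise max, assumed)."""
    subvectors = []
    for first, last in window_spans(len(vectors), length, stride, offset):
        window = vectors[first - 1:last]
        subvectors.append([max(column) for column in zip(*window)])
    return subvectors


# 20 toy segment vectors, standing in for vectors 1 to 20 of Fig. 3.
vectors = [[float(i), -float(i)] for i in range(1, 21)]
# One group of feature subvectors per start position (offset 0, 1, 2).
groups = {d: extract_subvectors(vectors, offset=d) for d in range(3)}
```

Each of the three start positions produces 6 windows over the 20 vectors, matching the spans listed in the text (1 to 3 through 16 to 18, 2 to 4 through 17 to 19, and 3 to 5 through 18 to 20).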

Step 103: determine, according to the multiple feature subvectors and a preset machine learning model, region information of where the text to be identified contains specific content. The region information can include the position of the region in the text to be identified (such as its start and end positions) and the size of the region.

The operating logic of the machine learning model is set in the content identification device in advance and can be obtained by training on training samples. The model is used to output, according to multiple feature subvectors of any text, region information of where that text contains specific content, and may include a classification model, a regression model, or the like.

(1) If the machine learning model is a classification model, its input is the multiple feature subvectors and its output is multiple pieces of probability information about containing specific content. If a probability is greater than a certain threshold, the corresponding text is determined to contain specific content; otherwise it does not. Each piece of probability information output corresponds to one feature subvector, and therefore to a certain part of the text to be identified.

In this case, the content identification device first determines, according to the classification model, the result information corresponding to each feature subvector, indicating whether specific content is contained; it then determines, according to the multiple pieces of result information and the feature extraction strategy, region information of where the text to be identified contains specific content.

Specifically, the content identification device first maps the multiple pieces of result information to multiple regions of the text to be identified according to the feature extraction strategy, and then determines the region information according to the result information corresponding to each region of the text to be identified.

If the feature extraction strategy performs feature extraction on the vectors of all word segments in the text to be identified with one sliding window, the content identification device can determine the region of the text to be identified corresponding to each sliding window according to the window's start position, length, and sliding stride.

For example, in the case of △=2 shown in Fig. 3, when the sliding window covers word segment vectors 3 to 5, it represents the 3rd to 5th word segments of the text to be identified. Feature extraction yields feature subvector 1, and step 103 yields result information 1, so result information 1 corresponds to the 3rd to 5th word segments of the text to be identified. If result information 1 indicates that no specific content is contained, the 3rd to 5th word segments of the text to be identified do not contain specific content.
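The mapping from classifier outputs back to text regions can be sketched as follows. The probabilities and the 0.5 threshold are invented for illustration, and `flagged_regions` is a hypothetical helper; only the window arithmetic (start position, length, stride) comes from the text.

```python
def flagged_regions(probabilities, threshold=0.5, length=3, stride=3, offset=0):
    """Return 1-based (first, last) segment spans whose probability exceeds threshold.

    The k-th probability (k counted from 0) belongs to the window that
    starts at segment offset + k * stride + 1.
    """
    regions = []
    for k, probability in enumerate(probabilities):
        if probability > threshold:
            first = offset + k * stride + 1
            regions.append((first, first + length - 1))
    return regions


# Hypothetical classifier outputs for the six windows of the offset-2 strategy.
probabilities = [0.1, 0.9, 0.2, 0.7, 0.8, 0.1]
regions = flagged_regions(probabilities, offset=2)
```

With these made-up probabilities, the first window (segments 3 to 5) is not flagged, while the second, fourth, and fifth windows are.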

It should be noted that, if the content identification device performed feature extraction in step 102 under the feature extraction strategies of multiple sliding windows, obtaining multiple groups of feature subvectors, then in step 103 it determines one group of region information from each group of feature subvectors and the preset machine learning model, and then determines the final region information of where the text to be identified contains specific content from the multiple groups of region information. The regions represented by any two groups of region information may be repeated, overlapping, or contiguous, so contiguous regions can be merged when determining the final region information.

For example, suppose the 1st group of region information indicates regions 1, 2, and 3, and the 2nd group indicates regions 1 and 4, where regions 2 and 4 are contiguous and together form region 5. The final region information is then the specific information of regions 1, 5, and 3.
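One way to combine several groups of region information is ordinary interval merging, sketched below. This is an assumption about how the merge could be done: the text only says contiguous regions can be merged, while this sketch also collapses repeated and overlapping spans. `merge_regions` and the example spans are hypothetical.

```python
def merge_regions(groups):
    """Merge repeated, overlapping, or contiguous 1-based (first, last) spans."""
    # The set drops spans repeated across groups; sorting orders them by start.
    spans = sorted({span for group in groups for span in group})
    merged = []
    for first, last in spans:
        if merged and first <= merged[-1][1] + 1:
            # Overlaps or directly follows the previous span: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], last))
        else:
            merged.append((first, last))
    return merged


group_1 = [(1, 3), (7, 9), (16, 18)]   # stands in for regions 1, 2, 3
group_2 = [(1, 3), (10, 12)]           # stands in for regions 1 and 4
final_regions = merge_regions([group_1, group_2])
```

In this toy data the spans (7, 9) and (10, 12) are contiguous and become one span, mirroring how regions 2 and 4 form region 5 in the example above.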

(2) If the preset machine learning model is a regression model, its input is the multiple feature subvectors and its output is multiple groups of position information of the specific content. Each group of position information includes the start position and end position of the specific characters within the text represented by the corresponding feature subvector.

As can be seen, in the method of this embodiment, the content identification device obtains multiple feature subvectors corresponding to the text to be identified according to a preset feature extraction strategy and the text feature vector of the text, and then determines, according to the multiple feature subvectors and a preset machine learning model, region information of where the text to be identified contains specific content. The method requires no manual work: it automatically identifies whether the text to be identified contains specific content and, if so, precisely locates the region containing it, so that filtering of the specific content is more accurate.

It should be noted that steps 101 to 103 can be implemented by a specific content identification model, which includes a vector acquisition module and the machine learning model. In a specific embodiment, the content identification device can train the specific content identification model through the following steps, whose flowchart is shown in Fig. 4:

Step 201: determine the initial model for specific content identification.

It can be understood that, when determining the initial model for specific content identification, the content identification device determines the multilayer structure included in the initial model and the initial values of the fixed parameters in each layer. The initial model specifically includes a vector acquisition module and a machine learning model. The vector acquisition module performs the steps of obtaining the text feature vector and obtaining the feature subvectors, i.e. steps 101 and 102; the machine learning model determines, according to the multiple feature subvectors obtained by the vector acquisition module, region information of where the text to be identified contains specific content, i.e. step 103. Specifically, the multilayer structure in the initial model can be any of the following algorithm structures: a convolutional neural network (Convolutional Neural Network, CNN), a fully convolutional network (Fully Convolutional Networks for Semantic Segmentation, FCN), and so on.

Here, fixed parameters are the parameters that each layer of the initial model uses during computation and that do not need to be assigned at any time, such as weights and angles.

Step 202: determine training samples, which include multiple texts and annotation information indicating whether each text contains specific content.

Further, to make the final specific content identification model more accurate, the training samples can also include annotation information on the region of each text that contains specific content, such as its start and end positions.

Step 203: determine, through the initial model for specific content identification, an initial result of the region information of where each text contains specific content.

Specifically, when the initial model determines the region information of where any text in the training samples contains specific content, the vector acquisition module in the initial model obtains the text feature vector of that text and obtains, according to the preset feature extraction strategy and the text feature vector, the multiple feature subvectors corresponding to that text; the machine learning model then determines the region information of where that text contains specific content.

Step 204: adjust the fixed parameter values in the initial model for specific content identification according to the initial results determined by the initial model in step 203 and the annotation information in the training samples, to obtain the final specific content identification model.

Specifically, the content identification device first calculates a loss function related to the initial model according to the initial results determined in step 203 and the annotation information in the training samples. The loss function indicates the error with which the initial model calculates the region information of where each text in the training samples contains specific content.

Here, the loss function includes a term representing the difference between the result information on the region containing specific content that the initial model determines for each text and whether each text in the training samples actually contains specific content (obtained from the annotation information in the training samples).

These errors are usually expressed mathematically by building the loss function with a cross-entropy loss function, such as the binary cross entropy loss. The training process of the specific content identification model needs to reduce the value of these errors as much as possible: through a series of mathematical optimization techniques such as back-propagation differentiation and gradient descent, it continually optimizes the values of the fixed parameters in the initial model determined in step 201 so as to minimize the computed value of the loss function.
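For reference, the binary cross entropy loss mentioned above can be computed as in the following sketch. The function name, the clamping constant `eps`, and the sample values are assumptions for illustration; the formula itself is the standard mean of -(y log p + (1 - y) log(1 - p)).

```python
import math


def binary_cross_entropy(predicted, labels, eps=1e-12):
    """Mean binary cross entropy between probabilities and 0/1 labels."""
    total = 0.0
    for p, y in zip(predicted, labels):
        # Clamp to avoid log(0) for probabilities of exactly 0 or 1.
        p = min(max(p, eps), 1.0 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(labels)


# Hypothetical per-window probabilities against annotated labels.
good_loss = binary_cross_entropy([0.9, 0.1], [1, 0])
bad_loss = binary_cross_entropy([0.2, 0.7], [1, 0])
```

Predictions that agree with the annotations give a smaller loss than predictions that disagree, which is the quantity training tries to reduce.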

Therefore, after calculating the loss function, the content identification device needs to adjust the fixed parameter values in the initial model according to the calculated loss function to obtain the final specific content identification model. Specifically, if the calculated value of the loss function is large, for example greater than a preset value, the fixed parameter values need to be changed, for example by reducing a certain weight, so that the value of the loss function calculated with the adjusted fixed parameter values decreases.

In addition, it should be noted that steps 203 to 204 describe a single adjustment of the fixed parameter values in the initial model, based on the initial results calculated by the initial model. In practical applications, steps 203 to 204 need to be executed in a loop until the adjustment of the fixed parameter values meets a certain stop condition.

Therefore, after performing steps 201 to 204, the content identification device also needs to judge whether the current adjustment of the fixed parameter values meets a preset stop condition. If so, the process ends; if not, steps 203 to 204 are executed again for the initial model with the adjusted fixed parameter values.

The preset stop conditions include, but are not limited to, any of the following: the difference between the currently adjusted fixed parameter values and the previously adjusted fixed parameter values is less than a threshold, i.e. the adjusted fixed parameter values have converged; or the number of adjustments to the fixed parameter values equals a preset number.
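The loop over steps 203 to 204 with both stop conditions can be sketched as follows. This is a toy illustration under stated assumptions: `train` and `toy_step` are hypothetical, and the update rule is plain gradient descent on the one-parameter function (w - 3)^2 rather than the model in the text.

```python
def train(step_fn, params, tol=1e-6, max_updates=1000):
    """Repeat parameter updates until either stop condition is met.

    Stops when the largest parameter change is below tol (convergence)
    or when the number of updates reaches max_updates.
    """
    for update in range(1, max_updates + 1):
        new_params = step_fn(params)
        change = max(abs(n - o) for n, o in zip(new_params, params))
        params = new_params
        if change < tol:
            return params, update, "converged"
    return params, max_updates, "max_updates"


def toy_step(params, learning_rate=0.1):
    """One gradient-descent step on f(w) = (w - 3)^2 (toy stand-in for steps 203-204)."""
    (w,) = params
    gradient = 2.0 * (w - 3.0)
    return [w - learning_rate * gradient]


converged_params, updates, reason = train(toy_step, [0.0])
capped_params, capped_updates, capped_reason = train(toy_step, [0.0], max_updates=5)
```

The first call ends on the convergence condition; the second is cut off by the preset number of adjustments before it converges.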

The content identification method of the present invention is illustrated below with a specific application example. The method of this embodiment is mainly applied to the article recommendation function of an instant messaging application, specifically the "Have a Look" function of WeChat. The text to be identified is an article pushed by the WeChat backend, the content identification device is the WeChat terminal, and the specific content is advertising copy.

When a user operates the WeChat terminal so that it opens the article recommendation function, the WeChat terminal displays the article information pushed by the WeChat backend, such as article titles. When the user clicks a piece of article information, the WeChat terminal displays the body of that article. The quality of the article body largely determines the user's reading experience, yet in many articles sent by the WeChat backend, specific content that interferes with reading, such as advertising copy, appears in different regions of the body.

For example, as shown in Fig. 5, advertising copy appears at the top of the displayed article: "Limited-time offer: scan the code for one free trial of the beauty and image open class, phone 1234567", and advertising copy appears at the bottom of the article: "Enterprises wishing to register, please contact us as soon as possible and leave your phone number, WeChat or other contact details". It is therefore critical for the WeChat terminal to effectively identify the regions of an article that contain specific content such as advertising copy, so that the advertising copy in those regions can be accurately filtered out.

Specifically, the method in this embodiment may include the following two parts:

(1) Referring to Fig. 6, the WeChat terminal or the WeChat backend can train a specific content identification model, which in this embodiment is specifically an advertising copy identification model, through the following steps:

Step 301: determine the initial model for advertising copy identification.

Fig. 7 shows the structure of the initial model for advertising copy identification, which may include a vector obtaining module and a machine learning model. The vector obtaining module may include a word segmentation module, a transformer layer (transformer) and a feature extraction layer (feature extraction), and the machine learning model is specifically a classification layer (classifier). The word segmentation module performs word segmentation on the input text, obtains the vector of each segmented word, combines these vectors into a text feature vector, and inputs the text feature vector into the transformer layer. The transformer layer obtains a text feature vector with contextual information from the text feature vector. The feature extraction layer performs feature extraction on the text feature vector with contextual information to obtain multiple feature subvectors. The classification layer classifies each of the feature subvectors, judges whether the text of the corresponding region contains advertising copy, and outputs the probability that it does.

Further, the transformer layer may include positional encoding (Positional Encoding), multi-head attention (Multi-Head Attention) and a feed-forward network (Feed-Forward). Positional encoding gives the vector of each segmented word in the text feature vector contextual information, which avoids the problem that a few keywords in a small fragment make the model overly sensitive to local information and cause the advertising copy identification model to flag text by mistake. Multi-head attention lets the advertising copy identification model focus more on positions where advertising copy often appears, such as the top or bottom of the text. In a specific application, the transformer layer can be stacked multiple times, or a universal transformer layer (universal transformer) can be used, which gives better results.

The feature extraction layer may include a convolutional layer (Conv1D), a max pooling layer (MaxPooling) and an offset max pooling layer (Offset MaxPooling). With the offset max pooling layer, the vector is split only after the multiple convolution computations have been performed on it, which greatly improves computational efficiency. In addition, adjacent feature subvectors obtained from the text feature vector overlap, so that the contextual information remains coherent.

The classification layer may include a convolutional layer, a feature map (FeatureMap) and a maximum operation (Max), and may also include a dropout layer.

It should be noted that the above advertising copy identification model uses a fully convolutional neural network, which produces a variable-length output for a variable-length text input. This has two advantages: (1) the advertising copy identification model supports variable-length training, which makes full use of the text information in the training samples, and it supports text to be identified of arbitrary length at prediction time; (2) a convolutional network computes quickly and is easy to parallelize, so multiple regions of the entire text to be identified can be processed simultaneously, which is efficient.

It should further be noted that, in other embodiments, the machine learning model may also be a regression model, which can directly output the start position and end position of the advertising copy in the text.

Step 302: determine the training sample. The training sample includes multiple texts, specifically more than 100,000 articles, and annotation information indicating whether each article contains advertising copy.

Normally, the articles in the training sample would first be padded (padding) to the same length. However, article length varies widely, typically from a few dozen words to several thousand words. If a fixed length were chosen, short articles would require a large amount of padding, while long articles would have much of their content cut off, resulting in information loss.

Therefore, this embodiment may adopt variable-length training. Specifically, a portion of the articles in the training sample is first padded to the same length, the input articles are then sorted by length, and articles of equal length are placed in the same batch (batch). At the same time, minimum and maximum article input lengths are imposed.
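One possible way to organize such variable-length batches is sketched below. The rounding step `step`, the `<pad>` token, the small length limits and the function name `bucket_by_length` are hypothetical choices made for illustration; the embodiment only states that some articles are padded, that articles are sorted by length, and that equal-length articles share a batch.

```python
from itertools import groupby
from typing import Dict, List

PAD = "<pad>"  # hypothetical padding token


def bucket_by_length(
    articles: List[List[str]],
    min_len: int = 4,
    max_len: int = 12,
    step: int = 4,
) -> Dict[int, List[List[str]]]:
    """Group tokenized articles into batches whose members share one length."""
    prepared = []
    for tokens in articles:
        tokens = tokens[:max_len]                    # enforce the maximum length
        rounded = -(-len(tokens) // step) * step     # round up to a multiple of step
        target = min(max(min_len, rounded), max_len) # enforce the minimum length
        prepared.append(tokens + [PAD] * (target - len(tokens)))
    prepared.sort(key=len)                           # sort by padded length
    return {n: list(group) for n, group in groupby(prepared, key=len)}


articles = [["a"] * 3, ["b"] * 5, ["c"] * 8, ["d"] * 20, ["e"] * 4]
batches = bucket_by_length(articles)
```

Rounding lengths up to a small set of values keeps the padding per article bounded by `step - 1` tokens while still producing batches of uniform length.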

Step 303: determine, through the initial model for advertising copy identification, the initial results of the area information of the advertising copy contained in each article.

Step 304: adjust the preset parameter values in the initial model for advertising copy identification according to the initial results determined by that initial model in step 303 and the annotation information in the training sample, so as to obtain the final advertising copy identification model.

Specifically, a loss function related to the initial model for advertising copy identification may first be calculated from the initial results and the annotation information, and the preset parameter values in the initial model are then adjusted according to the calculated loss function.

The final advertising copy identification model is obtained by executing steps 303 and 304 multiple times.

(2) As shown in Fig. 8, the WeChat terminal can use the advertising copy identification model trained above to identify advertising copy in an article to be identified (i.e. the above text to be identified) through the following steps:

Step 401: perform text preprocessing on the article to be identified, such as special character handling, English letter case conversion, and unification of traditional and simplified Chinese characters.
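A minimal sketch of this kind of preprocessing is given below. The three-entry `TRAD_TO_SIMP` table is a hypothetical placeholder (a real system would need a full traditional-to-simplified mapping), and the choice of NFKC normalization and of replacing symbols with spaces are assumptions, not details given by this embodiment.

```python
import re
import unicodedata

# Hypothetical, tiny traditional-to-simplified table used only for illustration.
TRAD_TO_SIMP = {"廣": "广", "體": "体", "驗": "验"}


def preprocess(text: str) -> str:
    """Normalize width and case, unify character variants, drop symbols."""
    text = unicodedata.normalize("NFKC", text)      # fold full-width letters and digits
    text = text.lower()                             # English letter case conversion
    text = "".join(TRAD_TO_SIMP.get(ch, ch) for ch in text)
    text = re.sub(r"[^\w\s]", " ", text)            # special characters become spaces
    return re.sub(r"\s+", " ", text).strip()        # collapse repeated whitespace


cleaned = preprocess("ＦＲＥＥ  廣告!!")
```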

Step 402: input the preprocessed text into the advertising copy identification model trained above. The word segmentation module in the model first segments the input text, determines the vector of each segmented word, and combines these vectors into a text feature vector.

Specifically, the vector of each segmented word can be determined from the segmented words of the input text and the correspondence between segmented words and vectors preset in the WeChat terminal. If a segmented word does not match any segmented word in the correspondence, the vector of the special symbol <unk> can be used directly to represent it.

This embodiment supports variable-length text input, with a minimum length of 81 and a maximum length of 1025. If the length exceeds the maximum limit, the text is truncated; if it falls below the minimum, blank characters are appended at the end. Suppose step 402 divides the input text into w1, w2, ..., wk, k segmented words in total, and each segmented word is mapped to a 300-dimensional real-valued vector; the input text is then mapped to a k*300 matrix.
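The lookup, truncation and padding described for step 402 can be sketched as follows. The limits 81 and 1025 and the dimension 300 come from this embodiment; the name `<blank>` for the blank character, the function name `to_matrix` and the toy lookup table are hypothetical.

```python
from typing import Dict, List

MIN_LEN, MAX_LEN, DIM = 81, 1025, 300   # limits and vector size stated in this embodiment
UNK, BLANK = "<unk>", "<blank>"          # "<blank>" is a hypothetical name for the blank character


def to_matrix(tokens: List[str], table: Dict[str, List[float]]) -> List[List[float]]:
    """Map segmented words to a k x DIM matrix with MIN_LEN <= k <= MAX_LEN."""
    tokens = tokens[:MAX_LEN]                                  # truncate long input
    tokens = tokens + [BLANK] * max(0, MIN_LEN - len(tokens))  # pad short input at the end
    # Words missing from the table fall back to the <unk> vector.
    return [table.get(tok, table[UNK]) for tok in tokens]


# Toy lookup table; real vectors would come from the preset correspondence.
table = {UNK: [0.0] * DIM, BLANK: [0.0] * DIM, "free": [0.1] * DIM}
matrix = to_matrix(["free", "xyzzy"], table)
```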

Step 403: the transformer layer in the advertising copy identification model determines a text feature vector with contextual information from the text feature vector obtained in step 402, which is specifically a 1*81*300 matrix, assuming k is 81.

Specifically, the positional encoding in the transformer layer first determines the position vector of each segmented word, and the vector of each segmented word is added to its corresponding position vector; the resulting sum vectors carry partial contextual information. The multi-head attention and feed-forward network of the transformer layer then generate, from the sum vectors, a text feature vector with contextual information of dimension 1*81*300, in which the full contextual information can be learned.
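The addition of position vectors to word vectors can be sketched as follows. The sinusoidal formula used here is the one commonly associated with transformer models and is an assumption; this embodiment does not state how the position vectors are computed. The function names are hypothetical.

```python
import math
from typing import List


def position_vector(pos: int, dim: int) -> List[float]:
    """Sinusoidal position vector (an assumption; the text does not fix the formula)."""
    vec = []
    for i in range(dim):
        angle = pos / (10000 ** (2 * (i // 2) / dim))
        vec.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return vec


def add_positions(matrix: List[List[float]]) -> List[List[float]]:
    """Add each row's position vector to the word vector in that row."""
    return [
        [x + p for x, p in zip(row, position_vector(pos, len(row)))]
        for pos, row in enumerate(matrix)
    ]


summed = add_positions([[0.0] * 4 for _ in range(3)])
```

Because the word vectors in this toy call are all zero, the result contains exactly the position vectors, which makes the effect of the addition easy to inspect.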

Step 404: the feature extraction layer in the advertising copy identification model performs feature extraction through a convolutional layer of dimension 61, and then generates multiple sliding-window-based feature subvectors through the offset max pooling layer. As shown in Fig. 3 above, this is equivalent to splitting the text feature vector with contextual information and obtaining the feature subvector of each small fragment.

The multiple feature subvectors are then input simultaneously into the classification layer of the advertising copy identification model.
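The idea of producing overlapping, sliding-window-based feature subvectors can be sketched with a plain windowed max pooling. This is not claimed to reproduce the offset max pooling layer of the embodiment, whose exact definition is not given here; the window size, stride and function name are hypothetical.

```python
from typing import List, Tuple


def sliding_window_max(
    features: List[List[float]], window: int, stride: int
) -> List[Tuple[int, int, List[float]]]:
    """Return (start, end, pooled_vector) for each window over the token axis."""
    if not features:
        return []
    out = []
    for start in range(0, max(len(features) - window, 0) + 1, stride):
        chunk = features[start:start + window]
        pooled = [max(col) for col in zip(*chunk)]   # max over tokens, per channel
        out.append((start, start + len(chunk), pooled))
    return out


# Six tokens with two channels each; windows of 4 tokens taken every 2 tokens.
features = [[float(i), float(-i)] for i in range(6)]
subvectors = sliding_window_max(features, window=4, stride=2)
```

With a stride smaller than the window, neighbouring windows share tokens, which is one way to obtain the overlap between adjacent feature subvectors mentioned above.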

Step 405: the classification layer in the advertising copy identification model judges, for each of the feature subvectors, whether the corresponding region of the input text contains advertising copy, obtains the probability that each region contains advertising copy, and outputs a one-dimensional vector, i.e. the result information indicating whether the region corresponding to each feature subvector contains advertising copy. The length of the vector is determined by the length of the input text, because 1x1 convolution kernels are used in the last two convolutional layers of the classification layer.

Each value in the variable-length output vector represents the probability that one region of the input text contains advertising copy, and the values are arranged in order. To obtain a fixed-length output, the maximum value is finally taken as the output.

Step 406: the WeChat terminal determines the area information of the advertising copy contained in the article to be identified, such as the top or the bottom, from the result information obtained in step 405 and the sliding window information on which the feature extraction layer of the advertising copy identification model based its extraction of the feature subvectors, and filters the advertising copy in the article to be identified.
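Mapping the per-region probabilities back to positions in the article can be sketched as follows, assuming that region i covers the tokens from i * stride to i * stride + window. The threshold of 0.5, the merging of touching regions and the function name are assumptions made for illustration.

```python
from typing import List, Tuple


def flagged_regions(
    probs: List[float], window: int, stride: int, threshold: float = 0.5
) -> List[Tuple[int, int]]:
    """Map per-window probabilities back to merged token ranges [start, end)."""
    regions: List[Tuple[int, int]] = []
    for i, p in enumerate(probs):
        if p < threshold:
            continue
        start, end = i * stride, i * stride + window
        if regions and start <= regions[-1][1]:   # overlaps or touches the previous region
            regions[-1] = (regions[-1][0], end)
        else:
            regions.append((start, end))
    return regions


probs = [0.9, 0.8, 0.1, 0.2, 0.7]
regions = flagged_regions(probs, window=4, stride=2)
article_score = max(probs)   # fixed-length output: the maximum over all regions
```

The token ranges returned here are what a filter would remove or hide; the single `article_score` corresponds to taking the maximum value as the fixed-length output.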

It should be noted that, for a relatively long article, memory limitations in the WeChat terminal mean that the article cannot be input directly into the above advertising copy identification model for processing. The article must first be divided into multiple segments, and the text of each segment is input into the advertising copy identification model to obtain the area information of the advertising copy contained in that segment. The area information from all segments is then combined so that the advertising copy in the entire article can be filtered accurately.
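The segment-and-combine procedure for long articles can be sketched as follows. The function `flag_ad_runs` is a hypothetical stand-in for the trained model, and the segment length and merging rule are assumptions; the sketch only shows how per-segment regions are shifted back into article coordinates and combined.

```python
from typing import Callable, List, Tuple

Region = Tuple[int, int]  # half-open token range [start, end)


def identify_long_article(
    tokens: List[str],
    identify: Callable[[List[str]], List[Region]],
    segment_len: int = 1025,
) -> List[Region]:
    """Run `identify` on fixed-size segments and merge the shifted regions."""
    merged: List[Region] = []
    for offset in range(0, len(tokens), segment_len):
        segment = tokens[offset:offset + segment_len]
        for start, end in identify(segment):
            start, end = start + offset, end + offset   # back to article coordinates
            if merged and start <= merged[-1][1]:       # touches the previous region
                merged[-1] = (merged[-1][0], max(merged[-1][1], end))
            else:
                merged.append((start, end))
    return merged


def flag_ad_runs(segment: List[str]) -> List[Region]:
    """Hypothetical stand-in for the model: flags each run of "ad" tokens."""
    regions, start = [], None
    for i, tok in enumerate(segment + ["<end>"]):
        if tok == "ad" and start is None:
            start = i
        elif tok != "ad" and start is not None:
            regions.append((start, i))
            start = None
    return regions


article = ["ad"] * 2 + ["x"] * 5 + ["ad"] * 3
article_regions = identify_long_article(article, flag_ad_runs, segment_len=4)
```

In this toy run the advertisement at the end of the article is split across two segments, and the merge step joins the two pieces into one region.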

An embodiment of the present invention also provides a content identification apparatus, whose structural schematic diagram is shown in Fig. 9. The apparatus may specifically include:

A vector acquiring unit 10, configured to obtain the text feature vector of the text to be identified.

The vector acquiring unit 10 is specifically configured to perform word segmentation on the text to be identified to obtain multiple segmented words; determine the vector corresponding to each of the segmented words according to a preset correspondence between segmented words and vectors; and combine the vectors corresponding to the segmented words into the text feature vector of the text to be identified.

A subvector acquiring unit 11, configured to obtain multiple feature subvectors corresponding to the text to be identified according to a preset feature extraction strategy and the text feature vector obtained by the vector acquiring unit 10.

The subvector acquiring unit 11 is specifically configured to determine a text feature vector with contextual information according to the text feature vector, and to perform feature extraction on the text feature vector with contextual information according to the preset feature extraction strategy to obtain multiple feature subvectors.

When determining the text feature vector with contextual information, the subvector acquiring unit 11 is specifically configured to determine the position vector corresponding to each of the segmented words in the text to be identified; add the vector of each segmented word to its corresponding position vector to obtain the sum vectors of the segmented words; and determine the text feature vector with contextual information according to the sum vectors of the segmented words.

An area determination unit 12, configured to determine the area information of the specific content contained in the text to be identified according to the multiple feature subvectors obtained by the subvector acquiring unit 11 and a preset machine learning model. The preset machine learning model is used to output, from the multiple feature subvectors of any text, the area information of the specific content contained in that text, and the preset machine learning model includes a classification model or a regression model.

The area determination unit 12 is specifically configured to, if the preset machine learning model is a classification model, determine according to the classification model the result information indicating whether each of the feature subvectors corresponds to specific content, and determine the area information of the specific content contained in the text to be identified according to the multiple pieces of result information and the feature extraction strategy. The area determination unit 12 may, according to the feature extraction strategy, match the multiple pieces of result information to multiple regions in the text to be identified, and determine the area information of the specific content contained in the text to be identified according to the result information corresponding to each region of the text to be identified.

It should be noted that, if the subvector acquiring unit 11 performs feature extraction on the above text feature vector with contextual information according to feature extraction strategies with several different sliding windows, it performs feature extraction on the text feature vector with contextual information once per strategy and obtains multiple groups of feature subvectors, each group including multiple feature subvectors. In that case the area determination unit 12 is specifically configured to determine multiple groups of area information, one from each group of feature subvectors and the preset machine learning model, and to determine the final area information of the specific content contained in the text to be identified according to the multiple groups of area information.
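How the multiple groups of area information are combined is not specified here. One possible sketch, purely as an assumption, is to spread each window's probability onto the tokens it covers, average the per-token scores over all sliding-window strategies, and apply a threshold. The function name, the averaging rule and the threshold are all hypothetical.

```python
from typing import List, Tuple


def combine_groups(
    n_tokens: int,
    groups: List[Tuple[int, int, List[float]]],   # (window, stride, per-window probabilities)
    threshold: float = 0.5,
) -> List[bool]:
    """Average per-token scores over all sliding-window strategies, then threshold."""
    if not groups:
        return [False] * n_tokens
    totals = [0.0] * n_tokens
    for window, stride, probs in groups:
        scores = [0.0] * n_tokens
        for i, p in enumerate(probs):
            for t in range(i * stride, min(i * stride + window, n_tokens)):
                scores[t] = max(scores[t], p)     # a token keeps its highest window score
        totals = [a + b for a, b in zip(totals, scores)]
    return [total / len(groups) >= threshold for total in totals]


# Two strategies over a 6-token text: windows of 2 tokens and windows of 3 tokens.
groups = [(2, 2, [0.9, 0.1, 0.1]), (3, 3, [0.8, 0.2])]
final_mask = combine_groups(6, groups)
```

A token is kept in the final area only when the strategies agree on average, which dampens a false alarm raised by a single window size.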

Further, the content identification apparatus of this embodiment may also include a training unit 13, configured to: determine the initial model for specific content identification, which includes a vector obtaining module and the machine learning model, where the vector obtaining module is used to execute the steps of obtaining the text feature vector and obtaining the feature subvectors, and the machine learning model is used to determine, from the multiple feature subvectors obtained by the vector obtaining module, the area information of the specific content contained in the text to be identified; determine a training sample, which includes multiple texts and annotation information indicating whether each text contains specific content; determine, through the initial model for specific content identification, the initial results of the area information of the specific content contained in each text; and adjust the preset parameter values in the initial model for specific content identification according to the initial results determined by that initial model and the annotation information in the training sample, so as to obtain the final specific content identification model. After the training unit 13 has trained the specific content identification model, the vector acquiring unit 10, the subvector acquiring unit 11 and the area determination unit 12 can determine the result information through the specific content identification model obtained by the training unit 13.

Further, the training unit 13 is also configured to stop adjusting the preset parameter values if the adjustment of the preset parameter values meets either of the following stop conditions: the number of adjustments to the preset parameter values equals a preset number; or the difference between the currently adjusted preset parameter value and the preset parameter value of the previous adjustment is less than a threshold.

As can be seen, in the content identification apparatus of this embodiment, the subvector acquiring unit 11 obtains the multiple feature subvectors corresponding to the text to be identified according to the preset feature extraction strategy and the text feature vector of the text to be identified, and the area determination unit 12 then determines the area information of the specific content contained in the text to be identified according to the multiple feature subvectors and the preset machine learning model. In this way, whether the text to be identified contains specific content can be identified automatically without manual work, and if it does, the region containing the specific content can be located precisely, so that the filtering of the specific content is more accurate.

An embodiment of the present invention also provides a terminal device, whose structural schematic diagram is shown in Fig. 10. The terminal device may vary considerably depending on configuration or performance, and may include one or more central processing units (CPU) 20 (for example, one or more processors), a memory 21, and one or more storage media 22 (for example, one or more mass storage devices) storing application programs 221 or data 222. The memory 21 and the storage medium 22 may provide transient or persistent storage. The program stored in the storage medium 22 may include one or more modules (not shown in the figure), and each module may include a series of instruction operations on the terminal device. Further, the central processing unit 20 may be configured to communicate with the storage medium 22 and execute, on the terminal device, the series of instruction operations in the storage medium 22.

Specifically, the application programs 221 stored in the storage medium 22 include a content identification application program, which may include the vector acquiring unit 10, the subvector acquiring unit 11, the area determination unit 12 and the training unit 13 of the above content identification apparatus; these are not described again here. Further, the central processing unit 20 may be configured to communicate with the storage medium 22 and execute, on the terminal device, the series of operations corresponding to the content identification application program stored in the storage medium 22.

The terminal device may also include one or more power supplies 23, one or more wired or wireless network interfaces 24, one or more input/output interfaces 25, and/or one or more operating systems 223, such as Windows Server™, Mac OS X™, Unix™, Linux™ or FreeBSD™.

The steps performed by the content identification apparatus in the above method embodiments may be based on the structure of the terminal device shown in Fig. 10.

An embodiment of the present invention also provides a storage medium that stores a plurality of instructions, the instructions being suitable for being loaded by a processor to execute the content identification method performed by the above content identification apparatus.

An embodiment of the present invention also provides a terminal device including a processor and a storage medium. The processor is configured to implement each instruction; the storage medium is configured to store a plurality of instructions, the instructions being loaded by the processor to execute the content identification method performed by the above content identification apparatus.

Those of ordinary skill in the art will appreciate that all or part of the steps in the various methods of the above embodiments can be completed by a program instructing relevant hardware. The program can be stored in a computer-readable storage medium, and the storage medium may include read-only memory (ROM), random access memory (RAM), a magnetic disk, an optical disc, or the like.

The content identification method and apparatus provided by the embodiments of the present invention have been described in detail above. Specific examples are used herein to explain the principle and implementation of the present invention, and the description of the above embodiments is only intended to help in understanding the method of the present invention and its core idea. At the same time, those of ordinary skill in the art may, following the idea of the present invention, make changes to the specific implementation and scope of application. In summary, the content of this specification should not be construed as limiting the present invention.

Claims

1. A content identification method, characterized by comprising:

obtaining a text feature vector of a text to be identified;

obtaining multiple feature subvectors corresponding to the text to be identified according to a preset feature extraction strategy and the text feature vector;

determining, according to the multiple feature subvectors and a preset machine learning model, area information of specific content contained in the text to be identified, wherein the preset machine learning model is used to output, from multiple feature subvectors of any text, the area information of the specific content contained in that text, and the preset machine learning model includes a classification model or a regression model.

2. The method according to claim 1, characterized in that obtaining the text feature vector of the text to be identified specifically comprises:

performing word segmentation on the text to be identified to obtain multiple segmented words;

determining the vector corresponding to each of the segmented words according to a preset correspondence between segmented words and vectors;

combining the vectors corresponding to the segmented words into the text feature vector of the text to be identified.

3. The method according to claim 1, characterized in that obtaining the multiple feature subvectors corresponding to the text to be identified according to the preset feature extraction strategy and the text feature vector specifically comprises:

determining a text feature vector with contextual information according to the text feature vector;

performing feature extraction on the text feature vector with contextual information according to the preset feature extraction strategy to obtain multiple feature subvectors.

4. The method according to claim 3, characterized in that the text feature vector includes the vectors of multiple segmented words, and determining the text feature vector with contextual information according to the text feature vector specifically comprises:

determining the position vector corresponding to each of the segmented words in the text to be identified;

adding the vector of each segmented word to its corresponding position vector to obtain the sum vectors of the segmented words;

determining the text feature vector with contextual information according to the sum vectors of the segmented words.

5. The method according to claim 3, characterized in that performing feature extraction on the text feature vector with contextual information according to the preset feature extraction strategy to obtain multiple feature subvectors specifically comprises:

performing feature extraction on the text feature vector with contextual information according to each of multiple sliding-window feature extraction strategies to obtain multiple groups of feature subvectors, each group including multiple feature subvectors;

and determining, according to the multiple feature subvectors and the preset machine learning model, the area information of the specific content contained in the text to be identified specifically comprises: determining multiple groups of area information, one from each group of feature subvectors and the preset machine learning model, and determining the final area information of the specific content contained in the text to be identified according to the multiple groups of area information.

6. The method according to any one of claims 1 to 5, characterized in that the machine learning model is a classification model, and determining, according to the multiple feature subvectors and the preset machine learning model, the area information of the specific content contained in the text to be identified specifically comprises:

determining, according to the classification model, result information indicating whether each of the feature subvectors corresponds to specific content;

determining the area information of the specific content contained in the text to be identified according to the multiple pieces of result information and the feature extraction strategy.

7. The method according to claim 6, characterized in that determining the area information of the specific content contained in the text to be identified according to the multiple pieces of result information and the feature extraction strategy specifically comprises:

matching, according to the feature extraction strategy, the multiple pieces of result information to multiple regions in the text to be identified;

determining the area information of the specific content contained in the text to be identified according to the result information corresponding to each region of the text to be identified.

8. The method according to any one of claims 1 to 5, characterized in that the method further comprises:

determining an initial model for specific content identification, the initial model including a vector obtaining module and the machine learning model, wherein the vector obtaining module is used to execute the steps of obtaining the text feature vector and obtaining the feature subvectors, and the machine learning model is used to determine, from the multiple feature subvectors obtained by the vector obtaining module, the area information of the specific content contained in the text to be identified;

determining a training sample, the training sample including multiple texts and annotation information indicating whether each text contains specific content;

determining, through the initial model for specific content identification, initial results of the area information of the specific content contained in each text;

adjusting the preset parameter values in the initial model for specific content identification according to the initial results determined by that initial model and the annotation information in the training sample, so as to obtain the final specific content identification model.

9. The method according to claim 8, characterized in that the adjustment of the preset parameter values is stopped if the adjustment of the preset parameter values meets either of the following stop conditions:

the number of adjustments to the preset parameter values equals a preset number; or the difference between the currently adjusted preset parameter value and the preset parameter value of the previous adjustment is less than a threshold.

10. A content identification apparatus, characterized by comprising:

a subvector acquiring unit, configured to obtain multiple feature subvectors corresponding to the text to be identified according to a preset feature extraction strategy and the text feature vector;

an area determination unit, configured to determine, according to the multiple feature subvectors and a preset machine learning model, area information of specific content contained in the text to be identified, wherein the preset machine learning model is used to output, from multiple feature subvectors of any text, the area information of the specific content contained in that text, and the preset machine learning model includes a classification model or a regression model.

11. A storage medium, characterized in that the storage medium stores a plurality of instructions, the instructions being suitable for being loaded by a processor to execute the content identification method according to any one of claims 1 to 9.

12. A terminal device, characterized by comprising a processor and a storage medium, the processor being configured to implement each instruction;

the storage medium being configured to store a plurality of instructions, the instructions being loaded by the processor to execute the content identification method according to any one of claims 1 to 9.