CN110162628A - Content identification method and device
- Publication number: CN110162628A (CN201910369604.6A)
- Authority: CN (China)
- Prior art keywords: text, identified, specific content, vector, feature
- Legal status: Granted (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
Abstract
An embodiment of the invention discloses a content identification method and device, applied to the technical field of information processing. In the method of this embodiment, a content identification apparatus obtains multiple feature subvectors corresponding to a text to be identified according to a preset feature extraction strategy and the text feature vector of that text, and then determines, according to the multiple feature subvectors and a preset machine learning model, the area information of the specific content contained in the text to be identified. The method needs no manual effort: it automatically identifies whether the text to be identified contains specific content and, if so, precisely locates the region containing it, so that filtering of the specific content is more accurate.
Description
Technical field
The present invention relates to the technical field of information processing, and in particular to a content identification method and device.
Background art
At present, text pushed by a background server to an application terminal may contain certain specific content, such as advertisement text or vulgar text. Such content needs to be filtered out so that the text displayed on the application terminal does not include it. In this process, the key is to identify the specific content in the text; only then can it be filtered accurately.
One traditional method of identifying specific content is to manually mine keywords of the specific content (such as a registration hotline in an advertisement text) and to set related rules for identifying text. However, this method is labor-intensive, very prone to omissions, inefficient, and prone to misjudgment.
Another method of identifying specific content is to input the text into a machine classification model, which extracts text features and identifies the text according to them. This method avoids manual effort and improves recognition efficiency and accuracy. In some texts, however, the specific content does not fill the whole text; instead, a short piece of specific content, such as promotional copy, is inserted at the top, bottom, or middle of the text. When the entire text is input to the machine classification model, the amount of information in the specific content is much smaller than that of the entire text, so in most cases the specific content contained in the text cannot be identified effectively.
Summary of the invention
Embodiments of the present invention provide a content identification method and device that automatically identify the region of a text to be identified that contains specific content.
An embodiment of the present invention provides a content identification method, comprising:
obtaining a text feature vector of a text to be identified;
obtaining, according to a preset feature extraction strategy and the text feature vector, multiple feature subvectors corresponding to the text to be identified;
determining, according to the multiple feature subvectors and a preset machine learning model, the area information of the specific content contained in the text to be identified, wherein the preset machine learning model is used to output, from the multiple feature subvectors of any text, the area information of the specific content contained in that text, and the preset machine learning model includes a classification model or a regression model.
An embodiment of the present invention provides a content identification apparatus, comprising:
a vector acquiring unit, configured to obtain a text feature vector of a text to be identified;
a subvector acquiring unit, configured to obtain, according to a preset feature extraction strategy and the text feature vector, multiple feature subvectors corresponding to the text to be identified;
an area determination unit, configured to determine, according to the multiple feature subvectors and a preset machine learning model, the area information of the specific content contained in the text to be identified, wherein the preset machine learning model is used to output, from the multiple feature subvectors of any text, the area information of the specific content contained in that text, and the preset machine learning model includes a classification model or a regression model.
A third aspect of the embodiments of the present invention provides a storage medium storing a plurality of instructions, the instructions being suitable for being loaded by a processor to execute the content identification method described in the first aspect of the embodiments of the present invention.
A fourth aspect of the embodiments of the present invention provides a terminal device including a processor and a storage medium, the processor being configured to implement each instruction;
the storage medium being configured to store a plurality of instructions, the instructions being loaded by the processor to execute the content identification method described in the first aspect of the embodiments of the present invention.
As it can be seen that in the method for the present embodiment, content identification apparatus can be according to preset feature extraction strategy and to be identified
The Text eigenvector of text obtains the corresponding multiple feature subvectors of text to be identified, then further according to the multiple feature
Subvector and preset machine learning model determine that the text to be identified includes the area information of specific content.The present embodiment
Method do not need by artificial, whether can automatically identify in text to be identified comprising specific content, if comprising can be with
The region comprising specific content is precisely located out, so that the filtering to specific content is more accurate.
Brief description of the drawings
To explain the technical solutions in the embodiments of the present invention or in the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention; a person of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a schematic diagram of a content identification method provided by an embodiment of the present invention;
Fig. 2 is a flowchart of a content identification method provided by one embodiment of the present invention;
Fig. 3 is a schematic diagram of feature extraction performed on a text feature vector with contextual information in an embodiment of the present invention;
Fig. 4 is a flowchart of a method for training the specific content identification model in an embodiment of the present invention;
Fig. 5 is a schematic diagram of article content displayed by a WeChat terminal in an application embodiment of the present invention;
Fig. 6 is a schematic diagram of training a promotional copy identification model in an application embodiment of the present invention;
Fig. 7 is a structural diagram of the initial model for promotional copy identification determined in an application embodiment of the present invention;
Fig. 8 is a flowchart of the promotional copy identification method provided in an application embodiment of the present invention;
Fig. 9 is a structural diagram of a content identification apparatus provided by an embodiment of the present invention;
Fig. 10 is a structural diagram of a terminal device provided by an embodiment of the present invention.
Detailed description of the embodiments
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings of the embodiments. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art based on these embodiments without creative effort fall within the protection scope of the present invention.
The terms "first", "second", "third", "fourth" and the like (if present) in the description, the claims, and the above drawings are used to distinguish similar objects, not to describe a particular order or sequence. It should be understood that data used in this way are interchangeable where appropriate, so that the embodiments described herein can, for example, be implemented in an order other than the one illustrated or described. In addition, the terms "include" and "have" and any variations of them are intended to cover non-exclusive inclusion: a process, method, system, product, or device that contains a series of steps or units is not necessarily limited to the steps or units explicitly listed, but may include other steps or units that are not explicitly listed or that are inherent to the process, method, product, or device.
An embodiment of the present invention provides a content identification method. As shown in Fig. 1, a content identification apparatus can identify specific content through the following steps:
obtain the text feature vector of the text to be identified; obtain, according to a preset feature extraction strategy and the text feature vector, the multiple feature subvectors corresponding to the text to be identified (n subvectors are illustrated in Fig. 1); and determine, according to the multiple feature subvectors and a preset machine learning model, the area information of the specific content contained in the text to be identified.
Specific content in this embodiment refers to one or more characters with specific features, such as promotional copy, vulgar or offensive text, or specific characters.
In this way, no manual effort is needed: whether the text to be identified contains specific content is identified automatically and, if it does, the region containing the specific content is precisely located, so that filtering of the specific content is more accurate.
An embodiment of the present invention provides a content identification method, mainly performed by a content identification apparatus. Its flowchart is shown in Fig. 2 and includes:
Step 101: obtain the text feature vector of the text to be identified.
It can be understood that the method of this embodiment can be applied in a variety of scenarios. For example, when the background server pushes text to an application terminal, the background server may take the pushed text as the text to be identified and initiate the process of this embodiment; in this case, the background server is the content identification apparatus. Alternatively, after the background server pushes text to the application terminal and the application terminal receives it, the application terminal may take the pushed text as the text to be identified and initiate the process; in this case, the application terminal is the content identification apparatus. The way the process is triggered therefore depends on the application scenario, and the application scenarios of the method are not limited here.
Specifically, the content identification apparatus may obtain the text feature vector of the text to be identified according to a certain strategy. In one case, the apparatus first performs word segmentation on the text to be identified to obtain multiple segmented words, determines the vector corresponding to each segmented word according to a preset correspondence between segmented words and vectors, and finally combines the vectors of the segmented words into the text feature vector of the text to be identified.
The preset correspondence between segmented words and vectors is set in the content identification apparatus in advance and indicates the vector corresponding to each segmented word. It can be obtained from a large number of text samples: each text sample is segmented, the segmented words that occur frequently across the samples (for example, with a frequency above a certain threshold) are counted, a vector is set for each of these words, and the correspondence between each word and its vector is stored in the apparatus.
In this case, when determining the text feature vector, the apparatus compares each segmented word of the text to be identified with the segmented words in the preset correspondence to determine its vector. If a segmented word of the text to be identified matches none of the words in the correspondence, its vector is set to a specific vector, which may be stored in the content identification apparatus in advance.
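The lookup described above can be illustrated with a minimal sketch. It is not the patent's implementation: the table `VOCAB`, the fallback vector `UNKNOWN`, the 4-dimensional vectors, and the whitespace-based `segment` function are all hypothetical stand-ins chosen for illustration.

```python
from typing import Dict, List

# Hypothetical word-to-vector table; a real table would be built from many text samples.
VOCAB: Dict[str, List[float]] = {
    "call": [0.1, 0.3, 0.0, 0.2],
    "now": [0.4, 0.1, 0.2, 0.0],
    "register": [0.0, 0.5, 0.1, 0.3],
}
# Hypothetical specific vector stored in advance for words missing from the table.
UNKNOWN: List[float] = [0.0, 0.0, 0.0, 0.0]


def segment(text: str) -> List[str]:
    """Stand-in word segmentation: lowercase the text and split on whitespace."""
    return text.lower().split()


def text_feature_vector(text: str) -> List[List[float]]:
    """Map each segmented word to its vector, keeping the word order."""
    return [VOCAB.get(word, UNKNOWN) for word in segment(text)]


matrix = text_feature_vector("Call now to REGISTER today")
print(len(matrix))  # one row per segmented word
```

Words such as "to" and "today" are absent from the table, so they receive the fallback vector, while the known words receive their stored vectors.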
It should be noted that, in other cases, the content identification apparatus may also use other encoding methods to obtain the text feature vector of the text to be identified, which are not described here.
Step 102: obtain, according to the preset feature extraction strategy and the text feature vector, the multiple feature subvectors corresponding to the text to be identified. Taken together, the feature subvectors represent the entire text to be identified; that is, each feature subvector represents the text of a certain region of the text to be identified, and adjacent feature subvectors may carry overlapping information.
Specifically, in one case, the content identification apparatus performs feature extraction on the text feature vector directly according to a sliding-window feature extraction strategy to obtain the multiple feature subvectors. If the text feature vector includes the vectors of multiple segmented words, then during feature extraction one vector is extracted from the vectors of all the segmented words in each sliding window, giving one feature subvector. The length of each sliding window and the sliding step may be the same or different.
In another case, the content identification apparatus first determines a text feature vector with contextual information from the text feature vector, and then performs feature extraction on the text feature vector with contextual information according to the preset feature extraction strategy to obtain the multiple feature subvectors.
If the text feature vector includes the vectors of multiple segmented words, the content identification apparatus may determine the text feature vector with contextual information directly from those vectors. Alternatively, it may first determine the position vector of each segmented word in the text to be identified, add each word's vector to its position vector to obtain a sum vector for each word, and then determine the text feature vector with contextual information from the sum vectors. In this way, partial contextual information (the position vectors) is first introduced into the vector of each segmented word, and the sum vectors are then passed through a machine learning model (such as a transform layer) to obtain a text feature vector that carries complete contextual information. Because the segmented words of the text to be identified form a sequence in which each word has a position, the apparatus can determine each word's position vector from its position in the sequence and a preset function for calculating position vectors.
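The addition of word vectors and position vectors can be sketched as follows. The source does not specify the preset function for calculating position vectors; the sinusoidal function used here is only an assumed example, and the names `position_vector` and `add_positions` are hypothetical.

```python
import math
from typing import List


def position_vector(pos: int, dim: int) -> List[float]:
    """Sinusoidal position function (an assumed choice, not taken from the source)."""
    vec = []
    for i in range(dim):
        angle = pos / (10000 ** (2 * (i // 2) / dim))
        vec.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return vec


def add_positions(word_vectors: List[List[float]]) -> List[List[float]]:
    """Add each word vector to the position vector of its place in the sequence."""
    if not word_vectors:
        return []
    dim = len(word_vectors[0])
    return [
        [w + p for w, p in zip(vec, position_vector(pos, dim))]
        for pos, vec in enumerate(word_vectors)
    ]


summed = add_positions([[0.0, 0.0, 0.0, 0.0], [1.0, 1.0, 1.0, 1.0]])
```

The resulting sum vectors would then be fed to the downstream layer that produces the text feature vector with contextual information; that layer is outside the scope of this sketch.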
Further, when performing feature extraction on the text feature vector with contextual information, the content identification apparatus may do so according to the feature extraction strategy of one sliding window.
For example, as shown in Fig. 3, the text feature vector with contextual information includes the vectors of 20 segmented words, namely vectors 1 to 20. Under a feature extraction strategy with a sliding-window length of 3, a sliding step of 3, and a starting position of 0 (△=0), one vector is extracted from vectors 1 to 3 as feature subvector 1, one from vectors 4 to 6 as feature subvector 2, and so on, up to one from vectors 16 to 18 as feature subvector 6. This yields 6 feature subvectors, namely feature subvectors 1 to 6.
It should be noted that the content identification apparatus may apply the feature extraction strategies of one or more sliding windows separately to the text feature vector with contextual information, obtaining several groups of feature subvectors. Each group includes multiple feature subvectors and corresponds to the feature extraction strategy of one sliding window.
For example, as shown in Fig. 3, under a feature extraction strategy with a sliding-window length of 3, a sliding step of 3, and a starting position of 1 (△=1), the apparatus extracts one vector from vectors 2 to 4 as feature subvector 1, one from vectors 5 to 7 as feature subvector 2, and so on, up to one from vectors 17 to 19 as feature subvector 6. This yields 6 feature subvectors.
Under a feature extraction strategy with a sliding-window length of 3, a sliding step of 3, and a starting position of 2 (△=2), the apparatus extracts one vector from vectors 3 to 5 as feature subvector 1, one from vectors 6 to 8 as feature subvector 2, and so on, up to one from vectors 18 to 20 as feature subvector 6. This again yields 6 feature subvectors.
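The three window layouts in the Fig. 3 example can be reproduced with a short sketch. The source says only that one vector is extracted from each window; the element-wise max pooling in `extract_subvectors` is an assumption made for illustration, and both function names are hypothetical.

```python
from typing import List, Tuple


def window_spans(n_vectors: int, length: int, step: int, start: int) -> List[Tuple[int, int]]:
    """Return 1-based inclusive (first, last) vector numbers covered by each full window."""
    spans = []
    first = start + 1
    while first + length - 1 <= n_vectors:
        spans.append((first, first + length - 1))
        first += step
    return spans


def extract_subvectors(vectors: List[List[float]], length: int, step: int, start: int) -> List[List[float]]:
    """Pool each window element-wise with max into one feature subvector (pooling is assumed)."""
    subvectors = []
    for first, last in window_spans(len(vectors), length, step, start):
        window = vectors[first - 1:last]
        subvectors.append([max(column) for column in zip(*window)])
    return subvectors


# Window layouts for 20 word vectors, length 3, step 3, and starting positions 0, 1, 2.
for delta in (0, 1, 2):
    print(delta, window_spans(20, 3, 3, delta))
```

Each starting position yields six windows, matching the ranges given in the text (1 to 3 through 16 to 18, 2 to 4 through 17 to 19, and 3 to 5 through 18 to 20).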
Step 103: determine, according to the multiple feature subvectors and the preset machine learning model, the area information of the specific content contained in the text to be identified. The area information may include the position of the region in the text to be identified (such as its starting and ending positions) and the size of the region.
The operating logic of the machine learning model is set in the content identification apparatus in advance and can be obtained by training on training samples. The model outputs, from the multiple feature subvectors of any text, the area information of the specific content contained in that text, and may be a classification model, a regression model, or the like.
(1) If the machine learning model is a classification model, its input is the multiple feature subvectors and its output is multiple pieces of probability information about containing specific content. If a probability is greater than a certain threshold, the corresponding text is determined to contain specific content; otherwise it is determined not to contain it. Each output probability corresponds to one feature subvector and therefore to a certain part of the text to be identified.
In this case, the content identification apparatus first determines, according to the classification model, result information indicating whether each of the multiple feature subvectors contains specific content, and then determines the area information of the specific content contained in the text to be identified according to the multiple pieces of result information and the feature extraction strategy.
Specifically, the content identification apparatus first maps the multiple pieces of result information to multiple regions of the text to be identified according to the feature extraction strategy, and then determines the area information of the specific content contained in the text to be identified from the result information corresponding to each region.
If the feature extraction strategy is to perform feature extraction on the vectors of all the segmented words of the text to be identified with one kind of sliding window, the apparatus can determine the region of the text corresponding to each sliding window from the window's starting position, length, and sliding step.
For example, in the case of △=2 shown in Fig. 3, when the sliding window covers word vectors 3 to 5, it represents the 3rd to 5th segmented words of the text to be identified. Feature extraction yields feature subvector 1, and step 103 yields result information 1, so result information 1 corresponds to the 3rd to 5th segmented words of the text to be identified. If result information 1 indicates that no specific content is contained, the 3rd to 5th segmented words of the text to be identified do not contain specific content.
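The mapping from classifier outputs back to word regions can be sketched as below. The probabilities and the 0.5 threshold are made-up illustrative values, and `flagged_regions` is a hypothetical helper; only the window arithmetic (starting position, length, step) follows the text.

```python
from typing import List, Tuple


def flagged_regions(
    probabilities: List[float],
    length: int,
    step: int,
    start: int,
    threshold: float = 0.5,
) -> List[Tuple[int, int]]:
    """Return 1-based inclusive word spans whose probability exceeds the threshold."""
    regions = []
    for index, probability in enumerate(probabilities):
        first = start + index * step + 1  # first word covered by this window
        if probability > threshold:
            regions.append((first, first + length - 1))
    return regions


# Illustrative probabilities for the six feature subvectors of the △=2 layout.
probs = [0.1, 0.2, 0.05, 0.1, 0.9, 0.95]
regions = flagged_regions(probs, length=3, step=3, start=2)
print(regions)
```

With these values, the first window (words 3 to 5) is not flagged, while the last two windows (words 15 to 17 and 18 to 20) are reported as containing specific content.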
It should be noted that if, in step 102, the content identification apparatus applied the feature extraction strategies of several sliding windows separately to the text feature vector with contextual information and obtained several groups of feature subvectors, then in step 103 it determines one group of area information for each group of feature subvectors using the preset machine learning model, and then determines the final area information of the specific content contained in the text to be identified from the groups of area information. The regions represented by any two groups of area information may be identical, overlapping, or contiguous, so contiguous regions can be merged when determining the final area information.
For example, the first group of area information indicates regions 1, 2 and 3, and the second group indicates regions 1 and 4, where regions 2 and 4 are contiguous and together form region 5. The final area information is then the specific information of regions 1, 5 and 3.
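One possible way to merge the groups is sketched below, treating each region as a 1-based inclusive (first, last) word span. The span values are invented to mirror the example (a repeated region 1, contiguous regions 2 and 4, and a separate region 3), and `merge_regions` is a hypothetical helper rather than the patent's procedure.

```python
from typing import List, Tuple

Span = Tuple[int, int]


def merge_regions(groups: List[List[Span]]) -> List[Span]:
    """Merge identical, overlapping, or directly adjacent spans from all groups."""
    spans = sorted(span for group in groups for span in group)
    merged: List[Span] = []
    for first, last in spans:
        if merged and first <= merged[-1][1] + 1:
            # Extend the previous span when this one touches or overlaps it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], last))
        else:
            merged.append((first, last))
    return merged


group_1 = [(1, 3), (7, 9), (16, 18)]  # regions 1, 2 and 3
group_2 = [(1, 3), (10, 12)]          # regions 1 and 4
final_regions = merge_regions([group_1, group_2])
print(final_regions)
```

Here spans (7, 9) and (10, 12) are adjacent and collapse into (7, 12), playing the role of region 5 in the example.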
(2) If the preset machine learning model is a regression model, its input is the multiple feature subvectors and its output is multiple groups of position information of specific content. Each group of position information includes the starting and ending positions of the specific content within the text represented by the corresponding feature subvector.
As can be seen, in the method of this embodiment, the content identification apparatus obtains the multiple feature subvectors corresponding to the text to be identified according to the preset feature extraction strategy and the text feature vector of that text, and then determines, according to the multiple feature subvectors and the preset machine learning model, the area information of the specific content contained in the text to be identified. The method needs no manual effort: it automatically identifies whether the text to be identified contains specific content and, if so, precisely locates the region containing it, so that filtering of the specific content is more accurate.
It should be noted that steps 101 to 103 can be implemented by a specific content identification model, which includes a vector acquisition module and the machine learning model described above. In a specific embodiment, the content identification apparatus can train the specific content identification model in the following steps, whose flowchart is shown in Fig. 4:
Step 201: determine the initial model for specific content identification.
It can be understood that, when determining the initial model for specific content identification, the content identification apparatus determines the multilayer structure included in the initial model and the initial values of the preset parameters in each layer structure. The initial model includes the vector acquisition module and the machine learning model: the vector acquisition module performs the steps of obtaining the text feature vector and obtaining the feature subvectors, i.e. steps 101 and 102; the machine learning model determines, from the multiple feature subvectors obtained by the vector acquisition module, the area information of the specific content contained in the text to be identified, i.e. step 103. Specifically, the multilayer structure of the initial model may be any of the following algorithm structures: a convolutional neural network (Convolutional Neural Network, CNN), a fully convolutional network (Fully Convolutional Networks for Semantic Segmentation, FCN), and so on.
The preset parameters are the fixed parameters that each layer structure of the initial model uses during computation and that do not need to be assigned at any time, such as weights and angles.
Step 202: determine training samples. The training samples include multiple texts and, for each text, annotation information indicating whether it contains specific content.
Further, to make the final specific content identification model more accurate, the training samples may also include annotation information on the area information of the specific content in each text, such as its starting and ending positions.
Step 203, the area in each text comprising specific content is determined respectively by the initial model that specific content identifies
The initial results of domain information.
Specifically, the initial model identified by specific content is in determining training sample comprising specific interior in any text
When the area information of appearance, the vector in initial model that can be identified by specific content obtains the text that module obtains any text
Feature vector, and according to preset feature extraction strategy and Text eigenvector, obtain corresponding multiple feature of any text
Vector;Then the area information in any text comprising specific content is determined by machine learning model again.
Step 204: adjust the fixed parameter values in the initial model for specific content identification according to the initial results determined by the initial model in step 203 and the annotation information in the training samples, so as to obtain the final specific content identification model.
Specifically, the content identification apparatus may first calculate a loss function related to the initial model for specific content identification according to the initial results determined in step 203 and the annotation information in the training samples. The loss function indicates the error of the initial model in calculating the area information of the specific content contained in each text in the training samples.
Here, the loss function includes a term representing the difference between the result information, determined by the initial model for specific content identification, on the area information of the specific content contained in each text, and whether each text in the training samples actually contains specific content (obtained from the annotation information in the training samples).
These errors are usually expressed mathematically with a cross-entropy loss function, for example the binary cross-entropy loss. Training the specific content identification model means reducing this error as far as possible: the fixed parameter values of the initial model for specific content identification determined in step 201 are optimized continuously through mathematical optimization techniques such as backpropagation and gradient descent, so that the calculated value of the loss function is minimized.
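As a non-limiting illustration, the binary cross-entropy mentioned above may be sketched as follows. The function name binary_cross_entropy and the clamping constant eps are assumptions made for this example and are not part of the embodiment.

```python
import math

def binary_cross_entropy(probs, labels, eps=1e-12):
    """Mean binary cross-entropy between predicted probabilities and 0/1 labels."""
    total = 0.0
    for p, y in zip(probs, labels):
        # Clamp the probability so that log(0) is never evaluated.
        p = min(max(p, eps), 1.0 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(labels)

# Predictions close to the annotations give a smaller loss than predictions far from them.
good = binary_cross_entropy([0.9, 0.1], [1, 0])
bad = binary_cross_entropy([0.1, 0.9], [1, 0])
```

In this sketch each probability would correspond to one text (or one region of a text) and each label to its annotation information.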
Therefore, after the loss function is calculated, the content identification apparatus adjusts the fixed parameter values in the initial model for specific content identification according to the calculated loss function, so as to obtain the final specific content identification model. Specifically, if the calculated value of the loss function is large, for example greater than a preset value, the fixed parameter values need to be changed, for example by reducing the value of a certain weight, so that the loss function calculated with the adjusted fixed parameter values decreases.
In addition, it should be noted that steps 203 to 204 constitute a single adjustment of the fixed parameter values in the initial model for specific content identification, based on the initial results calculated by that model. In practical applications, steps 203 to 204 need to be executed in a loop until the adjustment of the fixed parameter values meets a certain stop condition.
Therefore, after executing steps 201 to 204 of the above embodiment, the content identification apparatus also needs to judge whether the current adjustment of the fixed parameter values meets a preset stop condition. If it does, the process ends; if not, the apparatus returns to steps 203 to 204 for the initial model for specific content identification with the adjusted fixed parameter values.
The preset stop condition includes, but is not limited to, any one of the following: the difference between the currently adjusted fixed parameter values and the previously adjusted fixed parameter values is less than a threshold, that is, the adjusted fixed parameter values have converged; or the number of adjustments of the fixed parameter values equals a preset number.
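The loop over steps 203 to 204 with the two stop conditions may be sketched as follows. This is a toy example under stated assumptions: a logistic model with a single weight stands in for the initial model, and the names train, lr, tol and max_steps are hypothetical.

```python
import math

def train(samples, lr=0.5, tol=1e-4, max_steps=1000):
    """Gradient descent on a one-weight logistic model.

    samples is a list of (feature, label) pairs. Returns (weight, steps, reason),
    where reason names the stop condition that ended the loop.
    """
    w = 0.0
    for step in range(1, max_steps + 1):
        # Gradient of the mean binary cross-entropy for p = sigmoid(w * x).
        grad = sum((1.0 / (1.0 + math.exp(-w * x)) - y) * x
                   for x, y in samples) / len(samples)
        new_w = w - lr * grad
        # Stop condition 1: the change in the parameter value is below a threshold.
        if abs(new_w - w) < tol:
            return new_w, step, "converged"
        w = new_w
    # Stop condition 2: the number of adjustments reached the preset number.
    return w, max_steps, "max_steps"
```

With contradictory samples such as [(1.0, 1), (1.0, 0)] the gradient is zero at the starting weight, so the first condition fires immediately; with a small max_steps the second condition ends the loop instead.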
The content identification method of the present invention is illustrated below with a specific application example. The method of this embodiment is mainly applied to the article recommendation function of an instant messaging application, specifically the "Take a Look" function of WeChat. The text to be identified is an article pushed by the WeChat backend, the content identification apparatus is specifically a WeChat terminal, and the specific content is specifically advertising text.
When a user operates the WeChat terminal so that it opens the article recommendation function, the WeChat terminal may display information on the articles pushed by the WeChat backend, such as article titles. When the user clicks on a piece of article information, the WeChat terminal displays the content of that article. The quality of the article content largely affects the user's reading experience, yet in many articles sent by the WeChat backend, specific content such as advertising text appears in various regions of the article and interferes with reading.
For example, as shown in Fig. 5, advertising text appears at the top of the displayed article: "Limited-time offer: scan the code for one free trial of the beauty image open class, phone 1234567", and advertising text appears at the bottom of the article: "Companies wishing to sign up, please contact us as soon as possible and leave your phone number, WeChat ID or other contact details". It is therefore critical for the WeChat terminal to effectively identify the regions of the article that contain specific content such as advertising text, so that the advertising text in those regions can be filtered out accurately.
Specifically, the method in this embodiment may include the following two parts:
(1) Referring to Fig. 6, the WeChat terminal or the WeChat backend may train the specific content identification model, which in this embodiment is an advertising text identification model, as follows:
Step 301: determine the initial model for advertising text identification.
For example, Fig. 7 shows the structure of the initial model for advertising text identification, which may include a vector obtaining module and a machine learning model. The vector obtaining module may include a word segmentation module, a transformer layer and a feature extraction layer, and the machine learning model is specifically a classification layer (classifier). The word segmentation module performs word segmentation on the input text, obtains the vector of each segmented word, forms the text feature vector from these vectors, and inputs the text feature vector to the transformer layer. The transformer layer obtains a text feature vector with contextual information from the text feature vector. The feature extraction layer performs feature extraction on the text feature vector with contextual information to obtain multiple feature subvectors. The classification layer classifies each of the feature subvectors, judges whether the text of the corresponding region contains advertising text, and outputs the probability that it does.
Further, the transformer layer may include positional encoding (Positional Encoding), multi-head attention (Multi-Head Attention) and a feed-forward network (Feed-Forward). Positional encoding gives the vector of each segmented word in the text feature vector contextual information, which avoids the problem that a small fragment becomes overly sensitive to local information because a few keywords appear in it and causes the advertising text identification model to produce false positives. In addition, multi-head attention lets the advertising text identification model focus more on the positions where advertising text often appears, such as the top or bottom of the text. In a specific application example, the transformer layer may be stacked several times, or a universal transformer may be used, for better results.
The feature extraction layer may include a convolutional layer (Conv1D), a max-pooling layer (MaxPooling) and an offset max-pooling layer (Offset MaxPooling). With the offset max-pooling layer, the vector obtained after multiple convolution calculations can be cut afterwards, which greatly improves computational efficiency. Moreover, adjacent feature subvectors obtained from the text feature vector overlap, so that the contextual information remains coherent.
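The idea of overlapping sliding windows may be sketched as follows. This is a simplified stand-in rather than the convolution plus offset max-pooling of the embodiment: it only max-pools each window of per-word vectors, and the names window_max_pool, window and stride are assumptions made for illustration.

```python
def window_max_pool(features, window, stride):
    """Cut a sequence of per-word feature vectors into overlapping windows.

    Returns (subvectors, spans): one element-wise max vector per window, and the
    (start, end) word positions that each window covers. Windows overlap
    whenever stride < window.
    """
    subvectors = []
    spans = []
    last_start = max(len(features) - window, 0)
    for start in range(0, last_start + 1, stride):
        chunk = features[start:start + window]
        # Element-wise maximum over the words that fall inside this window.
        subvectors.append([max(column) for column in zip(*chunk)])
        spans.append((start, start + len(chunk)))
    return subvectors, spans

features = [[1, 0], [2, 5], [0, 3], [4, 1]]
subvectors, spans = window_max_pool(features, window=2, stride=1)
# subvectors == [[2, 5], [2, 5], [4, 3]], spans == [(0, 2), (1, 3), (2, 4)]
```

Keeping the spans alongside the subvectors is what later allows a per-window result to be mapped back to a region of the text.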
The classification layer may include a convolutional layer, a feature map (FeatureMap) and a maximum operation (Max), and may also include a dropout layer.
It should be noted that the above advertising text identification model uses a fully convolutional neural network, which yields a variable-length output for a variable-length text input. This has two advantages: (1) training of the advertising text identification model supports variable-length training, which makes full use of the information of the texts in the training samples, and prediction supports text to be identified of arbitrary length; (2) the convolutional network computes quickly and is easy to parallelize, so multiple regions of the whole text to be identified can be processed simultaneously, which is efficient.
It should further be noted that in other embodiments the machine learning model may instead be a regression model, which can directly output the start position and end position of the advertising text contained in the text.
Step 302: determine training samples, where the training samples include multiple texts, specifically more than 100,000 articles, and annotation information indicating whether each article contains advertising text.
In general, the articles in the training samples could first be padded to the same length. However, the length of articles varies widely, typically from a few dozen words to several thousand words. If a fixed length were chosen, short articles would need a large amount of padding, while long articles would have much of their content cut off, resulting in loss of information.
Therefore, this embodiment may adopt variable-length training. Specifically, a portion of the articles in the training samples may first be padded to the same length; the input articles are then sorted by length, and articles of equal length are put into the same batch. At the same time, the minimum and maximum input lengths of articles are limited.
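One possible way to organize such variable-length batches is sketched below. The rounding of lengths up to a multiple of bucket is an assumption introduced for this example (the embodiment only states that part of the articles are padded to the same length), and the names bucket_by_length, bucket and pad are hypothetical.

```python
from collections import defaultdict

def bucket_by_length(articles, min_len, max_len, bucket=4, pad="<pad>"):
    """Group tokenized articles into batches whose members share one length.

    Each article is truncated to max_len, padded up to the next multiple of
    `bucket` (at least min_len, at most max_len), and placed in the batch
    keyed by its final length.
    """
    batches = defaultdict(list)
    for tokens in sorted(articles, key=len):
        tokens = tokens[:max_len]
        # Round the length up to a multiple of `bucket`, then clamp to the limits.
        target = -(-len(tokens) // bucket) * bucket
        target = min(max(target, min_len), max_len)
        batches[target].append(tokens + [pad] * (target - len(tokens)))
    return dict(batches)

batches = bucket_by_length(
    [["a"] * 5, ["b"] * 3, ["c"] * 10, ["d"] * 12], min_len=2, max_len=10)
# Batch keys are the padded lengths: 4, 8 and 10.
```

Short articles therefore receive only a few padding symbols, and long articles are cut only when they exceed the maximum length.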
Step 303: determine, through the initial model for advertising text identification, an initial result of the area information of the advertising text contained in each article.
Step 304: adjust the fixed parameter values in the initial model for advertising text identification according to the initial results determined by the initial model in step 303 and the annotation information in the training samples, so as to obtain the final advertising text identification model.
Specifically, a loss function for the initial model for advertising text identification may first be calculated from the initial results and the annotation information, and the fixed parameter values in the initial model are then adjusted according to the calculated loss function.
By executing steps 303 and 304 multiple times, the final advertising text identification model is obtained.
(2) As shown in Fig. 8, the WeChat terminal may identify advertising text in an article to be identified (that is, the above text to be identified) using the trained advertising text identification model through the following steps:
Step 401: perform text preprocessing on the article to be identified, such as special character handling, English case conversion, and unification of traditional and simplified Chinese characters.
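A minimal sketch of such preprocessing is given below, assuming Unicode NFKC normalization for special characters and a deliberately tiny, hypothetical traditional-to-simplified table T2S; a practical system would need a complete conversion table, which the embodiment does not specify.

```python
import unicodedata

# Hypothetical three-entry table used only for illustration.
T2S = str.maketrans({"廣": "广", "體": "体", "驗": "验"})

def preprocess(text):
    """Normalize special characters, letter case and traditional characters."""
    # NFKC folds full-width letters, digits and spaces into their ASCII forms.
    text = unicodedata.normalize("NFKC", text)
    # Unify English letter case.
    text = text.lower()
    # Unify traditional and simplified characters using the toy table.
    text = text.translate(T2S)
    # Collapse runs of whitespace, then drop remaining non-printable characters.
    text = " ".join(text.split())
    return "".join(ch for ch in text if ch.isprintable())
```

The order of the steps is a design choice of this sketch: whitespace is collapsed before control characters are removed so that words separated only by a line break are not merged.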
Step 402: input the preprocessed text to the trained advertising text identification model; the word segmentation module first segments the input text, determines the vector of each segmented word, and forms the text feature vector.
Specifically, the vector of each segmented word may be determined from the segmented words of the input text and the correspondence between segmented words and vectors preset in the WeChat terminal. If a segmented word matches none of the segmented words in the correspondence, the vector of the special symbol <unk> may be used directly to represent it.
This embodiment supports variable-length text input, with a minimum length of 81 and a maximum length of 1025. Text longer than the maximum limit is truncated; conversely, text that is too short is padded at the end with blank characters. Suppose that step 402 divides the input text into k words w1, w2, ..., wk and maps each segmented word to a 300-dimensional real vector; the input text is then mapped to a k*300 matrix.
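The lookup with the <unk> fallback and the length limits may be sketched as follows. It is assumed here that short texts are padded up to the minimum length; the name tokens_to_matrix, the padding symbol <pad> and the two-dimensional toy table are illustrative only (the embodiment uses 300-dimensional vectors).

```python
def tokens_to_matrix(tokens, table, min_len=81, max_len=1025,
                     pad="<pad>", unk="<unk>"):
    """Map segmented words to vectors, giving one matrix row per word.

    `table` is the preset correspondence between words and vectors; it must
    contain entries for the `pad` and `unk` symbols.
    """
    tokens = tokens[:max_len]                                # truncate long text
    tokens = tokens + [pad] * max(0, min_len - len(tokens))  # pad short text
    # Words missing from the correspondence fall back to the <unk> vector.
    return [table.get(tok, table[unk]) for tok in tokens]

table = {"<pad>": [0.0, 0.0], "<unk>": [9.0, 9.0], "ad": [1.0, 2.0]}
matrix = tokens_to_matrix(["ad", "zzz"], table, min_len=4, max_len=5)
# matrix == [[1.0, 2.0], [9.0, 9.0], [0.0, 0.0], [0.0, 0.0]]
```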
Step 403: the transformer layer in the advertising text identification model determines the text feature vector with contextual information from the text feature vector obtained in step 402, which is specifically a 1*81*300 matrix, assuming k is 81.
Specifically, positional encoding in the transformer layer first determines the position vector of each segmented word, and the vector of each segmented word is added to the corresponding position vector, so that the summed vector carries local contextual information. Multi-head attention and the feed-forward network of the transformer layer then generate, from the summed vectors, the text feature vector with contextual information, whose dimension is 1*81*300 and which can learn the full contextual information.
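The addition of word vectors and position vectors may be sketched as follows. The embodiment does not state how the position vectors are computed; the sinusoidal form used here is an assumption borrowed from the commonly used transformer formulation, and the function names are hypothetical.

```python
import math

def positional_encoding(length, dim):
    """Sinusoidal position vectors: sine on even indices, cosine on odd ones."""
    table = []
    for pos in range(length):
        row = []
        for i in range(dim):
            angle = pos / (10000 ** (2 * (i // 2) / dim))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table

def add_positions(word_vectors):
    """Add each word vector to the position vector of its position."""
    dim = len(word_vectors[0])
    positions = positional_encoding(len(word_vectors), dim)
    return [[v + p for v, p in zip(vec, pos)]
            for vec, pos in zip(word_vectors, positions)]
```

For an input of 81 words with 300 dimensions each, add_positions would return 81 summed vectors of the same dimension, which the attention and feed-forward stages would then consume.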
Step 404: the feature extraction layer in the advertising text identification model performs feature extraction through a 61-dimensional convolutional layer, and then generates multiple sliding-window-based feature subvectors through the offset max-pooling layer. As shown in Fig. 3 above, this is equivalent to cutting the text feature vector with contextual information into pieces and obtaining the feature subvector of each small fragment.
The multiple feature subvectors are then input simultaneously to the classification layer in the advertising text identification model.
Step 405: the classification layer in the advertising text identification model judges, from each of the feature subvectors, whether the corresponding region of the input text contains advertising text. It obtains the probability that each region contains advertising text and outputs a one-dimensional vector, that is, the result information on whether the region corresponding to each feature subvector contains advertising text. The length of the vector is determined by the length of the input text, because 1x1 convolution kernels are used in the last two convolutional layers of the classification layer.
Each value in the variable-length output vector represents the probability that one region of the input text contains advertising text, and the values are arranged in order. To obtain a fixed-length output, the maximum value is finally taken as the output.
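Why a 1x1 convolution produces an output whose length follows the input may be sketched as follows: the same weights are applied to every feature subvector independently, so one probability is produced per region. The single linear layer with a sigmoid and the names region_probabilities and article_score are simplifying assumptions made for this example.

```python
import math

def region_probabilities(subvectors, weights, bias):
    """Apply one shared set of weights to every subvector (a 1x1 convolution)."""
    probs = []
    for vec in subvectors:
        logit = sum(w * x for w, x in zip(weights, vec)) + bias
        probs.append(1.0 / (1.0 + math.exp(-logit)))  # sigmoid
    return probs

def article_score(probs):
    """Fixed-length output: the maximum probability over all regions."""
    return max(probs)

probs = region_probabilities([[0.0, 0.0], [1.0, 1.0]], [1.0, 1.0], 0.0)
score = article_score(probs)
```

The list returned by region_probabilities has one entry per input subvector, whatever the number of subvectors, while article_score reduces it to a single value for the whole article.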
Step 406: the WeChat terminal determines the area information of the advertising text contained in the article to be identified, such as the top or bottom, according to the result information obtained in step 405 and the sliding-window information used by the feature extraction layer of the advertising text identification model when extracting the feature subvectors, and filters out the advertising text in the article to be identified.
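Mapping the per-window results back to regions of the article and filtering those regions may be sketched as follows. The threshold of 0.5, the merging of overlapping windows, and the names ad_spans and filter_tokens are assumptions made for illustration; the spans are assumed to be sorted by start position.

```python
def ad_spans(probs, spans, threshold=0.5):
    """Merge the word spans of all windows whose probability passes the threshold."""
    merged = []
    for p, (start, end) in zip(probs, spans):
        if p < threshold:
            continue
        if merged and start <= merged[-1][1]:
            # This window overlaps or touches the previous flagged region.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged

def filter_tokens(tokens, regions):
    """Drop every word whose position falls inside a flagged region."""
    return [tok for i, tok in enumerate(tokens)
            if not any(s <= i < e for s, e in regions)]

regions = ad_spans([0.9, 0.8, 0.1, 0.2, 0.7],
                   [(0, 4), (2, 6), (4, 8), (6, 10), (8, 12)])
kept = filter_tokens(list(range(12)), regions)
# regions == [(0, 6), (8, 12)], kept == [6, 7]
```

In this toy run the flagged regions sit at the top and bottom of a 12-word article, and only the two words in between are kept.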
It should be noted that an article that is very long cannot be input directly into the advertising text identification model for processing because of the limited memory resources of the WeChat terminal. The article must first be divided into several segments, and the text of each segment is input to the advertising text identification model to obtain the area information of the advertising text contained in that segment. The area information of the advertising text in all segments is then combined so that the advertising text in the whole article can be filtered accurately.
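The segmentation of a long article and the combination of per-segment results may be sketched as follows. Fixed-size, non-overlapping segments and the names split_segments and merge_segment_regions are assumptions made for this example; the embodiment does not specify how the segments are chosen.

```python
def split_segments(n_words, seg_len):
    """Return (start, end) word offsets of consecutive segments of the article."""
    return [(s, min(s + seg_len, n_words)) for s in range(0, n_words, seg_len)]

def merge_segment_regions(segments, regions_per_segment):
    """Shift each segment's local regions into whole-article word positions."""
    merged = []
    for (offset, _), regions in zip(segments, regions_per_segment):
        merged.extend((offset + s, offset + e) for s, e in regions)
    return merged

segments = split_segments(10, 4)
# Hypothetical per-segment results: a region in the first and the last segment.
article_regions = merge_segment_regions(segments, [[(0, 2)], [], [(1, 2)]])
# segments == [(0, 4), (4, 8), (8, 10)], article_regions == [(0, 2), (9, 10)]
```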
An embodiment of the present invention further provides a content identification apparatus, whose structural diagram is shown in Fig. 9. The apparatus may specifically include:
A vector acquiring unit 10, configured to obtain the text feature vector of the text to be identified.
The vector acquiring unit 10 is specifically configured to perform word segmentation on the text to be identified to obtain multiple segmented words; determine the vector corresponding to each of the segmented words according to a preset correspondence between segmented words and vectors; and combine the vectors corresponding to the segmented words into the text feature vector of the text to be identified.
A subvector acquiring unit 11, configured to obtain the multiple feature subvectors corresponding to the text to be identified according to the preset feature extraction strategy and the text feature vector obtained by the vector acquiring unit 10.
The subvector acquiring unit 11 is specifically configured to determine the text feature vector with contextual information from the text feature vector, and to perform feature extraction on the text feature vector with contextual information according to the preset feature extraction strategy to obtain the multiple feature subvectors.
When determining the text feature vector with contextual information, the subvector acquiring unit 11 is specifically configured to determine the position vector corresponding to each of the segmented words in the text to be identified; add the vector of each segmented word to the corresponding position vector to obtain the summed vectors of the segmented words; and determine the text feature vector with contextual information from the summed vectors of the segmented words.
An area determination unit 12, configured to determine the area information of the specific content contained in the text to be identified according to the multiple feature subvectors obtained by the subvector acquiring unit 11 and a preset machine learning model. The preset machine learning model is used to output, from the multiple feature subvectors of any text, the area information of the specific content contained in that text, and includes a classification model or a regression model.
The area determination unit 12 is specifically configured, if the preset machine learning model is a classification model, to determine according to the classification model the result information on whether each of the feature subvectors corresponds to specific content, and to determine the area information of the specific content contained in the text to be identified according to the multiple pieces of result information and the feature extraction strategy. The area determination unit 12 may map the multiple pieces of result information to multiple regions of the text to be identified according to the feature extraction strategy, and determine the area information of the specific content contained in the text to be identified according to the result information corresponding to each region.
It should be noted that, if the subvector acquiring unit 11 performs feature extraction on the text feature vector with contextual information according to feature extraction strategies with multiple kinds of sliding windows, obtaining multiple groups of feature subvectors where each group includes multiple feature subvectors, then the area determination unit 12 is specifically configured to determine multiple groups of area information according to each group of feature subvectors and the preset machine learning model, and to determine the final area information of the specific content contained in the text to be identified according to the multiple groups of area information.
Further, the content identification apparatus of this embodiment may also include a training unit 13, configured to: determine the initial model for specific content identification, which includes a vector obtaining module and the machine learning model, where the vector obtaining module executes the steps of obtaining the text feature vector and obtaining the feature subvectors, and the machine learning model determines the area information of the specific content contained in the text to be identified according to the multiple feature subvectors obtained by the vector obtaining module; determine training samples, which include multiple texts and annotation information indicating whether each text contains specific content; determine, through the initial model for specific content identification, the initial result of the area information of the specific content contained in each text; and adjust the fixed parameter values in the initial model for specific content identification according to the initial results determined by the initial model and the annotation information in the training samples, so as to obtain the final specific content identification model. After the training unit 13 obtains the specific content identification model, the vector acquiring unit 10, the subvector acquiring unit 11 and the area determination unit 12 may determine the result information using the specific content identification model obtained by the training unit 13.
Further, the training unit 13 is also configured to stop adjusting the fixed parameter values if the adjustment of the fixed parameter values meets any of the following stop conditions: the number of adjustments of the fixed parameter values equals a preset number; or the difference between the currently adjusted fixed parameter values and the previously adjusted fixed parameter values is less than a threshold.
It can be seen that in the content identification apparatus of this embodiment, the subvector acquiring unit 11 obtains the multiple feature subvectors corresponding to the text to be identified according to the preset feature extraction strategy and the text feature vector of the text to be identified, and the area determination unit 12 then determines the area information of the specific content contained in the text to be identified according to the multiple feature subvectors and the preset machine learning model. In this way, whether the text to be identified contains specific content can be identified automatically without manual effort, and if it does, the region containing the specific content can be located precisely, which makes the filtering of the specific content more accurate.
An embodiment of the present invention further provides a terminal device, whose structural diagram is shown in Fig. 10. The terminal device may vary considerably depending on configuration or performance, and may include one or more central processing units (CPU) 20 (for example, one or more processors), a memory 21, and one or more storage media 22 (for example, one or more mass storage devices) storing application programs 221 or data 222. The memory 21 and the storage medium 22 may provide transient or persistent storage. The program stored in the storage medium 22 may include one or more modules (not shown in the figure), and each module may include a series of instruction operations on the terminal device. Further, the central processing unit 20 may be configured to communicate with the storage medium 22 and execute, on the terminal device, the series of instruction operations in the storage medium 22.
Specifically, the application programs 221 stored in the storage medium 22 include an application program for content identification, which may include the vector acquiring unit 10, the subvector acquiring unit 11, the area determination unit 12 and the training unit 13 of the above content identification apparatus; details are not repeated here. Further, the central processing unit 20 may be configured to communicate with the storage medium 22 and execute, on the terminal device, the series of operations corresponding to the content identification application program stored in the storage medium 22.
The terminal device may also include one or more power supplies 23, one or more wired or wireless network interfaces 24, one or more input/output interfaces 25, and/or one or more operating systems 223, such as Windows ServerTM, Mac OS XTM, UnixTM, LinuxTM, FreeBSDTM and so on.
The steps performed by the content identification apparatus in the above method embodiments may be based on the structure of the terminal device shown in Fig. 10.
An embodiment of the present invention further provides a storage medium storing a plurality of instructions, where the instructions are suitable for being loaded by a processor to execute the content identification method performed by the above content identification apparatus.
An embodiment of the present invention further provides a terminal device including a processor and a storage medium. The processor is configured to implement each instruction; the storage medium is configured to store a plurality of instructions, and the instructions are loaded by the processor to execute the content identification method performed by the above content identification apparatus.
Those of ordinary skill in the art will appreciate that all or part of the steps of the methods in the above embodiments may be completed by a program instructing the relevant hardware. The program may be stored in a computer-readable storage medium, and the storage medium may include read-only memory (ROM), random access memory (RAM), a magnetic disk, an optical disc, and the like.
The content identification method and apparatus provided by the embodiments of the present invention have been described in detail above. Specific examples are used herein to explain the principle and implementation of the present invention, and the description of the above embodiments is only intended to help understand the method of the present invention and its core idea. Meanwhile, those skilled in the art may, following the idea of the present invention, make changes to the specific implementation and scope of application. In conclusion, the content of this specification should not be construed as limiting the present invention.
Claims (12)
1. A content identification method, characterized by comprising:
obtaining a text feature vector of a text to be identified;
obtaining multiple feature subvectors corresponding to the text to be identified according to a preset feature extraction strategy and the text feature vector;
determining area information of specific content contained in the text to be identified according to the multiple feature subvectors and a preset machine learning model, wherein the preset machine learning model is used to output, from multiple feature subvectors of any text, the area information of specific content contained in that text, and the preset machine learning model includes a classification model or a regression model.
2. The method according to claim 1, characterized in that obtaining the text feature vector of the text to be identified specifically comprises:
performing word segmentation on the text to be identified to obtain multiple segmented words;
determining the vector corresponding to each of the segmented words according to a preset correspondence between segmented words and vectors;
combining the vectors corresponding to the segmented words into the text feature vector of the text to be identified.
3. The method according to claim 1, characterized in that obtaining the multiple feature subvectors corresponding to the text to be identified according to the preset feature extraction strategy and the text feature vector specifically comprises:
determining a text feature vector with contextual information from the text feature vector;
performing feature extraction on the text feature vector with contextual information according to the preset feature extraction strategy to obtain the multiple feature subvectors.
4. The method according to claim 3, characterized in that the text feature vector includes the vectors of multiple segmented words, and determining the text feature vector with contextual information from the text feature vector specifically comprises:
determining the position vector corresponding to each of the segmented words in the text to be identified;
adding the vector of each segmented word to the corresponding position vector to obtain the summed vectors of the segmented words;
determining the text feature vector with contextual information from the summed vectors of the segmented words.
5. The method according to claim 3, characterized in that performing feature extraction on the text feature vector with contextual information according to the preset feature extraction strategy to obtain the multiple feature subvectors specifically comprises:
performing feature extraction on the text feature vector with contextual information according to feature extraction strategies with multiple kinds of sliding windows, respectively, to obtain multiple groups of feature subvectors, each group including multiple feature subvectors;
and determining the area information of the specific content contained in the text to be identified according to the multiple feature subvectors and the preset machine learning model specifically comprises: determining multiple groups of area information according to each group of feature subvectors and the preset machine learning model, and determining the final area information of the specific content contained in the text to be identified according to the multiple groups of area information.
6. The method according to any one of claims 1 to 5, characterized in that the machine learning model is a classification model, and determining the area information of the specific content contained in the text to be identified according to the multiple feature subvectors and the preset machine learning model specifically comprises:
determining, according to the classification model, result information on whether each of the feature subvectors corresponds to specific content;
determining the area information of the specific content contained in the text to be identified according to the multiple pieces of result information and the feature extraction strategy.
7. The method according to claim 6, characterized in that determining the area information of the specific content contained in the text to be identified according to the multiple pieces of result information and the feature extraction strategy specifically comprises:
mapping the multiple pieces of result information to multiple regions of the text to be identified according to the feature extraction strategy;
determining the area information of the specific content contained in the text to be identified according to the result information corresponding to each region of the text to be identified.
8. The method according to any one of claims 1 to 5, characterized in that the method further comprises:
determining an initial model for specific content identification, wherein the initial model for specific content identification includes a vector obtaining module and the machine learning model, the vector obtaining module is used to execute the steps of obtaining the text feature vector and obtaining the feature subvectors, and the machine learning model is used to determine the area information of the specific content contained in the text to be identified according to the multiple feature subvectors obtained by the vector obtaining module;
determining training samples, wherein the training samples include multiple texts and annotation information indicating whether each text contains specific content;
determining, through the initial model for specific content identification, an initial result of the area information of the specific content contained in each text;
adjusting the fixed parameter values in the initial model for specific content identification according to the initial results determined by the initial model for specific content identification and the annotation information in the training samples, so as to obtain the final specific content identification model.
9. The method according to claim 8, wherein the adjustment of the preset parameter values is stopped if the adjustment satisfies either of the following stop conditions:
the number of adjustments made to the preset parameter values equals a preset number; or the difference between the currently adjusted preset parameter values and the previously adjusted preset parameter values is less than a threshold.
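The parameter adjustment of claim 8 and the two stop conditions of claim 9 can be sketched as a simple loop. Everything concrete here is an assumption made for illustration: the one-parameter linear model, the squared-error gradient step, the learning rate, and the names `adjust_until_stopped` and `gradient_step` are all hypothetical, since the claims do not specify how the parameter values are adjusted.

```python
from typing import Callable, List, Tuple

Params = List[float]


def adjust_until_stopped(
    params: Params,
    adjust_step: Callable[[Params], Params],
    max_adjustments: int,
    threshold: float,
) -> Tuple[Params, int, str]:
    """Adjust parameters until either stop condition from claim 9 is met."""
    count = 0
    while count < max_adjustments:  # stop condition 1: preset number of adjustments
        new_params = adjust_step(params)
        count += 1
        change = max(abs(n - p) for n, p in zip(new_params, params))
        params = new_params
        if change < threshold:  # stop condition 2: change below a threshold
            return params, count, "small_change"
    return params, count, "max_adjustments"


# Toy training samples: (feature score, label); label 1 means "includes specific content".
samples = [(0.0, 0.0), (1.0, 1.0)]


def gradient_step(params: Params, lr: float = 0.5) -> Params:
    """One ASSUMED gradient-descent step on mean squared error for y = w * x."""
    w = params[0]
    grad = sum(2 * (w * x - y) * x for x, y in samples) / len(samples)
    return [w - lr * grad]


print(adjust_until_stopped([0.0], gradient_step, max_adjustments=100, threshold=0.1))
print(adjust_until_stopped([0.0], gradient_step, max_adjustments=3, threshold=0.1))
```

In the first call the loop ends because successive parameter values differ by less than the threshold; in the second it ends because the preset number of adjustments is reached first.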
10. A content identification apparatus, comprising:
a vector acquisition unit, configured to obtain a text feature vector of a text to be identified;
a sub-vector acquisition unit, configured to obtain, according to a preset feature extraction strategy and the text feature vector, a plurality of feature sub-vectors corresponding to the text to be identified; and
a region determination unit, configured to determine, according to the plurality of feature sub-vectors and a preset machine learning model, region information indicating that the text to be identified includes specific content, wherein the preset machine learning model is configured to output, according to a plurality of feature sub-vectors of any text, region information indicating that the text includes specific content, and the preset machine learning model comprises a classification model or a regression model.
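The three units of claim 10 compose into a straightforward pipeline, sketched below. The class name `ContentIdentificationApparatus` and the three toy stand-ins are hypothetical: a per-character digit indicator replaces a real text feature vector, fixed-size chunking replaces the feature extraction strategy, and a threshold rule replaces the classification or regression model named in the claim.

```python
from dataclasses import dataclass
from typing import Callable, List

Vector = List[float]


@dataclass
class ContentIdentificationApparatus:
    """Wires the three claimed units together; each unit is a pluggable callable."""

    vector_unit: Callable[[str], List[Vector]]
    subvector_unit: Callable[[List[Vector]], List[List[Vector]]]
    region_unit: Callable[[List[List[Vector]]], List[int]]

    def identify(self, text: str) -> List[int]:
        text_vector = self.vector_unit(text)            # vector acquisition unit
        sub_vectors = self.subvector_unit(text_vector)  # sub-vector acquisition unit
        return self.region_unit(sub_vectors)            # region determination unit


def toy_vector_unit(text: str) -> List[Vector]:
    """ASSUMED feature: 1.0 for a digit character, 0.0 otherwise."""
    return [[1.0 if ch.isdigit() else 0.0] for ch in text]


def toy_subvector_unit(text_vector: List[Vector], size: int = 4) -> List[List[Vector]]:
    """ASSUMED strategy: non-overlapping chunks of `size` character vectors."""
    return [text_vector[i:i + size] for i in range(0, len(text_vector), size)]


def toy_region_unit(sub_vectors: List[List[Vector]]) -> List[int]:
    """ASSUMED classifier: flag a chunk when at least half of it is digits."""
    flagged = []
    for index, chunk in enumerate(sub_vectors):
        score = sum(vec[0] for vec in chunk) / len(chunk)
        if score >= 0.5:
            flagged.append(index)
    return flagged


apparatus = ContentIdentificationApparatus(
    toy_vector_unit, toy_subvector_unit, toy_region_unit
)
# Returns the indices of the 4-character chunks flagged as specific content.
print(apparatus.identify("hi 12345678 bye!"))
```

Keeping the units as separate callables mirrors the claim structure, so any one of them (for example, the model inside the region determination unit) can be swapped without touching the others.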
11. A storage medium, wherein the storage medium stores a plurality of instructions, the instructions being adapted to be loaded by a processor to execute the content identification method according to any one of claims 1 to 9.
12. A terminal device, comprising a processor and a storage medium, wherein the processor is configured to implement each instruction; and
the storage medium is configured to store a plurality of instructions, the instructions being loaded by the processor to execute the content identification method according to any one of claims 1 to 9.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910369604.6A CN110162628B (en) | 2019-05-06 | 2019-05-06 | Content identification method and device |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910369604.6A CN110162628B (en) | 2019-05-06 | 2019-05-06 | Content identification method and device |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110162628A (en) | 2019-08-23 |
CN110162628B (en) | 2023-11-10 |
Family
ID=67633440
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910369604.6A Active CN110162628B (en) | 2019-05-06 | 2019-05-06 | Content identification method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110162628B (en) |
Cited By (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110490199A (en) * | 2019-08-26 | 2019-11-22 | 北京香侬慧语科技有限责任公司 | A kind of method, apparatus of text identification, storage medium and electronic equipment |
CN110750677A (en) * | 2019-10-12 | 2020-02-04 | 腾讯科技(深圳)有限公司 | Audio and video recognition method and system based on artificial intelligence, storage medium and server |
CN110889717A (en) * | 2019-11-14 | 2020-03-17 | 腾讯科技(深圳)有限公司 | Method and device for filtering advertisement content in text, electronic equipment and storage medium |
CN111126410A (en) * | 2019-12-31 | 2020-05-08 | 讯飞智元信息科技有限公司 | Character recognition method, device, equipment and readable storage medium |
CN112541373A (en) * | 2019-09-20 | 2021-03-23 | 北京国双科技有限公司 | Judicial text recognition method, text recognition model obtaining method and related equipment |
CN113743121A (en) * | 2021-09-08 | 2021-12-03 | 平安科技(深圳)有限公司 | Long text entity relation extraction method and device, computer equipment and storage medium |
CN114792423A (en) * | 2022-05-20 | 2022-07-26 | 北京百度网讯科技有限公司 | Document image processing method and device and storage medium |
CN116701640A (en) * | 2023-08-04 | 2023-09-05 | 腾讯科技(深圳)有限公司 | Watermark identification model generation method, watermark identification device and electronic equipment |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105426356A (en) * | 2015-10-29 | 2016-03-23 | 杭州九言科技股份有限公司 | Target information identification method and apparatus |
US20180121801A1 (en) * | 2016-10-28 | 2018-05-03 | Beijing Baidu Netcom Science And Technology Co., Ltd. | Method and device for classifying questions based on artificial intelligence |
Patent Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105426356A (en) * | 2015-10-29 | 2016-03-23 | 杭州九言科技股份有限公司 | Target information identification method and apparatus |
US20180121801A1 (en) * | 2016-10-28 | 2018-05-03 | Beijing Baidu Netcom Science And Technology Co., Ltd. | Method and device for classifying questions based on artificial intelligence |
Cited By (15)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110490199A (en) * | 2019-08-26 | 2019-11-22 | 北京香侬慧语科技有限责任公司 | A kind of method, apparatus of text identification, storage medium and electronic equipment |
CN112541373B (en) * | 2019-09-20 | 2023-10-31 | 北京国双科技有限公司 | Judicial text recognition method, text recognition model obtaining method and related equipment |
CN112541373A (en) * | 2019-09-20 | 2021-03-23 | 北京国双科技有限公司 | Judicial text recognition method, text recognition model obtaining method and related equipment |
WO2021051957A1 (en) * | 2019-09-20 | 2021-03-25 | 北京国双科技有限公司 | Judicial text recognition method, text recognition model obtaining method, and related device |
CN110750677A (en) * | 2019-10-12 | 2020-02-04 | 腾讯科技(深圳)有限公司 | Audio and video recognition method and system based on artificial intelligence, storage medium and server |
CN110750677B (en) * | 2019-10-12 | 2023-11-14 | 腾讯科技(深圳)有限公司 | Audio and video identification method and system based on artificial intelligence, storage medium and server |
CN110889717A (en) * | 2019-11-14 | 2020-03-17 | 腾讯科技(深圳)有限公司 | Method and device for filtering advertisement content in text, electronic equipment and storage medium |
CN111126410A (en) * | 2019-12-31 | 2020-05-08 | 讯飞智元信息科技有限公司 | Character recognition method, device, equipment and readable storage medium |
CN111126410B (en) * | 2019-12-31 | 2022-11-18 | 讯飞智元信息科技有限公司 | Character recognition method, device, equipment and readable storage medium |
CN113743121A (en) * | 2021-09-08 | 2021-12-03 | 平安科技(深圳)有限公司 | Long text entity relation extraction method and device, computer equipment and storage medium |
CN113743121B (en) * | 2021-09-08 | 2023-11-21 | 平安科技(深圳)有限公司 | Long text entity relation extraction method, device, computer equipment and storage medium |
CN114792423B (en) * | 2022-05-20 | 2022-12-09 | 北京百度网讯科技有限公司 | Document image processing method and device and storage medium |
CN114792423A (en) * | 2022-05-20 | 2022-07-26 | 北京百度网讯科技有限公司 | Document image processing method and device and storage medium |
CN116701640A (en) * | 2023-08-04 | 2023-09-05 | 腾讯科技(深圳)有限公司 | Watermark identification model generation method, watermark identification device and electronic equipment |
CN116701640B (en) * | 2023-08-04 | 2024-01-26 | 腾讯科技(深圳)有限公司 | Watermark identification model generation method, watermark identification device and electronic equipment |
Also Published As
Publication number | Publication date |
---|---|
CN110162628B (en) | 2023-11-10 |
Similar Documents
Publication | Title | Publication Date |
---|---|---|
CN110162628A (en) | A kind of content identification method and device | |
CN108363790A (en) | For the method, apparatus, equipment and storage medium to being assessed | |
CN112270196B (en) | Entity relationship identification method and device and electronic equipment | |
CN111339305B (en) | Text classification method and device, electronic equipment and storage medium | |
CN110990543A (en) | Intelligent conversation generation method and device, computer equipment and computer storage medium | |
CN108573047A (en) | A kind of training method and device of Module of Automatic Chinese Documents Classification | |
CN111666761B (en) | Fine-grained emotion analysis model training method and device | |
CN108959474B (en) | Entity relation extraction method | |
CN114490950B (en) | Method and storage medium for training encoder model, and method and system for predicting similarity | |
CN108416032A (en) | A kind of file classification method, device and storage medium | |
CN112269868A (en) | Use method of machine reading understanding model based on multi-task joint training | |
CN114757176A (en) | Method for obtaining target intention recognition model and intention recognition method | |
CN110262942A (en) | A kind of log analysis method and device | |
CN109271513B (en) | Text classification method, computer readable storage medium and system | |
CN110969005B (en) | Method and device for determining similarity between entity corpora | |
CN113486174B (en) | Model training, reading understanding method and device, electronic equipment and storage medium | |
CN116956289B (en) | Method for dynamically adjusting potential blacklist and blacklist | |
CN116522912B (en) | Training method, device, medium and equipment for package design language model | |
CN111639189B (en) | Text graph construction method based on text content features | |
CN112749530B (en) | Text encoding method, apparatus, device and computer readable storage medium | |
CN113836892A (en) | Sample size data extraction method and device, electronic equipment and storage medium | |
CN113609390A (en) | Information analysis method and device, electronic equipment and computer readable storage medium | |
CN115905500B (en) | Question-answer pair data generation method and device | |
CN117332090B (en) | Sensitive information identification method, device, equipment and storage medium | |
CN114925166A (en) | Method and device for training choice question solving model |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |