CN113971402A - Content identification method, device, medium and electronic equipment


Info

Publication number: CN113971402A
Application number: CN202111235927.XA
Authority: CN (China)
Prior art keywords: content, identification, classification, identified, recognition
Legal status: Pending (the status listed is an assumption, not a legal conclusion)
Other languages: Chinese (zh)
Inventors: 陈维识, 洪进栋
Original and current assignee: Beijing ByteDance Network Technology Co Ltd
Application filed by Beijing ByteDance Network Technology Co Ltd; priority to CN202111235927.XA

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/289: Phrasal analysis, e.g. finite state techniques or chunking (under G06F40/00 Handling natural language data, G06F40/20 Natural language analysis, G06F40/279 Recognition of textual entities)
    • G06F18/24: Classification techniques (under G06F18/00 Pattern recognition, G06F18/20 Analysing)
    • G06F18/253: Fusion techniques of extracted features (under G06F18/25 Fusion techniques)
    • G06F40/258: Heading extraction; Automatic titling; Numbering (under G06F40/20 Natural language analysis)


Abstract

The present disclosure relates to a content recognition method, apparatus, medium, and electronic device. The method comprises: receiving content to be identified; obtaining recognition results of a plurality of content identification pairs according to the plurality of content identification pairs containing the content to be identified and a content recognition model, wherein each pair further comprises one candidate content from a preset set, the model obtains dimensional features of the content to be identified and the candidate content in a plurality of dimensions and determines a recognition result based on those features, and the dimensions represent the various types of components in the content to be identified; and determining a target recognition result for the content to be identified according to the plurality of recognition results. Recognition across features of different dimensions is thereby realized, the accuracy of the recognition result is improved, manual workload is saved, and use by the user is facilitated.

Description

Content identification method, device, medium and electronic equipment
Technical Field
The present disclosure relates to the field of computer technologies, and in particular, to a content identification method, apparatus, medium, and electronic device.
Background
The development of computer technology has drawn ever more attention to content such as news and information, and the rise of content sharing platforms has made the users who publish content more diverse. A platform typically carries a large amount of UGC (User-Generated Content) and PGC (Professionally-Generated Content), and to ensure the legitimacy of published content, human reviewers are usually required to check whether content to be published has been copied from already-published content.
However, this process not only requires substantial manpower, but is also prone to misjudgment or missed judgment owing to differing subjective human understanding, and it copes poorly with reviewing large volumes of content.
Disclosure of Invention
This summary is provided to introduce a selection of concepts in a simplified form that are further described below in the detailed description. This summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter.
In a first aspect, the present disclosure provides a content identification method, including:
receiving content to be identified;
obtaining recognition results of a plurality of content identification pairs according to the plurality of content identification pairs containing the content to be identified and a content recognition model, wherein each content identification pair further comprises one candidate content from a preset set, the content recognition model is configured to obtain dimensional features of the content to be identified and the candidate content in a plurality of dimensions and to determine the recognition result based on the dimensional features, and the dimensions represent the various types of components in the content to be identified;
and determining a target recognition result of the content to be recognized according to the plurality of recognition results.
In a second aspect, the present disclosure provides a content recognition apparatus, the apparatus comprising:
the receiving module is used for receiving the content to be identified;
the processing module is configured to obtain recognition results of the plurality of content identification pairs according to the plurality of content identification pairs containing the content to be identified and a content recognition model, wherein each content identification pair further comprises one candidate content from a preset set, the content recognition model is configured to obtain dimensional features of the content to be identified and the candidate content in a plurality of dimensions and to determine the recognition result based on the dimensional features, and the dimensions represent the various types of components in the content to be identified;
and the first determining module is used for determining a target recognition result of the content to be recognized according to a plurality of recognition results.
In a third aspect, the present disclosure provides a computer readable medium having stored thereon a computer program which, when executed by a processing apparatus, performs the steps of the method of the first aspect.
In a fourth aspect, the present disclosure provides an electronic device comprising:
a storage device having a computer program stored thereon;
processing means for executing the computer program in the storage means to carry out the steps of the method of the first aspect.
In this technical scheme, the content to be identified and existing content form a content identification pair, so that the degree of association between them is determined based on the pair; the recognition result for the content to be identified can then be determined comprehensively from the results of a plurality of such pairs, realizing automatic review of the content to be identified. Accordingly, when the content to be identified is recognized and reviewed, the different types of components within it can each be recognized, so that similar or duplicated passages between different components can be detected. This realizes recognition across features of different dimensions, improves the accuracy of the recognition result, effectively avoids misjudgment and missed judgment of the content to be identified, saves manual workload, provides effective data support for guaranteeing the publication of original content, and is convenient for users.
Additional features and advantages of the disclosure will be set forth in the detailed description which follows.
Drawings
The above and other features, advantages and aspects of various embodiments of the present disclosure will become more apparent by referring to the following detailed description when taken in conjunction with the accompanying drawings. Throughout the drawings, the same or similar reference numbers refer to the same or similar elements. It should be understood that the drawings are schematic and that elements and features are not necessarily drawn to scale. In the drawings:
FIG. 1 is a flow diagram of a content identification method provided according to an embodiment of the present disclosure;
FIG. 2 is a flow diagram of an exemplary implementation of obtaining recognition results for a plurality of content recognition pairs based on a plurality of content recognition pairs including content to be recognized and a content recognition model;
FIG. 3 is a schematic diagram of a headline feature extraction submodel provided in accordance with an embodiment of the present disclosure;
FIG. 4 is a schematic diagram of an image feature extraction submodel provided in accordance with an embodiment of the present disclosure;
FIG. 5 is a schematic diagram of a structure of a content recognition model provided according to an embodiment of the present disclosure;
FIG. 6 is a block diagram of a content recognition apparatus provided according to an embodiment of the present disclosure;
FIG. 7 illustrates a schematic diagram of an electronic device suitable for use in implementing embodiments of the present disclosure.
Detailed Description
Embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While certain embodiments of the present disclosure are shown in the drawings, it is to be understood that the present disclosure may be embodied in various forms and should not be construed as limited to the embodiments set forth herein; these embodiments are rather provided for a more thorough and complete understanding of the present disclosure. It should be understood that the drawings and embodiments of the disclosure are for illustration purposes only and are not intended to limit the scope of the disclosure.
It should be understood that the various steps recited in the method embodiments of the present disclosure may be performed in a different order, and/or performed in parallel. Moreover, method embodiments may include additional steps and/or omit performing the illustrated steps. The scope of the present disclosure is not limited in this respect.
The term "include" and variations thereof as used herein are open-ended, i.e., "including but not limited to". The term "based on" is "based, at least in part, on". The term "one embodiment" means "at least one embodiment"; the term "another embodiment" means "at least one additional embodiment"; the term "some embodiments" means "at least some embodiments". Relevant definitions for other terms will be given in the following description.
It should be noted that the terms "first", "second", and the like in the present disclosure are only used for distinguishing different devices, modules or units, and are not used for limiting the order or interdependence relationship of the functions performed by the devices, modules or units.
It is noted that the modifiers "a", "an", and "the" in this disclosure are intended to be illustrative rather than limiting; those skilled in the art will understand them to mean "one or more" unless the context clearly dictates otherwise.
The names of messages or information exchanged between devices in the embodiments of the present disclosure are for illustrative purposes only, and are not intended to limit the scope of the messages or information.
In practical application scenarios, the applicant's research found that plagiarism in content writing can take forms such as the following: transcribing the text inside an image of published content into textual form to create new text content; extracting the subtitles of a video in published content and converting them into text to create new text content; or simply rewriting another piece of content wholesale. In such cases the excessive similarity lies between different types of information, and the related art is prone to missing such cases. Conversely, articles with similar titles may differ greatly in body content, which easily leads to misjudgment. The prior art struggles to identify these phenomena effectively, which greatly increases the workload of manual review. Based on this, the present disclosure provides the following embodiments.
Fig. 1 is a flowchart of a content identification method according to an embodiment of the present disclosure, and as shown in fig. 1, the method may include:
in step 11, the content to be identified is received, where the content to be identified may be uploaded content that needs to be checked, such as uploaded UGC content, uploaded PGC content, and also uploaded content of an official account, and may be determined according to a requirement for checking identification in actual needs.
In step 12, recognition results of a plurality of content identification pairs are obtained according to the plurality of content identification pairs containing the content to be identified and a content recognition model. Each content identification pair further comprises one candidate content from a preset set. The content recognition model obtains dimensional features of the content to be identified and the candidate content across a plurality of dimensions and determines the recognition result based on those features. A dimension represents one type of component in the content to be identified, such as body text, title, image, audio, or video, so recognition can be performed on features across multiple dimensions, realizing recognition between features of different dimensions and improving the accuracy of the recognition result.
The preset set may be set according to an actual usage scenario, for example, the preset set may be a set formed by all content data that can be obtained by a platform for auditing, or a set formed for a certain type of content data, for example, a sports news set, an entertainment news set, and the like, which is not limited in this disclosure.
In one possible embodiment, the candidate content may be determined by:
and carrying out similarity calculation on the content to be identified and a plurality of stored contents in the preset set so as to obtain the similarity between the content to be identified and the stored contents. For example, the content to be identified and the stored content may be generated into their corresponding vector representations based on a consistent vector conversion manner, and then similarity calculation may be performed based on the vector representations, such as calculating an inverse of a distance, a cosine value of an included angle, and the like.
Candidate contents are then determined according to the similarities between the content to be identified and the stored contents. Stored contents whose similarity exceeds a threshold may be taken as candidates, or such contents may be ranked in descending order of similarity and the top N taken as candidates, where N is set according to the application scenario (e.g. 10 or 15), reducing the amount of computation as far as possible while preserving accuracy.
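The retrieval step above can be sketched as follows; the vector dimensionality, threshold, and N here are illustrative values, not ones fixed by the disclosure:

```python
import numpy as np

def top_n_candidates(query_vec, stored_vecs, n=10, threshold=0.5):
    """Rank stored contents by cosine similarity to the content to be
    identified and keep the top-n whose similarity exceeds the threshold."""
    q = query_vec / np.linalg.norm(query_vec)
    s = stored_vecs / np.linalg.norm(stored_vecs, axis=1, keepdims=True)
    sims = s @ q                               # cosine similarity per stored item
    order = np.argsort(-sims)                  # indices in descending similarity
    return [(int(i), float(sims[i])) for i in order if sims[i] > threshold][:n]

# Toy corpus: 4 stored contents as 3-dimensional vectors.
stored = np.array([[1.0, 0.0, 0.0],
                   [0.9, 0.1, 0.0],
                   [0.0, 1.0, 0.0],
                   [0.0, 0.0, 1.0]])
candidates = top_n_candidates(np.array([1.0, 0.05, 0.0]), stored, n=2)
```

Only the two stored contents close to the query survive the threshold and the top-N cut, which is the behavior described above.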
The candidate contents so determined therefore each have a certain similarity to the content to be identified, and in this step the content to be identified is paired with each candidate content to form a content identification pair, so that whether the similarity between the two is too high can be further determined by one-to-one comparison.
In step 13, a target recognition result of the content to be recognized is determined according to a plurality of recognition results, wherein the target recognition result is used for indicating the original degree of the content to be recognized.
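The disclosure does not fix the aggregation rule at this point, so the following is only one plausible sketch of deriving a target result from per-pair results; the `similar` key and the threshold are assumptions made for illustration:

```python
def target_result(pair_results, threshold=0.5):
    """One plausible aggregation rule (an assumption, not the patent's rule):
    if any content identification pair is classified as similar content with
    probability above the threshold, flag the content as non-original."""
    worst = max(r["similar"] for r in pair_results)
    return "similar" if worst > threshold else "original"

# Per-pair probabilities of the similar-content classification.
results = [{"similar": 0.1}, {"similar": 0.7}, {"similar": 0.3}]
verdict = target_result(results)
```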
In this technical scheme, the content to be identified and existing content form content identification pairs, so that the degree of association between the content to be identified and each existing content is determined pair by pair; the recognition result for the content to be identified can then be determined comprehensively from the results of the plurality of pairs, realizing automatic review. Accordingly, when the content to be identified is recognized and reviewed, recognition can proceed from the different types of components within it, so that similar or duplicated passages between different components of the content to be identified and published content can be detected. This realizes recognition across features of different dimensions, improves the accuracy of the recognition result, effectively avoids misjudgment and missed judgment of the content to be identified, saves manual workload, provides effective data support for guaranteeing the publication of original content, and is convenient for users.
In a possible embodiment, the dimension corresponds to a feature extraction submodel of the content identification model, for example, the dimension may be a title, a text, an image, a video, an audio, and the like, and the content identification model may include a title feature extraction submodel, a text feature extraction submodel, an image feature extraction submodel, a video feature extraction submodel, and an audio feature extraction submodel.
Accordingly, an exemplary implementation manner of obtaining the recognition results of the plurality of content recognition pairs according to the plurality of content recognition pairs including the content to be recognized and the content recognition model in step 12 is as follows, as shown in fig. 2, and this step may include:
in step 21, for each content identification pair, according to the content to be identified, the candidate content in the content identification pair, and the feature extraction submodel corresponding to each dimension, the dimensional features corresponding to the content to be identified and the candidate content in multiple dimensions are determined.
The structures of the different feature extraction submodels may be the same or different, which is not limited in this disclosure. For example, the title feature extraction submodel may perform word segmentation on the title and then Word Embedding on the segments; for instance, word vectors of the segments may be obtained with Word2vec and an Embedding vector for the whole title generated from those word vectors. The title of the content to be identified is input into the title feature extraction submodel to obtain the content's Embedding vector in the title dimension, as shown in area A of FIG. 3; similarly, the title of the candidate content is input into the title feature extraction submodel to obtain the candidate content's Embedding vector in the title dimension, as shown in area B of FIG. 3.
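A minimal sketch of the title-embedding idea described above; the three-word vector table is invented for illustration (a real Word2vec model would be trained on a large corpus), and mean pooling is just one simple way to combine word vectors into a title vector:

```python
import numpy as np

# Hypothetical word-vector table standing in for a trained Word2vec model.
word_vectors = {
    "content": np.array([0.2, 0.8]),
    "identification": np.array([0.6, 0.4]),
    "method": np.array([0.5, 0.5]),
}

def title_embedding(title):
    """Segment the title into words and average their word vectors to
    obtain the title's Embedding vector."""
    tokens = title.lower().split()
    vecs = [word_vectors[t] for t in tokens if t in word_vectors]
    return np.mean(vecs, axis=0) if vecs else np.zeros(2)

emb = title_embedding("Content identification method")
```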
As an example, the text feature extraction submodel may share the structure of the title feature extraction submodel: it performs word segmentation on the body text, performs Word Embedding on the segments, and generates the Embedding vector of the body text from the word vectors.
For another example, the image feature extraction submodel may extract features with a CNN (Convolutional Neural Network) or a ResNet residual network to obtain the Embedding vector corresponding to each image, i.e. the image Embedding vector. As shown in FIG. 4, the images in the content to be identified are denoted D1-Dn and the images in the candidate content H1-Hm; the features of each image are extracted in display order and encoded as Image Embedding vectors, yielding the Embedding vectors of the content to be identified in the image dimension, as shown in FIG. 4.
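A toy stand-in for the CNN/ResNet backbone, assuming nothing about the real network: one hand-rolled valid 2-D convolution per filter followed by global average pooling, just to show how an image becomes a fixed-length Embedding vector:

```python
import numpy as np

def image_embedding(img, kernels):
    """Toy CNN: one valid 2-D convolution per kernel, then global average
    pooling, giving one embedding value per kernel (a real backbone such as
    ResNet would stack many such layers with nonlinearities)."""
    h, w = img.shape
    kh, kw = kernels.shape[1:]
    feats = []
    for k in kernels:
        out = np.array([[np.sum(img[i:i + kh, j:j + kw] * k)
                         for j in range(w - kw + 1)]
                        for i in range(h - kh + 1)])
        feats.append(out.mean())               # global average pooling
    return np.array(feats)

rng = np.random.default_rng(0)
img = rng.random((8, 8))                       # toy grayscale image
kernels = rng.random((4, 3, 3))                # 4 filters -> 4-dim embedding
emb = image_embedding(img, kernels)
```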
For the audio dimension, the audio feature extraction submodel may transcribe the audio via speech recognition to obtain its corresponding text, and the vector of that recognized text is taken as the vector of the audio. For the video dimension, the video feature extraction submodel may transcribe the audio track via Automatic Speech Recognition (ASR), or recognize the subtitles in the video via Optical Character Recognition (OCR), to obtain the text corresponding to the video, and may additionally sample images from the video at preset time intervals to obtain an image sequence; the vector of the text and the vector of each image in the sequence are then determined as above to form the vector corresponding to the video.
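The frame-sampling part of the video pipeline reduces to simple index arithmetic; the parameter names and values below are illustrative, not taken from the disclosure:

```python
def sample_frame_indices(duration_s, fps, interval_s):
    """Indices of the frames sampled every `interval_s` seconds from a
    video of `duration_s` seconds recorded at `fps` frames per second."""
    step = int(round(interval_s * fps))
    total = int(duration_s * fps)
    return list(range(0, total, step))

# A 10-second clip at 30 fps, sampled every 2 seconds.
idx = sample_frame_indices(duration_s=10, fps=30, interval_s=2)
```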
In step 22, for each dimension, the features of the content to be identified and the candidate content in the dimension are spliced to obtain a spliced feature corresponding to the dimension.
Illustratively, as shown in FIG. 3, area A acquires the dimensional features of the content to be identified and area B those of the candidate content. In the embodiment of the disclosure, the content to be identified and the candidate content are not compared directly; rather, their features are fused together to obtain a single feature on which recognition is based. When splicing the features of the content to be identified and the candidate content, a separation mark may therefore be placed between them, so that the two sets of features remain effectively distinguishable after splicing. The separation mark may be chosen per application scenario and is itself vector-encoded, so it enters the spliced feature as a separation vector, as shown in area C of FIG. 3. Similarly, as shown in FIG. 4, in the image dimension the separation mark may be a separator image, on which feature extraction and Embedding vector encoding are performed to obtain its Embedding vector before the spliced feature is formed.
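Splicing with a separation vector can be sketched as below; the separator value, feature width, and token counts are arbitrary choices for illustration:

```python
import numpy as np

# A dedicated separation vector marking the boundary between the two contents
# (the value -1 is an arbitrary illustrative choice).
SEP = np.full((1, 4), -1.0)

def splice(content_feat, candidate_feat):
    """Concatenate the per-dimension features of the content to be identified
    and the candidate content, with the separator vector between them."""
    return np.concatenate([content_feat, SEP, candidate_feat], axis=0)

content_title = np.ones((3, 4))       # 3 token vectors of width 4 (toy sizes)
candidate_title = np.zeros((5, 4))    # 5 token vectors of width 4
spliced = splice(content_title, candidate_title)
```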
In step 23, according to the splicing features corresponding to the multiple dimensions, the fusion features corresponding to the content identification pairs are obtained.
In step 24, the recognition result of the content recognition pair is determined based on the classification submodel and the fusion feature of the content recognition model.
As an example, the spliced features of the several dimensions may themselves be concatenated along the feature dimension to obtain the fusion feature; that is, the fusion feature contains the spliced features of all dimensions. Because the fusion feature thus simultaneously contains the features of the content to be identified and the candidate content, recognition and classification can be performed from the fusion feature with the classification submodel to obtain the classification result.
Therefore, with this technical scheme, features can be extracted from every dimension of both the candidate content and the content to be identified, the per-dimension features fused, and the fused feature made to contain the features of both contents across all dimensions at once. This ensures the accuracy and comprehensiveness of the features used for classification and provides accurate data support for classifying the recognition result. Moreover, classifying from the fusion feature yields richer results than the similarity computation of the related art, so the content recognition method can be extended and its range of application broadened.
In a possible embodiment, an exemplary implementation of step 23, obtaining the fusion feature corresponding to a content identification pair from the spliced features of the plurality of dimensions, is as follows, and this step may include:
and processing the splicing features based on the first attention layer aiming at the splicing features corresponding to each dimension to obtain the attention features corresponding to the splicing features.
For example, the spliced feature may be processed with a Transformer model, which contains an attention layer. Processing the spliced feature can thereby fuse, on top of it, both the attention of the content to be identified toward the candidate content and the self-attention of the content to be identified, for subsequently locating similar passages in the candidate content and assessing content quality information of the content to be identified.
For example, to further improve the accuracy of feature processing, multiple Transformer layers may be used: two layers of attention processing for the body text and image portions, whose content information is complex, and a single layer for the title portion.
The attention features of the individual dimensions are then spliced to obtain a multi-dimension spliced feature.
For example, the attention feature has the same size in every dimension, say a 10 × 32 matrix per dimension; the attention features may then be spliced along the feature dimension, and splicing the attention features of three dimensions such as title, body text, and image yields a multi-dimension spliced feature of size 10 × 96.
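The 10 × 32 example above is easy to verify directly: splicing three per-dimension attention features along the column axis yields the stated 10 × 96 multi-dimension feature.

```python
import numpy as np

# Attention features of equal size per dimension, as in the example:
# one 10 x 32 matrix each for the title, body text, and image dimensions.
title_att = np.zeros((10, 32))
text_att = np.zeros((10, 32))
image_att = np.zeros((10, 32))

# Splicing along the feature (column) axis gives the multi-dimension feature.
multi_dim = np.concatenate([title_att, text_att, image_att], axis=1)
```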
And processing the multi-dimensional splicing feature based on a second attention layer to obtain the fusion feature.
The second attention layer may have the same structure as the first or a different one, set according to the actual application scenario. Processing the multi-dimension spliced feature through another attention layer lets the fusion feature both contain the multi-dimension spliced feature and assign different weights to the information within it; attention processing uncovers the associations among the original data and highlights its important features. For example, the multi-dimension spliced feature may contain the title, body text, and image features of the content to be identified as well as those of the candidate content, so after attention computation in the second attention layer, the title feature of the content to be identified can also attend to the title, text, and image features of the candidate content. The associations between the features of the content to be identified and those of the candidate content across multiple dimensions are thereby obtained, ensuring the accuracy of subsequent recognition and providing data support for detecting similarity between different types of information and thus plagiarism or near-verbatim copying.
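A minimal single-head scaled dot-product self-attention over the multi-dimension spliced feature, standing in for the second attention layer (whose exact structure the disclosure leaves open); the random projection weights are placeholders for trained parameters:

```python
import numpy as np

def self_attention(x, wq, wk, wv):
    """Single-head scaled dot-product self-attention: every row of x (e.g.
    a title-feature row) attends to every row, including rows coming from
    the candidate content's features."""
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ k.T / np.sqrt(k.shape[1])
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)   # row-wise softmax
    return weights @ v

rng = np.random.default_rng(1)
x = rng.random((10, 96))                  # multi-dimension spliced feature
wq, wk, wv = (rng.random((96, 96)) for _ in range(3))
fused = self_attention(x, wq, wk, wv)
```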
In a possible embodiment, the recognition result includes an identification parameter relating the content to be identified to a classification, where the classification indicates the degree of originality of the content to be identified, and the identification parameter may be the probability that the content to be identified belongs to that classification.
Accordingly, the exemplary implementation of determining the recognition result of the content recognition pair based on the classification submodel of the content recognition model and the fusion feature may include:
and obtaining feature vectors of which the fusion features respectively correspond to a plurality of classifications according to the fusion features and the classification submodels, wherein the classifications comprise similar content classification, low-quality content classification and original content classification. Wherein the dimensions in the feature vector correspond one-to-one to the classes in the plurality of classes.
The similar-content classification indicates that the overlap between the content to be identified and existing content is too high, i.e., the content to be identified is duplicative; the low-quality-content classification indicates content that is difficult to recognize and review owing to problems such as too few words, blurry images, or unclear meaning; the original-content classification indicates that the overlap with existing content is low, i.e., the content to be identified is original.
The classification submodel may be implemented based on a GAP (Global Average Pooling) layer and a Dense (fully connected) layer. For example, the fusion feature may be input into the GAP layer, which globally averages the feature maps, and the result output to the Dense layer to obtain the feature vector for classification. The GAP layer makes the conversion from the input fusion feature to the feature vector for classification simpler; compared with an FC fully connected layer, it requires no large number of trainable parameters, so it effectively reduces the parameter space, makes the model more robust and avoids overfitting. The specific structures of the GAP layer and the Dense layer may be chosen from structures conventional in the art, which the present disclosure does not limit.
And processing the feature vector to obtain the identification parameters of the fusion features corresponding to each classification so as to obtain the identification result.
The feature vector may then be processed by a softmax activation function to determine the probability that the feature vector corresponds to each class based on the features of each dimension in the feature vector, i.e., to obtain the identification parameters for each class.
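The GAP-plus-Dense classification head with softmax activation described above can be sketched as follows; the feature-map sizes, the weight values and the three-class layout are illustrative assumptions.

```python
import math

def global_average_pool(feature_maps):
    """GAP: reduce each feature map (a list of activations) to its mean."""
    return [sum(fm) / len(fm) for fm in feature_maps]

def dense(x, weights, biases):
    """Fully-connected projection of the pooled vector onto the classes."""
    return [sum(wi * xi for wi, xi in zip(row, x)) + b
            for row, b in zip(weights, biases)]

def softmax(xs):
    m = max(xs)
    es = [math.exp(v - m) for v in xs]
    s = sum(es)
    return [e / s for e in es]

# Hypothetical fusion feature: 4 feature maps of 3 activations each.
fused = [[0.2, 0.4, 0.6], [1.0, 1.0, 1.0], [0.0, 0.3, 0.0], [0.5, 0.1, 0.9]]
pooled = global_average_pool(fused)          # one value per feature map
W = [[0.7, 0.1, 0.0, 0.2],   # similar content classification
     [0.1, 0.5, 0.3, 0.1],   # low-quality content classification
     [0.2, 0.4, 0.7, 0.7]]   # original content classification
b = [0.0, 0.0, 0.0]
logits = dense(pooled, W, b)
probs = softmax(logits)      # identification parameters, one per class
```

Note how the GAP stage contributes no trainable parameters at all, which is the robustness/overfitting advantage over a fully connected reduction mentioned above.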
Through the above technical solution, compared with related-art schemes that determine whether the content to be identified resembles published content only by computing a similarity between the content to be identified and existing content, synthesizing the two contents of a content identification pair into a fusion feature for comprehensive judgment makes it possible to effectively identify duplication and plagiarism across different types of content, such as "copying an image from published content" or "transcribing the video captions of published content into text". Plagiarism or reposting behavior between different types of information can thus be effectively identified, providing effective data support for subsequently determining an accurate identification result for the content to be identified.
In a possible embodiment, the identification result includes an identification parameter that the content to be identified corresponds to a classification, where the classification is used to indicate the degree of originality of the content to be identified, and the identification parameter may be a probability that the content to be identified corresponds to the classification.
Accordingly, an exemplary implementation manner of determining the target recognition result of the content to be recognized according to the plurality of recognition results is as follows, and the step may include:
and acquiring identification parameters respectively corresponding to the similar content classifications in the plurality of identification results, namely determining probability values corresponding to the similar content classifications determined in the plurality of content identification pairs.
If the identification parameters of the similar content classifications meet similar identification conditions, determining that the target identification result is a similar content classification, wherein the similar identification conditions may include the following conditions:
in a first case, the maximum value of the identification parameter corresponding to the similar content classification is greater than a first preset threshold.
As described above, a plurality of candidate contents may first be screened out based on the content to be identified, and the content to be identified is then compared with each candidate content one by one. For example, if the probability that the content to be identified corresponds to the similar content classification is too high when compared with one candidate content, the similarity between the content to be identified and that candidate content may be judged too high, and the target identification result of the content to be identified may be determined to be the similar content classification. Further, in the present disclosure, the determination may be made directly from the maximum value of the identification parameter corresponding to the similar content classification, which reduces the amount of data to be compared, ensures the accuracy of the target identification result and improves identification efficiency.
As another example, if the maximum value of the identification parameter corresponding to the similar content classification is less than or equal to the first preset threshold, a plurality of identification results corresponding to the content to be identified may be output for manual review to further determine a target identification result of the content to be identified.
In a second case, the maximum value of the identification parameters corresponding to the similar content classification is less than or equal to the first preset threshold, and the average value of the identification parameters corresponding to the similar content classification is greater than a second preset threshold, at this time, it may also be determined that the target identification result is a similar content classification.
If the maximum value of the identification parameter corresponding to the similar content classification is less than or equal to the first preset threshold, the probability that the content to be identified corresponds to the similar content classification does not meet the requirement for any candidate content; that is, the content to be identified is not highly similar to any single candidate content. In this case, it may be further determined whether the content to be identified is similar to a plurality of candidate contents at the same time.
Accordingly, in the embodiment of the present disclosure, when the maximum value of the identification parameter corresponding to the similar content classification is less than or equal to the first preset threshold, the average value of those identification parameters may be further examined. An average value greater than the second preset threshold indicates that the content to be identified is somewhat similar to each candidate content, i.e. similar to a plurality of candidate contents at once, which suggests that it may be pieced together from partial content copied from several candidate contents simultaneously. In this case the target identification result of the content to be identified may be determined to be the similar content classification.
Therefore, through this technical solution, the identification results of a plurality of content identification pairs can be determined comprehensively: both similarity of the content to be identified to a single piece of content and similarity to multiple pieces of content can be detected. The content to be identified is thus analyzed comprehensively, improving the comprehensiveness and accuracy of its identification result while effectively expanding the application range of the scheme.
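The similar-identification condition described above — maximum above the first preset threshold, or otherwise average above the second preset threshold — can be sketched as a small helper; the threshold values and parameter lists below are placeholders, not values from the disclosure.

```python
def is_similar(similar_params, t1, t2):
    """similar_params: probability of the similar content classification,
    one value per content identification pair. t1/t2 stand for the first
    and second preset thresholds."""
    if max(similar_params) > t1:
        # case one: highly similar to a single candidate content
        return True
    # case two: not similar to any single candidate, but possibly pieced
    # together from several -- fall back to the average over all pairs
    return sum(similar_params) / len(similar_params) > t2

# Illustrative calls with hypothetical probabilities and thresholds:
is_similar([0.1, 0.95, 0.2], t1=0.9, t2=0.6)   # triggers case one
is_similar([0.7, 0.8, 0.75], t1=0.9, t2=0.6)   # triggers case two
```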
In another possible embodiment, an exemplary implementation manner of determining the target recognition result of the content to be recognized according to the plurality of recognition results is as follows, and the step may further include:
and acquiring identification parameters respectively corresponding to the low-quality content classification in the plurality of identification results, namely determining probability values corresponding to the low-quality content classification determined in the plurality of content identification pairs.
If the identification parameters of the low-quality content classifications meet a low-quality identification condition and the identification parameters of the similar content classifications do not meet the similar identification condition, determining that the target identification result is the low-quality content classification, wherein the low-quality identification condition may include the following conditions:
in the first case, the maximum value of the identification parameter corresponding to the low-quality content classification is greater than a third preset threshold.
Whether the content to be identified is low-quality content may be determined based on the identification parameters corresponding to the low-quality content classification in the identification results of the respective content identification pairs. For example, if the probability that the content to be identified corresponds to the low-quality content classification is too high, the content to be identified may be low-quality content; if it also does not belong to the similar content classification, the target identification result may be determined to be the low-quality content classification. Further, in the present disclosure, the determination may be made directly from the maximum value of the identification parameter corresponding to the low-quality content classification, which reduces the amount of data to be compared, ensures the accuracy of the target identification result and improves identification efficiency.
In the second case, the maximum value of the identification parameters corresponding to the low-quality content classification is less than or equal to the third preset threshold, and the average value of the identification parameters corresponding to the low-quality content classification is greater than a fourth preset threshold, and it may be determined that the target identification result is the low-quality content classification.
A maximum value of the identification parameter corresponding to the low-quality content classification that is less than or equal to the third preset threshold indicates that the plurality of identification results of the content to be identified contain no clear directional result, so the average value of those identification parameters may be further examined. When that average value is greater than the fourth preset threshold, the plurality of identification results collectively point to a high probability that the content to be identified is low-quality content, and the target identification result of the content to be identified may be determined to be the low-quality content classification.
Therefore, by combining the identification results of a plurality of content identification pairs for comprehensive determination, this technical solution not only checks whether the content to be identified is highly similar to published content but also checks its quality, so that writing quality receives direct attention during content identification. This avoids the one-sided audit that would result from determining the content to be identified directly as similar or original content, improves the precision of content identification as well as the comprehensiveness and accuracy of its identification result, further expands the application range of the scheme, provides data support for preventing the publication of low-quality content, and improves the user experience.
In another possible embodiment, the exemplary implementation manner of determining the target recognition result of the content to be recognized according to a plurality of recognition results may further include:
and acquiring identification parameters respectively corresponding to the original content classification in the plurality of identification results, namely determining probability values corresponding to the original content classification and determined in the plurality of content identification pairs.
And if the maximum value of the identification parameters of the original content classifications is larger than a fifth preset threshold value, the identification parameters of the similar content classifications do not meet similar identification conditions, and the identification parameters of the low-quality content classifications do not meet low-quality identification conditions, determining that the target identification result is the original content classification.
In this embodiment, when the maximum value of the identification parameter corresponding to the original content classification, obtained by comparing the content to be identified with a certain candidate content, is greater than the fifth preset threshold, the content to be identified is original relative to that candidate content. Provided the content to be identified is neither similar to other candidate contents nor low-quality content, it may then be directly determined to be original content.
It should be noted that the first preset threshold, the second preset threshold, the third preset threshold, the fourth preset threshold and the fifth preset threshold may each be set according to the actual application scenario, which the present disclosure does not limit.
Therefore, with this technical solution, even when the content to be identified is classified as original content relative to some candidate content, the original content classification is not determined directly; the content to be identified is determined to be original only after it is further confirmed to be neither low-quality nor excessively similar to the other candidate contents. This effectively ensures the accuracy of the identification result of the content to be identified and provides accurate data support for the subsequent publication of content.
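The full decision cascade across the three classifications, as laid out in the embodiments above, can be sketched as follows. The fallback to manual review when no condition is met follows the earlier remark about outputting results for manual audit; all threshold names and values are placeholders.

```python
def target_result(results, t1, t2, t3, t4, t5):
    """results: one (p_similar, p_low_quality, p_original) triple per
    content identification pair. t1..t5 stand for the first to fifth
    preset thresholds, to be tuned per application scenario."""
    sim = [r[0] for r in results]
    low = [r[1] for r in results]
    orig = [r[2] for r in results]
    mean = lambda xs: sum(xs) / len(xs)
    # similar identification condition: single-candidate or multi-candidate
    if max(sim) > t1 or mean(sim) > t2:
        return "similar"
    # low-quality condition, checked only when not similar
    if max(low) > t3 or mean(low) > t4:
        return "low_quality"
    # original only when neither of the above conditions is met
    if max(orig) > t5:
        return "original"
    return "manual_review"   # no clear directional result

# Hypothetical identification results for two content identification pairs:
target_result([(0.95, 0.02, 0.03), (0.1, 0.1, 0.8)],
              t1=0.9, t2=0.6, t3=0.9, t4=0.6, t5=0.7)
```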
In another possible embodiment, the maximum value and the average value of the identification parameters corresponding to the original content classification, the low-quality content classification and the similar content classification in the plurality of identification results may be displayed and output, so that during auditing a user can make a judgment directly from this identification parameter information. This further ensures the accuracy of the target identification result of the content to be identified, reduces the manual workload to a certain extent, reduces misjudgment and missed judgment of content with excessively high similarity, and improves the efficiency and accuracy of content identification.
To further explain the accuracy of the identification result, the similar portion with high similarity to the content to be identified may be determined: the candidate content in the content identification pair with the maximum identification parameter may be directly taken as the similar portion, and that content identification pair is determined to be the target content identification pair. As an example, the candidate content in the target content identification pair may be directly displayed and output to show the user the similar portion corresponding to the content to be identified.
In another possible embodiment, the method may further include:
and under the condition that the target identification result is determined to be similar content classification, determining a target position of which the attention parameter is greater than a preset threshold value in the fusion characteristics according to the fusion characteristics corresponding to the target content identification pair, wherein the target content identification pair is a content identification pair corresponding to the maximum value of the identification parameters of the similar content classification.
In the embodiment of the present disclosure, classification is performed based on the fusion feature and the classification submodel to obtain the identification result. Therefore, when the content to be identified is determined to be similar content based on the fusion feature, the larger an attention parameter is, the greater the influence of the corresponding part on the identification result during classification. In other words, target positions in the fusion feature whose attention parameters exceed the preset threshold contribute most to classifying the content to be identified as similar content, and the content corresponding to those target positions can be taken as the similar portion.
Further, determining the content corresponding to the target position in the candidate content of the target content identification pair as the coincidence content corresponding to the content to be identified;
and outputting the candidate content and the coincidence content in the target content identification pair.
When features are extracted from the candidate content, they are extracted in the order of the candidate content's parts; for example, the paragraphs of the text part are extracted in sequence, so that the text features A1-Am respectively correspond to the first to the m-th paragraph of the text. The part of the candidate content corresponding to the target position, i.e. the portion of the candidate content overlapping the content to be identified with excessive similarity, can therefore be located from the determined target position. The candidate content and the overlapping content are then displayed.
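Locating the overlapping portion from paragraph-aligned features can be sketched as follows; the per-paragraph attention weights, the paragraph texts and the threshold are all hypothetical.

```python
def overlapping_paragraphs(attention, paragraphs, threshold):
    """attention: one attention parameter per paragraph-level feature
    A1..Am, aligned with the candidate content's paragraphs in extraction
    order. Returns the paragraphs whose attention exceeds the threshold,
    i.e. the overlapping content to display alongside the candidate."""
    return [p for w, p in zip(attention, paragraphs) if w > threshold]

paras = ["first paragraph", "second paragraph", "third paragraph"]
attn = [0.05, 0.80, 0.15]   # hypothetical attention parameters
overlap = overlapping_paragraphs(attn, paras, threshold=0.5)
```

Because extraction preserves paragraph order, the index of a high-attention feature maps directly back to the overlapping paragraph, which is what makes this display possible.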
Therefore, through this technical solution, the user can be prompted not only that the content to be identified is similar content, but also which existing content has high similarity to it and which portion of that existing content is similar. This provides data support for guaranteeing the accuracy of the target identification result, illustrates and displays the target identification result more clearly, improves its reliability and improves the user experience.
In a possible embodiment, the dimensions correspond one-to-one to the feature extraction submodels of the content identification model. Taking dimensions including a title, a text and an image as an example, fig. 5 is a schematic structural diagram of the content identification model, which is obtained by:
and acquiring the training sample data, wherein the training sample data comprises sample content, associated content corresponding to the sample content and an associated label. The associated content may be determined similar content corresponding to the sample content, and the associated tag may indicate an original degree of the sample content relative to the associated content, such as the low-quality content category, the original content category, the similar content category, and the like described above.
For each training sample data, determining corresponding dimension characteristics of the sample content and the associated content under multiple dimensions respectively according to the sample content, the associated content corresponding to the sample content and the feature extraction submodel corresponding to each dimension;
for each dimension, splicing the characteristics of the sample content and the associated content in the dimension to obtain a spliced characteristic corresponding to the dimension; acquiring fusion characteristics corresponding to the training sample data according to the splicing characteristics corresponding to the multiple dimensions; and determining the recognition result of the training sample data based on the classification submodel of the content recognition model and the fusion characteristics.
The specific implementation of the above steps is the same as the related processing flow between the content to be identified and the candidate content, and is not described herein again.
And adjusting parameters of the content recognition model according to the recognition result and the associated label to obtain the trained content recognition model.
The identification error of the content recognition model may be calculated from the recognition result and the associated label. When the identification error is greater than an error threshold, the parameters of the content recognition model are adjusted based on the error, for example by gradient descent; training stops once the calculated identification error is less than or equal to the error threshold, yielding the trained content recognition model.
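The stopping rule described above — adjust parameters while the identification error exceeds the error threshold — can be sketched with a toy one-parameter model trained by gradient descent. The model, loss, learning rate and thresholds below are illustrative assumptions, not the disclosed content recognition model.

```python
def train(w, samples, lr=0.1, error_threshold=1e-3, max_epochs=1000):
    """Toy training loop matching the described stopping rule: a single
    weight w fitting y = w * x under mean squared error, adjusted by
    gradient descent until the error falls to the threshold."""
    for _ in range(max_epochs):
        error = sum((w * x - y) ** 2 for x, y in samples) / len(samples)
        if error <= error_threshold:
            break                      # identification error small enough
        grad = sum(2 * (w * x - y) * x for x, y in samples) / len(samples)
        w -= lr * grad                 # gradient-descent parameter update
    return w, error

samples = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]   # labels follow y = 2x
w, err = train(0.0, samples)
```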
Therefore, through this technical solution, the content recognition model can be trained on the sample content, its associated content and the associated label, realizing feature extraction, splicing, fusion and classified recognition of the two contents under comparison, with features from different dimensions aggregated for judgment. The content recognition model can thus recognize different types of content, expanding its application range and improving the accuracy of recognition results obtained from it.
The present disclosure also provides a content recognition apparatus, as shown in fig. 6, the apparatus 10 including:
a receiving module 100, configured to receive content to be identified;
a processing module 200, configured to obtain recognition results of a plurality of content recognition pairs according to a plurality of content recognition pairs including the content to be recognized and a content recognition model, where each content recognition pair further includes a candidate content in a preset set, the content recognition model is configured to obtain dimensional features of the content to be recognized and the candidate content in a plurality of dimensions, and determine the recognition results based on the dimensional features, and the dimensions are used to represent multiple types of components in the content to be recognized;
the first determining module 300 is configured to determine a target recognition result of the content to be recognized according to a plurality of recognition results.
Optionally, the dimension corresponds to a feature extraction sub-model of the content recognition model, and the processing module includes:
the first determining submodule is used for determining the corresponding dimension characteristics of the content to be identified and the candidate content under a plurality of dimensions respectively according to the content to be identified, the candidate content in the content identification pair and the feature extraction submodel corresponding to each dimension aiming at each content identification pair;
the first splicing submodule is used for splicing the to-be-identified content and the candidate content in the dimension aiming at each dimension to obtain a splicing feature corresponding to the dimension;
the fusion submodule is used for acquiring fusion characteristics corresponding to the content identification pairs according to the splicing characteristics corresponding to the multiple dimensions;
and the second determining submodule is used for determining the recognition result of the content recognition pair based on the classification submodel of the content recognition model and the fusion characteristics.
Optionally, the fusion submodule comprises:
the first processing submodule is used for processing the splicing features on the basis of a first attention layer aiming at the splicing features corresponding to each dimension to obtain attention features corresponding to the splicing features;
the second splicing submodule is used for splicing the attention features under each dimension to obtain multi-dimension splicing features;
and the second processing submodule is used for processing the multi-dimensional splicing feature based on a second attention layer to obtain the fusion feature.
Optionally, the identification result includes an identification parameter of the content to be identified corresponding to the classification, and the second determining sub-module includes:
the third processing submodule is used for obtaining feature vectors of which the fusion features respectively correspond to a plurality of classes according to the fusion features and the classification submodels, wherein the classes comprise similar content classes, low-quality content classes and original content classes;
and the fourth processing submodule is used for processing the feature vector to obtain the identification parameters of the fusion features corresponding to each classification so as to obtain the identification result.
Optionally, the identification result includes an identification parameter of the content to be identified corresponding to a classification, and the first determining module includes:
the first obtaining submodule is used for obtaining the identification parameters respectively corresponding to the similar content classification in the plurality of identification results;
a third determining sub-module, configured to determine that the target recognition result is a similar content category if the recognition parameters of the similar content categories satisfy similar recognition conditions, where the similar recognition conditions include:
the maximum value of the identification parameters corresponding to the similar content classification is larger than a first preset threshold value;
the maximum value of the identification parameters corresponding to the similar content classification is smaller than or equal to the first preset threshold, and the average value of the identification parameters corresponding to the similar content classification is larger than a second preset threshold.
Optionally, the first determining module further includes:
the second obtaining submodule is used for obtaining the identification parameters respectively corresponding to the low-quality content classification in the plurality of identification results;
a fourth determining sub-module, configured to determine that the target identification result is a low-quality content classification if the identification parameters of the low-quality content classifications satisfy a low-quality identification condition and the identification parameters of the similar content classifications do not satisfy a similar identification condition, where the low-quality identification condition includes:
the maximum value of the identification parameter corresponding to the low-quality content classification is greater than a third preset threshold value;
the maximum value of the identification parameters corresponding to the low-quality content classification is less than or equal to the third preset threshold, and the average value of the identification parameters corresponding to the low-quality content classification is greater than a fourth preset threshold.
Optionally, the first determining module further includes:
the third obtaining submodule is used for obtaining the identification parameters which respectively correspond to the original content classification in the plurality of identification results;
and the fifth determining submodule is used for determining that the target identification result is the original content classification if the maximum value of the identification parameters of the original content classifications is larger than a fifth preset threshold value, the identification parameters of the similar content classifications do not meet similar identification conditions, and the identification parameters of the low-quality content classifications do not meet low-quality identification conditions.
Optionally, the apparatus further comprises:
the second determination module is used for determining a target position of which the attention parameter is greater than a preset threshold value in fusion characteristics according to the fusion characteristics corresponding to a target content identification pair under the condition that the target identification result is determined to be similar content classification, wherein the target content identification pair is a content identification pair corresponding to the maximum value of the identification parameters of the similar content classification;
a third determining module, configured to determine, as overlapping content corresponding to the content to be identified, content corresponding to the target position in the candidate content of the target content identification pair;
and the output module is used for outputting the candidate content and the overlapped content in the target content identification pair.
Optionally, the dimensions correspond to feature extraction submodels of the content identification model one to one, and the content identification model is obtained by:
acquiring the training sample data, wherein the training sample data comprises sample content, associated content corresponding to the sample content and an associated label;
for each training sample data, determining corresponding dimension characteristics of the sample content and the associated content under multiple dimensions respectively according to the sample content, the associated content corresponding to the sample content and the feature extraction submodel corresponding to each dimension;
for each dimension, splicing the characteristics of the sample content and the associated content in the dimension to obtain a spliced characteristic corresponding to the dimension;
acquiring fusion characteristics corresponding to the training sample data according to the splicing characteristics corresponding to the multiple dimensions;
determining the recognition result of the training sample data based on the classification submodel and the fusion characteristics of the content recognition model;
and adjusting parameters of the content recognition model according to the recognition result and the associated label to obtain the trained content recognition model.
Referring now to FIG. 7, a block diagram of an electronic device 600 suitable for use in implementing embodiments of the present disclosure is shown. The terminal device in the embodiments of the present disclosure may include, but is not limited to, a mobile terminal such as a mobile phone, a notebook computer, a digital broadcast receiver, a PDA (personal digital assistant), a PAD (tablet computer), a PMP (portable multimedia player), a vehicle terminal (e.g., a car navigation terminal), and the like, and a stationary terminal such as a digital TV, a desktop computer, and the like. The electronic device shown in fig. 7 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present disclosure.
As shown in FIG. 7, the electronic device 600 may include a processing device (e.g., a central processing unit, a graphics processor, etc.) 601, which may perform various appropriate actions and processes according to a program stored in a Read-Only Memory (ROM) 602 or a program loaded from a storage device 608 into a Random Access Memory (RAM) 603. Various programs and data necessary for the operation of the electronic device 600 are also stored in the RAM 603. The processing device 601, the ROM 602, and the RAM 603 are connected to one another via a bus 604. An input/output (I/O) interface 605 is also connected to the bus 604.
Generally, the following devices may be connected to the I/O interface 605: input devices 606 including, for example, a touch screen, touch pad, keyboard, mouse, camera, microphone, accelerometer, gyroscope, etc.; output devices 607 including, for example, a Liquid Crystal Display (LCD), a speaker, a vibrator, and the like; storage 608 including, for example, tape, hard disk, etc.; and a communication device 609. The communication means 609 may allow the electronic device 600 to communicate with other devices wirelessly or by wire to exchange data. While fig. 7 illustrates an electronic device 600 having various means, it is to be understood that not all illustrated means are required to be implemented or provided. More or fewer devices may alternatively be implemented or provided.
In particular, according to an embodiment of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program carried on a non-transitory computer readable medium, the computer program containing program code for performing the method illustrated by the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network via the communication means 609, or may be installed from the storage means 608, or may be installed from the ROM 602. The computer program, when executed by the processing device 601, performs the above-described functions defined in the methods of the embodiments of the present disclosure.
It should be noted that the computer readable medium in the present disclosure can be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present disclosure, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In contrast, in the present disclosure, a computer readable signal medium may comprise a propagated data signal with computer readable program code embodied therein, either in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: electrical wires, optical cables, RF (radio frequency), etc., or any suitable combination of the foregoing.
In some embodiments, the clients and servers may communicate using any currently known or future developed network protocol, such as HTTP (HyperText Transfer Protocol), and may be interconnected with any form or medium of digital data communication (e.g., a communications network). Examples of communication networks include a local area network ("LAN"), a wide area network ("WAN"), internetworks (e.g., the Internet), and peer-to-peer networks (e.g., ad hoc peer-to-peer networks), as well as any currently known or future developed network.
The computer readable medium may be embodied in the electronic device; or may exist separately without being assembled into the electronic device.
The computer readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to: receiving content to be identified; obtaining identification results of a plurality of content identification pairs according to a plurality of content identification pairs containing the content to be identified and a content identification model, wherein each content identification pair further comprises a candidate content in a preset set, the content identification model is used for obtaining dimension characteristics of the content to be identified and the candidate content under a plurality of dimensions, and determining the identification results based on the dimension characteristics, and the dimensions are used for representing various types of components in the content to be identified; and determining a target recognition result of the content to be recognized according to the plurality of recognition results.
Computer program code for carrying out operations of the present disclosure may be written in any combination of one or more programming languages, including but not limited to object-oriented programming languages such as Java, Smalltalk, and C++, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The modules described in the embodiments of the present disclosure may be implemented by software or hardware. The name of a module does not in some cases constitute a limitation of the module itself, and for example, a receiving module may also be described as a "module that receives content to be identified".
The functions described herein above may be performed, at least in part, by one or more hardware logic components. For example, without limitation, exemplary types of hardware logic components that may be used include: field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), systems on a chip (SOCs), Complex Programmable Logic Devices (CPLDs), and the like.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
Example 1 provides a content identification method according to one or more embodiments of the present disclosure, wherein the method includes:
receiving content to be identified;
obtaining identification results of a plurality of content identification pairs according to a plurality of content identification pairs containing the content to be identified and a content identification model, wherein each content identification pair further comprises a candidate content in a preset set, the content identification model is used for obtaining dimension characteristics of the content to be identified and the candidate content under a plurality of dimensions, and determining the identification results based on the dimension characteristics, and the dimensions are used for representing various types of components in the content to be identified;
and determining a target recognition result of the content to be recognized according to the plurality of recognition results.
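The three steps of Example 1 can be sketched as a loop over candidate pairs. The `toy_pair_model` and the arg-max aggregation rule below are illustrative assumptions; the disclosure instead aggregates per-classification recognition parameters as in Examples 5 through 7.

```python
def identify(content, candidate_set, pair_model):
    # One content identification pair per candidate in the preset set.
    results = [pair_model(content, candidate) for candidate in candidate_set]
    # Aggregate the per-pair recognition results into a target result;
    # a plain arg-max stands in for the thresholded rules of Examples 5-7.
    best = max(range(len(results)), key=results.__getitem__)
    return candidate_set[best], results

def toy_pair_model(a, b):
    # Character-overlap ratio as a stand-in for the content recognition model.
    same = sum(1 for x, y in zip(a, b) if x == y)
    return same / max(len(a), len(b))
```

With a preset set of three candidates, the candidate agreeing character-for-character with the content to be identified wins.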
Example 2 provides the method of example 1, wherein the dimensions correspond to feature extraction submodels of the content recognition model one to one, and obtaining recognition results of a plurality of content recognition pairs according to the content recognition models and a plurality of content recognition pairs including the content to be recognized includes:
for each content identification pair, determining corresponding dimension characteristics of the content to be identified and the candidate content under multiple dimensions respectively according to the content to be identified, the candidate content in the content identification pair and the feature extraction submodel corresponding to each dimension;
for each dimension, splicing the features of the content to be identified and the candidate content in the dimension to obtain a spliced feature corresponding to the dimension;
acquiring fusion characteristics corresponding to the content identification pairs according to the splicing characteristics corresponding to the multiple dimensions;
and determining the recognition result of the content recognition pair based on the classification submodel of the content recognition model and the fusion characteristics.
Example 3 provides the method of example 2, wherein the obtaining, according to the splicing features corresponding to the multiple dimensions, the fusion feature corresponding to the content identification pair includes:
processing the splicing features based on a first attention layer aiming at the splicing features corresponding to each dimension to obtain attention features corresponding to the splicing features;
splicing the attention features under each dimension to obtain a multi-dimension splicing feature;
and processing the multi-dimensional splicing feature based on a second attention layer to obtain the fusion feature.
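The two-level fusion of Example 3 can be sketched as follows. The `attend` function, which reweights positions by a softmax over the values themselves, is an assumed stand-in for the disclosed learned attention layers.

```python
import math

def softmax(xs):
    # Numerically stable softmax.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attend(vec):
    # Minimal "attention": weight each position by a softmax over the
    # values (a stand-in for learned query/key/value projections).
    return [w * v for w, v in zip(softmax(vec), vec)]

def fuse(spliced_by_dim):
    # First attention layer, applied to each dimension's spliced feature.
    attended = [attend(v) for v in spliced_by_dim]
    # Splice the attended features into one multi-dimension feature.
    multi = [x for v in attended for x in v]
    # Second attention layer over the multi-dimension splice -> fusion feature.
    return attend(multi)
```

The fusion feature keeps one value per position of the multi-dimension splice, so downstream classification sees the full spliced width.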
Example 4 provides the method of example 2, wherein the recognition result includes a recognition parameter that the content to be recognized corresponds to a classification, and the determining the recognition result of the content recognition pair based on the classification submodel of the content recognition model and the fusion feature includes:
obtaining, according to the fusion feature and the classification submodel, a feature vector of the fusion feature for each of a plurality of classifications, wherein the classifications include a similar content classification, a low-quality content classification, and an original content classification;
and processing the feature vector to obtain the identification parameters of the fusion features corresponding to each classification so as to obtain the identification result.
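Example 4's classification step can be sketched with one hypothetical linear projection per classification followed by a softmax, so that the identification parameters for the three classifications sum to one. The projection weights are assumptions; the disclosed classification submodel is a learned network.

```python
import math

CLASSIFICATIONS = ["similar", "low_quality", "original"]

def classify(fused, class_weights):
    # One (hypothetical) linear projection per classification yields a
    # per-class score; softmax turns scores into identification parameters.
    scores = [sum(w * x for w, x in zip(ws, fused)) for ws in class_weights]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return dict(zip(CLASSIFICATIONS, (e / total for e in exps)))
```

The returned dict is one recognition result: an identification parameter per classification for one content identification pair.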
Example 5 provides the method of example 1, wherein the recognition result includes a recognition parameter that the content to be recognized corresponds to a classification, and the determining the target recognition result of the content to be recognized according to a plurality of the recognition results includes:
acquiring identification parameters respectively corresponding to the similar content classification in a plurality of identification results;
if the identification parameters of the similar content classifications meet similar identification conditions, determining that the target identification result is a similar content classification, wherein the similar identification conditions comprise:
the maximum value of the identification parameters corresponding to the similar content classification is greater than a first preset threshold; or
the maximum value of the identification parameters corresponding to the similar content classification is less than or equal to the first preset threshold, and the average value of the identification parameters corresponding to the similar content classification is greater than a second preset threshold.
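The similar identification condition of Example 5 reduces to a small predicate over the similar-classification parameters collected from the plurality of recognition results. The threshold values used in the checks below are arbitrary assumptions.

```python
def meets_similar_condition(similar_params, first_threshold, second_threshold):
    # similar_params: the similar-classification identification parameter
    # from each of the plurality of recognition results.
    mx = max(similar_params)
    avg = sum(similar_params) / len(similar_params)
    # Condition (a): the maximum exceeds the first preset threshold; or
    # condition (b): it does not, but the average exceeds the second.
    return mx > first_threshold or (mx <= first_threshold
                                    and avg > second_threshold)
```

Note that the two branches together are logically equivalent to "max exceeds the first threshold, or the average exceeds the second"; the explicit form mirrors the claim language.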
Example 6 provides the method of example 5, wherein the determining a target recognition result of the content to be recognized according to a plurality of the recognition results, further includes:
acquiring identification parameters respectively corresponding to low-quality content classification in a plurality of identification results;
if the identification parameters of the low-quality content classifications meet a low-quality identification condition and the identification parameters of the similar content classifications do not meet the similar identification condition, determining that the target identification result is the low-quality content classification, wherein the low-quality identification condition comprises:
the maximum value of the identification parameters corresponding to the low-quality content classification is greater than a third preset threshold; or
the maximum value of the identification parameters corresponding to the low-quality content classification is less than or equal to the third preset threshold, and the average value of the identification parameters corresponding to the low-quality content classification is greater than a fourth preset threshold.
Example 7 provides the method of example 6, wherein the determining a target recognition result of the content to be recognized according to a plurality of the recognition results, further includes:
acquiring identification parameters respectively corresponding to the classification of the original content in a plurality of identification results;
and if the maximum value of the identification parameters of the original content classifications is larger than a fifth preset threshold value, the identification parameters of the similar content classifications do not meet similar identification conditions, and the identification parameters of the low-quality content classifications do not meet low-quality identification conditions, determining that the target identification result is the original content classification.
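Examples 5 through 7 together define a decision cascade: similar content is checked first, then low-quality content, then original content. The sketch below mirrors the first through fifth preset thresholds as `t1` through `t5`; the `None` fallback for the fully undecided case is an assumption, since the disclosure does not state it.

```python
def target_result(results, t1, t2, t3, t4, t5):
    # results: one {"similar": p, "low_quality": p, "original": p} dict
    # per content identification pair.
    def max_avg(classification):
        vals = [r[classification] for r in results]
        return max(vals), sum(vals) / len(vals)

    sim_max, sim_avg = max_avg("similar")
    if sim_max > t1 or sim_avg > t2:      # similar identification condition
        return "similar"
    low_max, low_avg = max_avg("low_quality")
    if low_max > t3 or low_avg > t4:      # low-quality identification condition
        return "low_quality"
    if max_avg("original")[0] > t5:       # original content condition
        return "original"
    return None  # undecided under these (assumed) thresholds
```

The ordering matters: a content item can only be classified as original when neither the similar nor the low-quality condition is met, matching Example 7.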
Example 8 provides the method of any of examples 5-7, wherein the method further comprises, in accordance with one or more embodiments of the present disclosure:
under the condition that the target identification result is determined to be similar content classification, determining a target position of which the attention parameter is greater than a preset threshold value in fusion characteristics according to the fusion characteristics corresponding to a target content identification pair, wherein the target content identification pair is a content identification pair corresponding to the maximum value of the identification parameters of the similar content classification;
determining the content corresponding to the target position in the candidate content of the target content identification pair as the coincidence content corresponding to the content to be identified;
and outputting the candidate content and the coincidence content in the target content identification pair.
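Example 8's coincidence extraction can be sketched as a filter over candidate positions, assuming one attention parameter per candidate position; the disclosure does not fix that granularity, so this is an illustrative assumption.

```python
def coinciding_content(candidate, attention_params, threshold):
    # Target positions: indices whose attention parameter exceeds the
    # preset threshold; the candidate content at those positions forms
    # the content coinciding with the content to be identified.
    return "".join(ch for ch, a in zip(candidate, attention_params)
                   if a > threshold)
```

In practice the candidate with the highest similar-classification parameter (the target content identification pair) would be passed in, together with the fusion feature's attention parameters.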
Example 9 provides the method of example 1, wherein the dimensions correspond one-to-one to feature extraction submodels of the content recognition model, the content recognition model obtained by:
acquiring the training sample data, wherein the training sample data comprises sample content, associated content corresponding to the sample content and an associated label;
for each training sample data, determining corresponding dimension characteristics of the sample content and the associated content under multiple dimensions respectively according to the sample content, the associated content corresponding to the sample content and the feature extraction submodel corresponding to each dimension;
for each dimension, splicing the characteristics of the sample content and the associated content in the dimension to obtain a spliced characteristic corresponding to the dimension;
acquiring fusion characteristics corresponding to the training sample data according to the splicing characteristics corresponding to the multiple dimensions;
determining the recognition result of the training sample data based on the classification submodel of the content recognition model and the fusion features;
and adjusting parameters of the content recognition model according to the recognition result and the associated label to obtain the trained content recognition model.
Example 10 provides a content recognition apparatus according to one or more embodiments of the present disclosure, wherein the apparatus includes:
the receiving module is used for receiving the content to be identified;
the processing module is used for obtaining the identification results of the content identification pairs according to the content identification pairs containing the content to be identified and a content identification model, wherein each content identification pair further comprises a candidate content in a preset set, the content identification model is used for obtaining the dimensional characteristics of the content to be identified and the candidate content under multiple dimensions, and determining the identification results based on the dimensional characteristics, and the dimensions are used for representing multiple types of components in the content to be identified;
and the first determining module is used for determining a target recognition result of the content to be recognized according to a plurality of recognition results.
Example 11 provides a computer readable medium having a computer program stored thereon, wherein the program, when executed by a processing device, implements the steps of the method of any of examples 1-9, in accordance with one or more embodiments of the present disclosure.
Example 12 provides, in accordance with one or more embodiments of the present disclosure, an electronic device, comprising:
a storage device having a computer program stored thereon;
processing means for executing said computer program in said storage means to carry out the steps of the method of any of examples 1-9.
The foregoing description is merely illustrative of the preferred embodiments of the present disclosure and of the principles of the technology employed. Those skilled in the art will appreciate that the scope of the disclosure is not limited to the particular combinations of the features described above, and also encompasses other technical solutions formed by any combination of the above features or their equivalents without departing from the spirit of the disclosure, for example, a technical solution formed by replacing the above features with (but not limited to) features having similar functions disclosed in this disclosure.
Further, while operations are depicted in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order. Under certain circumstances, multitasking and parallel processing may be advantageous. Likewise, while several specific implementation details are included in the above discussion, these should not be construed as limitations on the scope of the disclosure. Certain features that are described in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable subcombination.
Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are disclosed as example forms of implementing the claims. With regard to the apparatus in the above-described embodiment, the specific manner in which each module performs the operation has been described in detail in the embodiment related to the method, and will not be elaborated here.

Claims (12)

1. A method for identifying content, the method comprising:
receiving content to be identified;
obtaining identification results of a plurality of content identification pairs according to a plurality of content identification pairs containing the content to be identified and a content identification model, wherein each content identification pair further comprises a candidate content in a preset set, the content identification model is used for obtaining dimension characteristics of the content to be identified and the candidate content under a plurality of dimensions, and determining the identification results based on the dimension characteristics, and the dimensions are used for representing various types of components in the content to be identified;
and determining a target recognition result of the content to be recognized according to the plurality of recognition results.
2. The method according to claim 1, wherein the dimensions correspond one-to-one to feature extraction submodels of the content identification model, and obtaining the identification results of the plurality of content identification pairs according to the plurality of content identification pairs containing the content to be identified and the content identification model comprises:
for each content identification pair, determining corresponding dimension characteristics of the content to be identified and the candidate content under multiple dimensions respectively according to the content to be identified, the candidate content in the content identification pair and the feature extraction submodel corresponding to each dimension;
for each dimension, splicing the features of the content to be identified and the candidate content in the dimension to obtain a spliced feature corresponding to the dimension;
acquiring fusion characteristics corresponding to the content identification pairs according to the splicing characteristics corresponding to the multiple dimensions;
and determining the recognition result of the content recognition pair based on the classification submodel of the content recognition model and the fusion characteristics.
3. The method according to claim 2, wherein the obtaining of the fusion feature corresponding to the content identification pair according to the splicing feature corresponding to the plurality of dimensions comprises:
processing the splicing features based on a first attention layer aiming at the splicing features corresponding to each dimension to obtain attention features corresponding to the splicing features;
splicing the attention features under each dimension to obtain a multi-dimension splicing feature;
and processing the multi-dimensional splicing feature based on a second attention layer to obtain the fusion feature.
4. The method of claim 2, wherein the recognition result comprises a recognition parameter corresponding to a classification of the content to be recognized, and the determining the recognition result of the content recognition pair based on a classification submodel of the content recognition model and the fusion feature comprises:
obtaining feature vectors of which the fusion features respectively correspond to a plurality of classes according to the fusion features and the classification submodels, wherein the classes comprise similar content classes, low-quality content classes and original content classes;
and processing the feature vector to obtain the identification parameters of the fusion features corresponding to each classification so as to obtain the identification result.
5. The method according to claim 1, wherein the recognition result comprises a recognition parameter corresponding to a classification of the content to be recognized, and the determining the target recognition result of the content to be recognized according to a plurality of recognition results comprises:
acquiring identification parameters respectively corresponding to the similar content classification in a plurality of identification results;
if the identification parameters of the similar content classifications meet similar identification conditions, determining that the target identification result is a similar content classification, wherein the similar identification conditions comprise:
the maximum value of the identification parameters corresponding to the similar content classification is larger than a first preset threshold value;
the maximum value of the identification parameters corresponding to the similar content classification is smaller than or equal to the first preset threshold, and the average value of the identification parameters corresponding to the similar content classification is larger than a second preset threshold.
6. The method according to claim 5, wherein the determining a target recognition result of the content to be recognized according to a plurality of the recognition results further comprises:
acquiring identification parameters respectively corresponding to low-quality content classification in a plurality of identification results;
if the identification parameters of the low-quality content classifications meet a low-quality identification condition and the identification parameters of the similar content classifications do not meet the similar identification condition, determining that the target identification result is the low-quality content classification, wherein the low-quality identification condition comprises:
the maximum value of the identification parameter corresponding to the low-quality content classification is greater than a third preset threshold value;
the maximum value of the identification parameters corresponding to the low-quality content classification is less than or equal to the third preset threshold, and the average value of the identification parameters corresponding to the low-quality content classification is greater than a fourth preset threshold.
7. The method according to claim 6, wherein the determining a target recognition result of the content to be recognized according to a plurality of the recognition results further comprises:
acquiring identification parameters respectively corresponding to the classification of the original content in a plurality of identification results;
and if the maximum value of the identification parameters of the original content classifications is larger than a fifth preset threshold value, the identification parameters of the similar content classifications do not meet similar identification conditions, and the identification parameters of the low-quality content classifications do not meet low-quality identification conditions, determining that the target identification result is the original content classification.
8. The method according to any one of claims 5-7, further comprising:
under the condition that the target identification result is determined to be similar content classification, determining a target position of which the attention parameter is greater than a preset threshold value in fusion characteristics according to the fusion characteristics corresponding to a target content identification pair, wherein the target content identification pair is a content identification pair corresponding to the maximum value of the identification parameters of the similar content classification;
determining the content corresponding to the target position in the candidate content of the target content identification pair as the coincidence content corresponding to the content to be identified;
and outputting the candidate content and the coincidence content in the target content identification pair.
9. The method of claim 1, wherein the dimensions correspond one-to-one to feature extraction submodels of the content recognition model, the content recognition model being obtained by:
acquiring the training sample data, wherein the training sample data comprises sample content, associated content corresponding to the sample content and an associated label;
for each training sample data, determining corresponding dimension characteristics of the sample content and the associated content under multiple dimensions respectively according to the sample content, the associated content corresponding to the sample content and the feature extraction submodel corresponding to each dimension;
for each dimension, splicing the characteristics of the sample content and the associated content in the dimension to obtain a spliced characteristic corresponding to the dimension;
acquiring fusion characteristics corresponding to the training sample data according to the splicing characteristics corresponding to the multiple dimensions;
determining the recognition result of the training sample data based on the classification submodel of the content recognition model and the fusion features;
and adjusting parameters of the content recognition model according to the recognition result and the associated label to obtain the trained content recognition model.
10. An apparatus for identifying content, the apparatus comprising:
a receiving module configured to receive content to be identified;
a processing module configured to obtain a recognition result for each content identification pair according to content identification pairs containing the content to be identified and a content recognition model, wherein each content identification pair further comprises one candidate content from a preset set, the content recognition model is configured to obtain dimensional features of the content to be identified and of the candidate content under multiple dimensions and to determine the recognition result based on the dimensional features, and the dimensions represent multiple types of components in the content to be identified;
and a first determining module configured to determine a target recognition result for the content to be identified according to the plurality of recognition results.
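The apparatus flow in claim 10 can be sketched as: pair the content to be identified with every candidate in the preset set, score each pair with the content recognition model, then select the target recognition result from the per-pair results. The character-overlap scorer below is an illustrative placeholder standing in for the model, not the patent's actual implementation.

```python
def recognition_model(content, candidate):
    # Hypothetical stand-in model: Jaccard overlap of the two contents'
    # character sets as the pair's recognition score.
    a, b = set(content), set(candidate)
    return len(a & b) / max(len(a | b), 1)

def identify(content, preset_set):
    # Build one content identification pair per candidate and score it.
    results = [(candidate, recognition_model(content, candidate))
               for candidate in preset_set]
    # Target recognition result: the candidate whose pair scores highest.
    return max(results, key=lambda r: r[1])
```

For example, `identify("abcd", ["abce", "xyz"])` selects `"abce"`, since that pair has the higher recognition score.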
11. A computer-readable medium on which a computer program is stored, wherein the program, when executed by a processing apparatus, carries out the steps of the method of any one of claims 1-9.
12. An electronic device, comprising:
a storage apparatus having a computer program stored thereon;
and a processing apparatus for executing the computer program in the storage apparatus to carry out the steps of the method according to any one of claims 1-9.
CN202111235927.XA 2021-10-22 2021-10-22 Content identification method, device, medium and electronic equipment Pending CN113971402A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111235927.XA CN113971402A (en) 2021-10-22 2021-10-22 Content identification method, device, medium and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111235927.XA CN113971402A (en) 2021-10-22 2021-10-22 Content identification method, device, medium and electronic equipment

Publications (1)

Publication Number Publication Date
CN113971402A true CN113971402A (en) 2022-01-25

Family

ID=79588054

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111235927.XA Pending CN113971402A (en) 2021-10-22 2021-10-22 Content identification method, device, medium and electronic equipment

Country Status (1)

Country Link
CN (1) CN113971402A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114463567A (en) * 2022-04-12 2022-05-10 北京吉道尔科技有限公司 Block chain-based intelligent education operation big data plagiarism prevention method and system

Similar Documents

Publication Publication Date Title
CN110321958B (en) Training method of neural network model and video similarity determination method
CN109919244B (en) Method and apparatus for generating a scene recognition model
CN110674349B (en) Video POI (Point of interest) identification method and device and electronic equipment
CN110347866B (en) Information processing method, information processing device, storage medium and electronic equipment
CN109961032B (en) Method and apparatus for generating classification model
CN112364829B (en) Face recognition method, device, equipment and storage medium
CN115294501A (en) Video identification method, video identification model training method, medium and electronic device
CN109816023B (en) Method and device for generating picture label model
CN113033707B (en) Video classification method and device, readable medium and electronic equipment
CN112990176B (en) Writing quality evaluation method and device and electronic equipment
CN113992944A (en) Video cataloging method, device, equipment, system and medium
CN114445754A (en) Video processing method and device, readable medium and electronic equipment
CN113971402A (en) Content identification method, device, medium and electronic equipment
CN115346145A (en) Method, device, storage medium and computer program product for identifying repeated video
CN111327960B (en) Article processing method and device, electronic equipment and computer storage medium
CN113011169A (en) Conference summary processing method, device, equipment and medium
CN112309389A (en) Information interaction method and device
CN113033682B (en) Video classification method, device, readable medium and electronic equipment
CN115359400A (en) Video identification method, device, medium and electronic equipment
CN113987264A (en) Video abstract generation method, device, equipment, system and medium
CN114187557A (en) Method, device, readable medium and electronic equipment for determining key frame
CN114428867A (en) Data mining method and device, storage medium and electronic equipment
CN112699687A (en) Content cataloging method and device and electronic equipment
CN113420723A (en) Method and device for acquiring video hotspot, readable medium and electronic equipment
CN113343069A (en) User information processing method, device, medium and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination