CN110413603A - Method, apparatus, electronic device, and computer storage medium for determining duplicate data

Publication number
CN110413603A
Authority
CN
China
Prior art keywords
data
feature vector
image
video
reference data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910723196.XA
Other languages
Chinese (zh)
Other versions
CN110413603B (en)
Inventor
李�根
何轶
陈扬羽
李磊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing ByteDance Network Technology Co Ltd
Original Assignee
Beijing ByteDance Network Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing ByteDance Network Technology Co Ltd
Priority to CN201910723196.XA
Publication of CN110413603A
Application granted
Publication of CN110413603B
Legal status: Active
Anticipated expiration

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20: Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21: Design, administration or maintenance of databases
    • G06F16/215: Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Quality & Reliability (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Image Analysis (AREA)

Abstract

The present disclosure provides a method, an apparatus, an electronic device, and a computer storage medium for determining duplicate data. The method comprises: obtaining reference data; determining, based on a first feature vector of a first data type of the reference data and a second feature vector of the first data type of each piece of data to be compared, candidate data in the data to be compared that duplicate the reference data; and determining, from the candidate data, the data that duplicate the reference data according to a third feature vector of a second data type of the reference data and a fourth feature vector of the second data type of each piece of data in the candidate data. In embodiments of the disclosure, determining the candidate data from the first feature vector of the reference data and the second feature vectors of the data to be compared improves data-processing efficiency, and the data that duplicate the reference data can then be determined accurately from the candidate data based on the third feature vector and the fourth feature vector of each piece of candidate data.

Description

Method, apparatus, electronic device, and computer storage medium for determining duplicate data
Technical field
The present disclosure relates to the technical field of data processing, and in particular to a method, an apparatus, an electronic device, and a computer storage medium for determining duplicate data.
Background
In the prior art, a database usually contains some similar or identical data. When processing is performed on the data in the database, these identical or similar data often degrade the experience of using the data. How to determine the duplicate data (identical or similar data) in a database quickly and accurately is therefore a problem that urgently needs to be solved.
Summary of the invention
The purpose of the present disclosure is to solve at least one of the above technical deficiencies and to improve data-processing efficiency and the accuracy of duplicate-data determination. The technical solution adopted by the disclosure is as follows:
In a first aspect, the present disclosure provides a method for determining duplicate data, the method comprising:
Obtaining reference data, the reference data being an image or a video;
Determining, based on a first feature vector of a first data type of the reference data and a second feature vector of the first data type of each piece of data to be compared, candidate data in the data to be compared that duplicate the reference data, wherein the data to be compared include at least one of an image and a video;
Determining, from the candidate data, the data that duplicate the reference data according to a third feature vector of a second data type of the reference data and a fourth feature vector of the second data type of each piece of data in the candidate data, wherein the precision of data of the second data type is higher than the precision of data of the first data type.
In an embodiment of the first aspect of the disclosure, the first data type is an integer type and the second data type is a floating-point type.
In an embodiment of the first aspect of the disclosure, determining the candidate data in the data to be compared that duplicate the reference data, based on the first feature vector of the first data type of the reference data and the second feature vector of the first data type of each piece of data to be compared, comprises:
Determining, as the candidate data, the data corresponding to any second feature vector that has the same value as the first feature vector in at least one dimension.
In an embodiment of the first aspect of the disclosure, determining, as the candidate data, the data corresponding to any second feature vector that has the same value as the first feature vector in at least one dimension comprises:
Obtaining an inverted index, the inverted index being built from the second feature vectors;
Determining, based on the first feature vector and the inverted index, the data corresponding to any second feature vector that has the same value as the first feature vector in at least one dimension as the candidate data.
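The inverted-index lookup described above can be sketched as follows. This is a minimal illustration, not the disclosed implementation: it assumes the index is keyed by (dimension, value) pairs, and the function names and sample vectors are hypothetical.

```python
from collections import defaultdict

def build_inverted_index(second_vectors):
    """Map each (dimension, value) pair to the ids of the items that hold it.

    Assumption: keying the index by (dimension, value) is one possible layout;
    the source only says the index is built from the second feature vectors.
    """
    index = defaultdict(set)
    for item_id, vector in second_vectors.items():
        for dim, value in enumerate(vector):
            index[(dim, value)].add(item_id)
    return index

def find_candidates(first_vector, index):
    """Return ids whose vector equals first_vector in at least one dimension."""
    candidates = set()
    for dim, value in enumerate(first_vector):
        candidates |= index.get((dim, value), set())
    return candidates

# Hypothetical integer feature vectors of three items to be compared.
second_vectors = {"img_1": [3, 7, 1], "vid_2": [4, 7, 9], "img_3": [0, 2, 5]}
index = build_inverted_index(second_vectors)
candidates = find_candidates([3, 8, 9], index)
```

With an index of this shape, each dimension of the reference vector costs one lookup, so the reference data never has to be compared against every item directly.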
In an embodiment of the first aspect of the disclosure, determining, from the candidate data, the data that duplicate the reference data according to the third feature vector of the second data type of the reference data and the fourth feature vector of the second data type of each piece of data in the candidate data comprises:
Determining a feature similarity between the third feature vector and each fourth feature vector;
Determining, as data that duplicate the reference data, the candidate data whose feature similarity is greater than a similarity threshold.
In an embodiment of the first aspect of the disclosure, determining the feature similarity between the third feature vector and the fourth feature vector comprises:
Determining, based on the third feature vector and the fourth feature vector, the cosine distance between the third feature vector and the fourth feature vector;
Determining the feature similarity according to the cosine distance.
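The similarity step above can be sketched as follows. The source does not state how the similarity is derived from the cosine distance; this sketch assumes the common convention that cosine distance is 1 minus the cosine of the angle, and that similarity is 1 minus the distance. The threshold value 0.9 and all names are hypothetical.

```python
import math

def cosine_distance(u, v):
    """Cosine distance, assumed here to be 1 - cos(angle between u and v)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 1.0  # treat a zero vector as maximally dissimilar
    return 1.0 - dot / (norm_u * norm_v)

def feature_similarity(u, v):
    """Assumed mapping from distance to similarity: similarity = 1 - distance."""
    return 1.0 - cosine_distance(u, v)

def select_duplicates(third_vector, fourth_vectors, threshold=0.9):
    """Keep the candidates whose similarity to the reference exceeds the threshold."""
    return [item_id for item_id, vec in fourth_vectors.items()
            if feature_similarity(third_vector, vec) > threshold]

# Hypothetical floating-point vectors: "a" points the same way as the reference.
duplicates = select_duplicates([1.0, 0.0], {"a": [2.0, 0.0], "b": [0.0, 1.0]})
```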
In an embodiment of the first aspect of the disclosure, the reference data is a piece of data in a first database, and the data to be compared are the data in the first database other than the reference data;
Alternatively,
Obtaining the reference data comprises:
Obtaining a search keyword;
Obtaining a search result from a second database according to the search keyword, wherein the reference data is a piece of data in the search result, and the data to be compared are either the data in the search result other than the reference data or the data in the second database.
In an embodiment of the first aspect of the disclosure, if the reference data is a piece of data in the first database, then after the data that duplicate the reference data are determined, the method further comprises:
Deduplicating the first database based on the data that duplicate the reference data;
If the data to be compared are the data in the search result other than the reference data, then after the data that duplicate the reference data are determined, the method further comprises:
Deduplicating the search result based on the data that duplicate the reference data, and using the deduplicated search result as the final search result; or deleting both the reference data and the data that duplicate the reference data;
If the data to be compared are the data in the second database, then after the data that duplicate the reference data are determined, the method further comprises:
Deduplicating the second database based on the data that duplicate the reference data.
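The two search-result options above (keep the reference data and drop its duplicates, or delete the reference data together with its duplicates) can be sketched as follows. The function and its parameters are hypothetical names chosen for illustration, and items are represented only by their ids.

```python
def deduplicate(item_ids, reference_id, duplicate_ids, drop_reference=False):
    """Remove the duplicates of the reference item from an ordered list of ids.

    drop_reference=False keeps the reference item (plain deduplication);
    drop_reference=True also deletes the reference item itself.
    """
    removed = set(duplicate_ids)
    if drop_reference:
        removed.add(reference_id)
    return [item_id for item_id in item_ids if item_id not in removed]

# Hypothetical search result in which "a" and "c" duplicate the reference "r".
search_result = ["r", "a", "b", "c"]
final_result = deduplicate(search_result, "r", ["a", "c"])
purged_result = deduplicate(search_result, "r", ["a", "c"], drop_reference=True)
```

The second variant could suit the screening case, where the reference data itself is content that should not be returned; that reading is an assumption, not something the source states.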
In an embodiment of the first aspect of the disclosure, if the reference data is a video, the first feature vector and the third feature vector of the reference data are determined as follows:
Extracting image features of frame images in the video;
Determining the first feature vector and the third feature vector of the video based on the image features of the extracted frame images.
In an embodiment of the first aspect of the disclosure, determining the first feature vector and the third feature vector of the video based on the image features of the extracted frame images comprises:
Determining, based on the image features of the extracted frame images, the average of the image features of the frame images to obtain an average image feature; and determining the first feature vector and the third feature vector of the video based on the average image feature;
Alternatively,
Determining, based on the image features of the extracted frame images, the first feature vector and the third feature vector of each frame image; and taking the average of the first feature vectors of the frame images as the first feature vector of the video, and the average of the third feature vectors of the frame images as the third feature vector of the video.
In a second aspect, the present disclosure provides an apparatus for determining duplicate data, the apparatus comprising:
A data acquisition module, configured to obtain reference data, the reference data being an image or a video;
A candidate data determining module, configured to determine, based on a first feature vector of a first data type of the reference data and a second feature vector of the first data type of each piece of data to be compared, candidate data in the data to be compared that duplicate the reference data, wherein the data to be compared include at least one of an image and a video;
A duplicate data determining module, configured to determine, from the candidate data, the data that duplicate the reference data according to a third feature vector of a second data type of the reference data and a fourth feature vector of the second data type of each piece of data in the candidate data, wherein the precision of data of the second data type is higher than the precision of data of the first data type.
In an embodiment of the second aspect of the disclosure, the first data type is an integer type and the second data type is a floating-point type.
In an embodiment of the second aspect of the disclosure, when determining the candidate data in the data to be compared that duplicate the reference data, based on the first feature vector of the first data type of the reference data and the second feature vector of the first data type of each piece of data to be compared, the candidate data determining module is specifically configured to:
Determine, as the candidate data, the data corresponding to any second feature vector that has the same value as the first feature vector in at least one dimension.
In an embodiment of the second aspect of the disclosure, when determining, as the candidate data, the data corresponding to any second feature vector that has the same value as the first feature vector in at least one dimension, the candidate data determining module is specifically configured to:
Obtain an inverted index, the inverted index being built from the second feature vectors;
Determine, based on the first feature vector and the inverted index, the data corresponding to any second feature vector that has the same value as the first feature vector in at least one dimension as the candidate data.
In an embodiment of the second aspect of the disclosure, when determining, from the candidate data, the data that duplicate the reference data according to the third feature vector of the second data type of the reference data and the fourth feature vector of the second data type of each piece of data in the candidate data, the duplicate data determining module is specifically configured to:
Determine a feature similarity between the third feature vector and each fourth feature vector;
Determine, as data that duplicate the reference data, the candidate data whose feature similarity is greater than a similarity threshold.
In an embodiment of the second aspect of the disclosure, when determining the feature similarity between the third feature vector and the fourth feature vector, the duplicate data determining module is specifically configured to:
Determine, based on the third feature vector and the fourth feature vector, the cosine distance between the third feature vector and the fourth feature vector;
Determine the feature similarity according to the cosine distance.
In an embodiment of the second aspect of the disclosure, the reference data is a piece of data in a first database, and the data to be compared are the data in the first database other than the reference data;
Alternatively,
When obtaining the reference data, the data acquisition module is specifically configured to:
Obtain a search keyword;
Obtain a search result from a second database according to the search keyword, wherein the reference data is a piece of data in the search result, and the data to be compared are either the data in the search result other than the reference data or the data in the second database.
In an embodiment of the second aspect of the disclosure, if the reference data is a piece of data in the first database, then for use after the data that duplicate the reference data are determined, the apparatus further comprises:
A first data processing module, configured to deduplicate the first database based on the data that duplicate the reference data;
If the data to be compared are the data in the search result other than the reference data, then for use after the data that duplicate the reference data are determined, the apparatus further comprises:
A second data processing module, configured to deduplicate the search result based on the data that duplicate the reference data and use the deduplicated search result as the final search result; or to delete both the reference data and the data that duplicate the reference data;
If the data to be compared are the data in the second database, then for use after the data that duplicate the reference data are determined, the apparatus further comprises:
A third data processing module, configured to deduplicate the second database based on the data that duplicate the reference data.
In an embodiment of the second aspect of the disclosure, if the reference data is a video, the apparatus further comprises a feature vector determining module, configured to determine the first feature vector of the reference data as follows:
Extracting image features of frame images in the video;
Determining the first feature vector and the third feature vector of the video based on the image features of the extracted frame images.
In an embodiment of the second aspect of the disclosure, when determining the first feature vector and the third feature vector of the video based on the image features of the extracted frame images, the feature vector determining module is specifically configured to:
Determine, based on the image features of the extracted frame images, the average of the image features of the frame images to obtain an average image feature; and determine the first feature vector and the third feature vector of the video based on the average image feature;
Alternatively,
Determine, based on the image features of the extracted frame images, the first feature vector and the third feature vector of each frame image; and take the average of the first feature vectors of the frame images as the first feature vector of the video, and the average of the third feature vectors of the frame images as the third feature vector of the video.
In a third aspect, the present disclosure provides an electronic device, the electronic device comprising:
A processor and a memory;
The memory, configured to store computer operation instructions;
The processor, configured to execute, by invoking the computer operation instructions, the method shown in any embodiment of the first aspect of the disclosure.
In a fourth aspect, the present disclosure provides a computer-readable storage medium. The storage medium stores at least one instruction, at least one program segment, a code set, or an instruction set, which is loaded and executed by a processor to implement the method shown in any embodiment of the first aspect of the disclosure.
The technical solution provided by the embodiments of the disclosure has the following beneficial effects:
In the method, apparatus, electronic device, and computer storage medium for determining duplicate data of the embodiments of the disclosure, both the feature vectors of the first data type (the first and second feature vectors) and the feature vectors of the second data type (the third and fourth feature vectors) reflect the characteristics of an image. The feature vectors of the second data type have higher precision than those of the first data type and therefore reflect the characteristics of an image in greater detail. Accordingly, the candidate data that duplicate the reference data can first be determined from the data to be compared based on the first feature vector of the reference data and the second feature vector of each piece of data to be compared, which improves data-processing efficiency. The data that duplicate the reference data can then be determined accurately from the candidate data based on the third feature vector of the reference data and the fourth feature vector of each piece of candidate data. The scheme of the disclosure can thus determine the data that duplicate the reference data from the data to be compared quickly and accurately. In addition, based on the scheme of the disclosure, duplicate data can be determined from the data to be compared (mixed data that include both videos and images) using the features of an image or a video alone (a single kind of feature).
Brief description of the drawings
To explain the technical solutions in the embodiments of the disclosure more clearly, the drawings needed in the description of the embodiments are briefly introduced below.
Fig. 1 is a flow diagram of a method for determining duplicate data provided by an embodiment of the disclosure;
Fig. 2 is a structural diagram of an apparatus for determining duplicate data provided by an embodiment of the disclosure;
Fig. 3 is a structural diagram of an electronic device provided by an embodiment of the disclosure.
Detailed description
Embodiments of the disclosure are described in detail below, and examples of the embodiments are shown in the accompanying drawings, in which the same or similar reference numerals denote, throughout, the same or similar elements or elements having the same or similar functions. The embodiments described below with reference to the drawings are exemplary; they serve only to explain the technical solution of the disclosure and are not to be construed as limiting the disclosure.
Those skilled in the art will understand that, unless expressly stated otherwise, the singular forms "a", "an", and "the" used herein may also include the plural. It should be further understood that the word "comprising" used in the specification of the disclosure refers to the presence of the stated features, integers, steps, operations, elements, and/or components, but does not exclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. It should be understood that when an element is said to be "connected" or "coupled" to another element, it may be directly connected or coupled to the other element, or intermediate elements may be present. In addition, "connected" or "coupled" as used herein may include a wireless connection or a wireless coupling. The term "and/or" as used herein includes all or any of the units and all combinations of one or more of the associated listed items.
The technical solution of the disclosure, and how it solves the above technical problem, are described in detail below with specific embodiments. The following specific embodiments may be combined with one another, and the same or similar concepts or processes may not be repeated in some embodiments. Embodiments of the disclosure are described below with reference to the accompanying drawings.
An embodiment of the disclosure provides a method for determining duplicate data. As shown in Fig. 1, the method may include:
Step S110: obtain reference data, the reference data being an image or a video.
The embodiments of the disclosure do not limit the source of the reference data. Specifically, the reference data can be determined based on actual needs. For example, the reference data may be a preset image or video; it may be data in a first database, such as an image or video in a blacklist database; or it may be data obtained based on a search keyword, such as a picture or video found by a keyword search. The keyword may be a search keyword obtained in real time or a preconfigured keyword; for example, when prohibited images or videos need to be screened, the keyword may be a prohibited word, a sensitive word, and so on.
As an example, suppose the keyword is A and the pictures and/or videos corresponding to keyword A are looked up in a database. A keyword usually corresponds to multiple images and/or videos, so one of the images or videos found can serve as the reference data; for instance, the results can be sorted and the first image or video in the results used as the reference data.
Step S120: based on the first feature vector of the first data type of the reference data and the second feature vector of the first data type of each piece of data to be compared, determine the candidate data in the data to be compared that duplicate the reference data, wherein the data to be compared include at least one of an image and a video.
Specifically, image features can reflect the characteristics of the reference data, and in the disclosure those characteristics are expressed through feature vectors. Based on the first feature vector of the reference data and the second feature vector of the first data type of each piece of data to be compared, the second feature vectors that are close or identical to the first feature vector can be determined from the data to be compared. The data corresponding to those second feature vectors are the candidate data that duplicate the reference data. The candidate data may all be duplicates of the reference data, or only some of them may be. Using the first feature vector of the reference data, the data that may duplicate the reference data can thus be screened out of the large volume of data to be compared, which reduces the amount of data to be processed and improves data-processing efficiency.
The source of the data to be compared is not limited: the data may come from the same database or from different databases.
In the scheme of the disclosure, data that duplicate the reference data are data identical or similar to the reference data. The reference data can be a video or an image, and the data identical or similar to the reference data can be images or videos. The data to be compared may include at least one of video and image; if they include both videos and images, the candidate data may also include at least one of video and image. That is, starting from an image or a video, the candidate data determined from the data to be compared may be images, videos, or both.
Step S130: according to the third feature vector of the second data type of the reference data and the fourth feature vector of the second data type of each piece of data in the candidate data, determine from the candidate data the data that duplicate the reference data, wherein the precision of data of the second data type is higher than the precision of data of the first data type.
Specifically, because the precision of data of the second data type is higher than that of data of the first data type, the precision of the third and fourth feature vectors is correspondingly higher than that of the first and second feature vectors. A feature vector of the second data type describes the characteristics of an image more accurately, so the data that duplicate the reference data can be determined more accurately from the candidate data based on the third feature vector of the reference data and the fourth feature vector of the second data type of each piece of candidate data. It can be understood that, since the candidate data include at least one of image and video, the data that duplicate the reference data may likewise include at least one of image and video.
In the scheme of the embodiments of the disclosure, both the feature vectors of the first data type (the first and second feature vectors) and the feature vectors of the second data type (the third and fourth feature vectors) reflect the characteristics of an image. The feature vectors of the second data type have higher precision than those of the first data type and therefore reflect the characteristics of an image in greater detail. Accordingly, the candidate data that duplicate the reference data can first be determined from the data to be compared based on the first feature vector of the reference data and the second feature vector of each piece of data to be compared, which improves data-processing efficiency. The data that duplicate the reference data can then be determined accurately from the candidate data based on the third feature vector of the reference data and the fourth feature vector of each piece of candidate data. The scheme of the disclosure can thus determine the data that duplicate the reference data from the data to be compared quickly and accurately. In addition, based on the scheme of the disclosure, duplicate data can be determined from the data to be compared (mixed data that include both videos and images) using the features of an image or a video alone (a single kind of feature).
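The coarse-then-fine flow of steps S120 and S130 can be sketched end to end as follows. This is a minimal sketch under stated assumptions: the coarse filter uses the "same value in at least one dimension" rule from one embodiment, the fine step uses cosine similarity with a hypothetical threshold of 0.9, and the dictionary keys `int_vec` and `float_vec` are illustrative names rather than anything defined in the source.

```python
import math

def shares_a_dimension(u, v):
    """Coarse test on integer vectors: same value in at least one dimension."""
    return any(a == b for a, b in zip(u, v))

def cosine_similarity(u, v):
    """Fine test on floating-point vectors (0.0 if either vector is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def find_duplicates(reference, items, threshold=0.9):
    """Step S120 (cheap integer filter), then step S130 (float comparison)."""
    candidates = {item_id: item for item_id, item in items.items()
                  if shares_a_dimension(reference["int_vec"], item["int_vec"])}
    return sorted(item_id for item_id, item in candidates.items()
                  if cosine_similarity(reference["float_vec"], item["float_vec"]) > threshold)

# Hypothetical data: only "a" passes both stages.
reference = {"int_vec": [1, 2], "float_vec": [1.0, 2.0]}
items = {
    "a": {"int_vec": [1, 5], "float_vec": [1.1, 2.1]},   # candidate and duplicate
    "b": {"int_vec": [9, 9], "float_vec": [5.0, -1.0]},  # rejected by the coarse filter
    "c": {"int_vec": [0, 2], "float_vec": [-1.0, 0.5]},  # candidate, but not similar enough
}
duplicates = find_duplicates(reference, items)
```

The design point is that the expensive floating-point comparison runs only on the items that survive the integer filter.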
In an embodiment of the disclosure, the first data type can be an integer type and the second data type can be a floating-point type.
Specifically, floating-point data has higher precision than integer data. For ease of understanding, in what follows the first feature vector is referred to as the first integer feature vector, the second feature vector as the second integer feature vector, the third feature vector as the first floating-point feature vector, and the fourth feature vector as the second floating-point feature vector.
In an embodiment of the disclosure, if the reference data is a video, the first feature vector and the third feature vector of the reference data are determined as follows:
Extracting image features of frame images in the video;
Determining the first feature vector and the third feature vector of the video based on the image features of the extracted frame images.
Specifically, if the reference data is a video, the video consists of consecutive frame images, so the first feature vector and the third feature vector can be determined based on the image features of the frame images in the video. The frame images can be obtained in any of the following ways. In the first way, all frame images in the video are used as the frame images of the video. In the second way, images are extracted uniformly from the video as frame images; for example, images are extracted from the video at a preset interval, which can be configured based on actual needs. If the preset interval is 5, one image is extracted every 5 frames as a frame image of the video. In the third way, images are extracted from the video according to key frames, which can be configured based on actual needs. If the key frames are the 5th, 25th, and 38th frames, the 5th, 25th, and 38th frames of the video are extracted as the frame images of the video.
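The three ways of choosing frame images can be sketched as follows. The function name, the mode strings, and the choice to start uniform sampling at the first frame are assumptions made for illustration; the frames here are stand-in integers rather than decoded images.

```python
def sample_frames(frames, mode="all", interval=5, key_frames=()):
    """Select frame images from a decoded video (a sequence of frames).

    mode="all":     use every frame.
    mode="uniform": take one frame every `interval` frames, starting at the
                    first frame (the starting offset is an assumption).
    mode="key":     take the frames whose 1-based numbers are in `key_frames`.
    """
    if mode == "all":
        return list(frames)
    if mode == "uniform":
        return list(frames[::interval])
    if mode == "key":
        return [frames[n - 1] for n in key_frames if 1 <= n <= len(frames)]
    raise ValueError(f"unknown mode: {mode}")

# Stand-in for a 40-frame video: each "frame" is just its 1-based number.
video = list(range(1, 41))
every_frame = sample_frames(video, mode="all")
uniform = sample_frames(video, mode="uniform", interval=5)
keyed = sample_frames(video, mode="key", key_frames=(5, 25, 38))
```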
In embodiment of the disclosure, based on the characteristics of image of extracted frame image, the first eigenvector of video is determined May include following any mode with third feature vector:
Based on the characteristics of image of extracted frame image, the average value of the characteristics of image of each frame image in frame image is determined, Obtain the average image feature;Based on the average image feature, the first eigenvector and third feature vector of video are determined;
Alternatively,
Based on the characteristics of image of extracted frame image, the first eigenvector and third of each frame image in frame image are determined Feature vector;The average value of the first eigenvector of each frame image is determined as to the first eigenvector of video, by each frame image The average value of third feature vector be determined as the third feature vector of video.
Specifically, if the reference data is a video, the first feature vector and the third feature vector of the reference data are those of the video. The first feature vector and the third feature vector are defined for images, that is, they are usually the feature vectors corresponding to an image, so the first feature vector and the third feature vector of the video can be determined in any of the following ways:
Based on the extracted image features of the frame images, determining the first feature vector of the video may include at least one of the following modes:
First mode: based on the extracted image features of the frame images, determine the average of the image features of the frame images to obtain an average image feature; based on the average image feature, determine the first feature vector of the video.
Second mode: based on the extracted image features of the frame images, determine the first feature vector of each frame image; determine the average of the first feature vectors of the frame images as the first feature vector of the video.
Based on the extracted image features of the frame images, determining the third feature vector of the video may include at least one of the following modes:
First mode: based on the extracted image features of the frame images, determine the average of the image features of the frame images to obtain an average image feature; based on the average image feature, determine the third feature vector of the video.
Second mode: based on the extracted image features of the frame images, determine the third feature vector of each frame image; determine the average of the third feature vectors of the frame images as the third feature vector of the video.
It can be understood that the second feature vector and the fourth feature vector of a video in the data to be compared can also be determined in the ways described above.
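The two aggregation modes can be illustrated with a minimal Python sketch. It assumes that the feature vector of the first data type is an integer vector obtained by thresholding the image feature and that the feature vector of the second data type is the floating-point image feature itself; the function names and the threshold of 0.0 are hypothetical and are not taken from the disclosure.

```python
from typing import List, Sequence, Tuple


def mean_vector(vectors: Sequence[Sequence[float]]) -> List[float]:
    """Element-wise mean of equally sized vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def binarize(feature: Sequence[float], threshold: float = 0.0) -> List[int]:
    """Hypothetical integer mapping: 1 where the value exceeds the threshold, else 0."""
    return [1 if x > threshold else 0 for x in feature]


def video_vectors_mode1(frame_features: Sequence[Sequence[float]]) -> Tuple[List[int], List[float]]:
    """Mode 1: average the frame image features first, then derive both vectors."""
    average_feature = mean_vector(frame_features)
    return binarize(average_feature), average_feature


def video_vectors_mode2(frame_features: Sequence[Sequence[float]]) -> Tuple[List[float], List[float]]:
    """Mode 2: derive the vectors of every frame first, then average each kind."""
    integer_vectors = [binarize(f) for f in frame_features]
    # The mean of binary vectors is generally fractional; the source text does
    # not say whether it is rounded, so it is left unrounded here.
    return mean_vector(integer_vectors), mean_vector(frame_features)
```

Both functions return a pair (vector of the first data type, vector of the second data type) for the whole video, so a video can be compared with an image through the same kind of vectors.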
In embodiments of the disclosure, in step S120, determining, based on the first feature vector of the first data type of the reference data and the second feature vector of the first data type of each item of data to be compared, the candidate data in the data to be compared that duplicates the reference data may include:
Determining, as candidate data, the data corresponding to a second feature vector that has the same value as the first feature vector in at least one dimension.
Specifically, an integer feature vector may contain elements in multiple dimensions. In the scheme of the disclosure, the integer feature vector corresponding to an image is usually a one-dimensional feature vector; for example, it can be expressed as A = [a1, a2, ..., a9], where A denotes the integer feature vector and a1, a2, ..., a9 denote its elements. Different elements may take the same value or different values.
Whether two different data items are similar can be judged from the values of the elements of their integer feature vectors: if the integer feature vectors of the two items have the same value in at least one dimension, the two items can be regarded as similar.
As an example, suppose the two data items are data A and data B, the integer feature vector of data A contains the elements a1, a2, ..., a9, and the integer feature vector of data B contains the elements b1, b2, ..., b9. If a1 and b1 have the same value, data A and data B can be judged to be similar.
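The comparison in this example can be written as a short Python sketch. The function name is hypothetical, and the sketch assumes that "the same value in at least one dimension" means equal values at the same position of the two vectors.

```python
from typing import Sequence


def shares_dimension_value(a: Sequence[int], b: Sequence[int]) -> bool:
    """Return True if the two vectors hold the same value at the same index at least once."""
    return any(x == y for x, y in zip(a, b))
```

For instance, [7, 2, 9] and [7, 5, 1] agree in the first dimension and are treated as similar, while [1, 2, 3] and [3, 1, 2] contain the same values only at different positions and are not.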
In embodiments of the disclosure, determining, as candidate data, the data corresponding to a second feature vector that has the same value as the first feature vector in at least one dimension may include:
Obtaining an inverted index, the inverted index being established based on the second feature vectors;
Based on the first feature vector and the inverted index, determining, as candidate data, the data corresponding to a second feature vector that has the same value as the first feature vector in at least one dimension.
Specifically, an inverted index can be established in advance based on the second feature vectors of the first data type (the second integer feature vectors) of the data to be compared. Then, based on the established inverted index and the first integer feature vector, the candidate data of the reference data corresponding to the first integer feature vector can be obtained from the database. Because the inverted index is established, the first integer feature vector of the reference data does not need to be compared one by one with the integer feature vector of each item in the data to be compared, which can further improve data processing efficiency.
The inverted index can be established as follows:
1. Obtain the images and videos in the data to be compared;
2. Select frame images from each video; the frame images are usually multiple images. The frame images of a video can be obtained in the ways described above, which are not repeated here.
3. Extract the image features of the images and of the frame images. The disclosure does not limit the method of extracting image features; examples include the CNN (Convolutional Neural Network) algorithm, the SIFT (Scale-Invariant Feature Transform) algorithm, the SURF (Speeded-Up Robust Features) algorithm, and the VLAD (Vector of Locally Aggregated Descriptors) algorithm;
4. Based on the image features of an image, obtain the integer feature vector of the image; based on the image features of the frame images, obtain the integer feature vector of the video, which can be determined in the ways described above and is not repeated here. In the disclosure, the integer feature vector is the feature vector obtained by binarizing the image features, for example a string of 0s and 1s.
5. Based on the integer feature vectors of the images and the integer feature vectors of the videos, establish the inverted index. To distinguish the videos in the database, each video can be identified by a video identifier, for example a video serial number; likewise, to distinguish the images in the database, each image can be identified by an image identifier, for example an image serial number.
As an example, the inverted index can be as shown in Table 1:
Table 1
Vector dimension value    Image serial number      Video serial number
A1                        Image 1                  Video 20
A2                        Image 10                 Video 67
...                       ...                      ...
A100                      Image 1 and image 5      Video 10 and video 20
...                       ...                      ...
In the inverted index, the images or videos listed for a vector dimension value of an integer feature vector are the ones regarded as similar to the data corresponding to that integer feature vector. As shown in Table 1, the vector dimension value A1 occurs in image 1 and video 20, so the candidate data of the data (video or image) corresponding to the integer feature vector are image 1 and video 20; the vector dimension value A2 occurs in image 10 and video 67, so the candidate data are image 10 and video 67; the vector dimension value A100 occurs in image 1, image 5, video 10 and video 20, so the candidate data are image 1, image 5, video 10 and video 20. Thus, once the integer feature vector of some reference data is determined, the images and videos corresponding to that integer feature vector, that is, its candidate data, can be determined from the inverted index.
It should be noted that, in practical applications, separate inverted indexes can be established for images and for videos, or images and videos can be put together in a single inverted index; the disclosure does not limit this.
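The index construction and lookup described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the index is keyed by a (dimension, value) pair, which is one possible reading of the "vector dimension value" column of Table 1, images and videos share a single index, and the identifiers such as "image_1" and the function names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Sequence, Set, Tuple

Index = Dict[Tuple[int, int], Set[str]]


def build_inverted_index(vectors: Dict[str, Sequence[int]]) -> Index:
    """Map each (dimension, value) pair to the IDs of the data whose vector contains it."""
    index: Index = defaultdict(set)
    for data_id, vector in vectors.items():
        for dimension, value in enumerate(vector):
            index[(dimension, value)].add(data_id)
    return index


def find_candidates(index: Index, query_vector: Sequence[int]) -> Set[str]:
    """Collect every ID that shares at least one (dimension, value) pair with the query."""
    candidates: Set[str] = set()
    for dimension, value in enumerate(query_vector):
        candidates |= index.get((dimension, value), set())
    return candidates


# Integer feature vectors of the data to be compared (images and videos mixed).
database = {
    "image_1": [3, 8, 5],
    "video_20": [3, 1, 4],
    "image_10": [6, 2, 7],
}
index = build_inverted_index(database)
candidates = find_candidates(index, [3, 9, 0])  # agrees with two entries in dimension 0
```

With this layout, a lookup touches only the index entries for the values of the query vector, rather than the vector of every item in the database.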
In embodiments of the disclosure, in step S130, determining, according to the third feature vector of the second data type of the reference data and the fourth feature vector of the second data type of each item in the candidate data, the data that duplicates the reference data from the candidate data may include:
Determining the feature similarity between the third feature vector and the fourth feature vector;
Determining, as data that duplicates the reference data, the candidate data whose feature similarity is greater than a similarity threshold.
Specifically, the data that duplicates the reference data can be determined from the candidate data by the similarity between feature vectors. In practical applications, a preset similarity threshold can be configured: if the feature similarity between two data items is not less than the similarity threshold, the two items can be regarded as similar; if the feature similarity is less than the similarity threshold, the two items can be regarded as dissimilar.
In embodiments of the disclosure, determining the feature similarity between the third feature vector and the fourth feature vector may include:
Determining the cosine distance between the third feature vector and the fourth feature vector;
Determining the feature similarity according to the cosine distance.
Specifically, the similarity between two feature vectors can be determined from the cosine distance between them. In general, the larger the cosine distance between two feature vectors, the lower their similarity; conversely, the smaller the cosine distance, the higher their similarity, that is, the more similar they are.
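A minimal Python sketch of this step follows. The disclosure does not give the formula that maps cosine distance to feature similarity; the sketch assumes the common convention distance = 1 - cos(u, v) and similarity = 1 - distance, and the threshold of 0.9 and the function names are hypothetical.

```python
import math
from typing import Sequence


def cosine_distance(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine distance 1 - cos(u, v); defined here as 1.0 when either vector is all zeros."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)


def feature_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Assumed mapping: a smaller cosine distance gives a higher similarity."""
    return 1.0 - cosine_distance(u, v)


def is_duplicate(u: Sequence[float], v: Sequence[float], threshold: float = 0.9) -> bool:
    """Treat two items as duplicates when their similarity exceeds the threshold."""
    return feature_similarity(u, v) > threshold
```

Whether the comparison with the threshold is strict is a configuration choice; the sketch uses a strict comparison.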
In embodiments of the disclosure, if the reference data is a first image and the candidate data includes a first video, determining the feature similarity between the third feature vector and the fourth feature vector based on the third feature vector and the fourth feature vector may include at least one of the following modes:
First mode: based on the first floating-point feature vector of the first image and the second floating-point feature vectors of the frame images to be processed in the first video, determine a first similarity between the first floating-point feature vector and the second floating-point feature vector of each frame image to be processed; determine the average of the first similarities as the feature similarity.
Second mode: based on the second floating-point feature vector of each frame image to be processed in the first video, determine the average of the second floating-point feature vectors of the frame images to obtain a third floating-point feature vector; determine the similarity between the third floating-point feature vector and the first floating-point feature vector as the feature similarity.
Specifically, the frame images to be processed in the first video can be obtained in the ways described above, including but not limited to any of the following: uniform frame sampling of the first video, key-frame extraction, or using all frame images of the first video as the images to be processed. The specific acquisition methods are not repeated here.
If the reference data is a second video and the candidate data includes a second image, determining the feature similarity between the third feature vector and the fourth feature vector may include at least one of the following modes:
First mode: based on the first floating-point feature vector of each frame image to be processed in the second video and the second floating-point feature vector of the second image, determine a second similarity between the first floating-point feature vector of each frame image to be processed and the second floating-point feature vector; determine the average of the second similarities as the feature similarity.
Second mode: based on the first floating-point feature vector of each frame image to be processed in the second video, determine the average of the first floating-point feature vectors of the frame images to obtain a fourth floating-point feature vector; determine the similarity between the fourth floating-point feature vector and the second floating-point feature vector as the feature similarity.
Specifically, the frame images to be processed in the second video can be obtained in the ways described above, including but not limited to any of the following: uniform frame sampling of the second video, key-frame extraction, or using all frame images of the second video as the images to be processed. The specific acquisition methods are not repeated here.
If the reference data is a third video and the candidate data includes a fourth video, determining the feature similarity between the third feature vector and the fourth feature vector may include at least one of the following modes:
First mode: based on the first floating-point feature vector of each frame image to be processed in the third video and the second floating-point feature vector of each frame image to be processed in the fourth video, determine the similarity between each first floating-point feature vector and each second floating-point feature vector to obtain third similarities, and determine the average of the third similarities as the feature similarity.
Second mode: based on the first floating-point feature vector of each frame image to be processed in the third video, determine the average of the first floating-point feature vectors of those frame images to obtain a fifth floating-point feature vector; based on the second floating-point feature vector of each frame image to be processed in the fourth video, determine the average of the second floating-point feature vectors of those frame images to obtain a sixth floating-point feature vector; determine the similarity between the fifth floating-point feature vector and the sixth floating-point feature vector as the feature similarity.
Specifically, the frame images to be processed in the third video and the fourth video can be obtained in the ways described above, including but not limited to any of the following: uniform frame sampling of the third video or the fourth video, key-frame extraction, or using all frame images of the third video or the fourth video as the corresponding images to be processed. The specific acquisition methods are not repeated here.
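The two modes used in the image-to-video, video-to-image and video-to-video cases can be sketched with one pair of Python functions, treating an image as a list containing a single floating-point feature vector. Cosine similarity is assumed as the underlying similarity measure, and the function names are hypothetical.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def cosine_similarity(u: Vector, v: Vector) -> float:
    """cos(u, v); defined here as 0.0 when either vector is all zeros."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def mean_vector(vectors: Sequence[Vector]) -> List[float]:
    """Element-wise mean of equally sized vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def mean_of_pairwise_similarities(ref_vectors: Sequence[Vector],
                                  cand_vectors: Sequence[Vector]) -> float:
    """First mode: average the similarity over every (reference, candidate) vector pair."""
    sims = [cosine_similarity(r, c) for r in ref_vectors for c in cand_vectors]
    return sum(sims) / len(sims)


def similarity_of_mean_vectors(ref_vectors: Sequence[Vector],
                               cand_vectors: Sequence[Vector]) -> float:
    """Second mode: average the vectors on each side first, then compare the two means."""
    return cosine_similarity(mean_vector(ref_vectors), mean_vector(cand_vectors))


image = [[1.0, 0.0]]                     # one vector for the image
video = [[1.0, 0.0], [0.0, 1.0]]         # one vector per frame image to be processed
mode1 = mean_of_pairwise_similarities(image, video)
mode2 = similarity_of_mean_vectors(image, video)
```

The first mode computes one similarity per pair of vectors, so its cost grows with the product of the two frame counts; the second mode needs a single comparison after averaging. The two modes generally give different values for the same inputs.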
In embodiments of the disclosure, the reference data is data in a first database, and the data to be compared is the data in the first database other than the reference data;
Alternatively,
Obtaining the reference data may include:
Obtaining a search keyword;
Obtaining a search result from a second database according to the search keyword, where the reference data is data in the search result, and the data to be compared is the data in the search result other than the reference data, or is the data in the second database.
Specifically, the first database and the second database may be the same database or different databases; the disclosure does not limit this. The search keyword may be a keyword provided by a user in real time, or a preconfigured keyword, for example a keyword on a blacklist, such as a prohibited word or a sensitive word.
In embodiments of the disclosure, if the reference data is data in the first database, after the data that duplicates the reference data is determined, the method may further include:
De-duplicating the first database based on the data that duplicates the reference data;
If the data to be compared is the data in the search result other than the reference data, after the data that duplicates the reference data is determined, the method may further include:
De-duplicating the search result based on the data that duplicates the reference data, and using the de-duplicated search result as the final search result; or deleting the reference data and the data that duplicates the reference data;
If the data to be compared is the data in the second database, after the data that duplicates the reference data is determined, the method may further include:
De-duplicating the second database based on the data that duplicates the reference data.
Specifically, for different application scenarios, the data determined to duplicate the reference data can be processed differently, as illustrated below with specific application scenarios:
First application scenario: de-duplicating the duplicate data in the first database:
In practical applications, a database may contain a large amount of duplicate data, which may directly affect the experience of using the data.
First case: when the reference data is data in the first database, after the data that duplicates the reference data is determined based on the scheme described above, the data in the first database can be de-duplicated based on the duplicate data to reduce the amount of data in the first database and thus improve the experience of using the database. It can be understood that different reference data can be determined in the same way, and the database can be de-duplicated based on the duplicate data of the different reference data.
Second case: if the data to be compared is the data in the second database, after the data that duplicates the reference data is determined based on the method described above, the data in the second database can be de-duplicated based on the duplicate data.
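One way to apply the de-duplication to a whole database is the greedy pass sketched below in Python, in which every retained item acts in turn as reference data. This is an illustrative assumption rather than a procedure stated in the disclosure: the `is_duplicate` predicate stands in for the determination method described above, and for brevity the sketch compares items directly instead of going through the candidate-data step.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def deduplicate(items: Sequence[T], is_duplicate: Callable[[T, T], bool]) -> List[T]:
    """Keep an item only if it duplicates none of the items already kept."""
    kept: List[T] = []
    for item in items:
        if not any(is_duplicate(reference, item) for reference in kept):
            kept.append(item)
    return kept
```

For example, with a toy predicate that treats numbers differing by at most 1 as duplicates, `deduplicate([1, 2, 3, 5], lambda a, b: abs(a - b) <= 1)` keeps 1, drops 2 as a duplicate of 1, and keeps 3 and 5.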
Second application scenario: providing search results to a user based on a search keyword entered by the user:
In practical applications, a user usually searches for corresponding images and/or videos based on text information (a keyword). However, because the second database may contain a large amount of duplicate data, the search result obtained based on the search keyword includes many duplicate items, which degrades the user experience.
With the scheme of the disclosure, the search result corresponding to the search keyword can first be obtained from the first database based on the search keyword; one item in the search result is then selected arbitrarily as the reference data; the data that duplicates the reference data is then determined in the search result by the method described above; and the data in the search result other than the duplicate data is shown to the user as the final search result. The search result obtained in this way contains no duplicate data, which improves the user's search experience.
As an example, if the keyword entered by the user is "Liu Jialing", then based on this keyword, some data in the first database that includes the keyword "Liu Jialing" can be found (hereinafter referred to as the first search result). The first search result may include images and videos about Liu Jialing, and many of these images and videos are similar to one another; showing the first search result to the user as it is may degrade the user's search experience. With the scheme of the disclosure, an image or video is randomly selected from the first search result as the reference data, the data that duplicates the reference data is determined based on the reference data by the method described above, and the data in the first search result other than the duplicate data is then shown to the user as the final search result. The final search result can include both videos and images, giving the user a better search experience.
Third application scenario: taking down the data corresponding to a keyword entered by a user:
In practical applications, if a maintainer of a database wants to take down certain data stored in the database, the maintainer usually searches the database for the corresponding data based on a keyword and then takes those data down. However, the data corresponding to the keyword cannot be found accurately based on the keyword alone, which may cause some data that does not correspond to the keyword to be taken down.
With the scheme of the disclosure, the data corresponding to the keyword (hereinafter referred to as the second search result) can first be obtained from the database based on the keyword entered by the user; one item in the second search result is then selected arbitrarily as the reference data; the data that duplicates the reference data is then determined by the method described above; finally, the duplicate data and the reference data are taken down from the database, that is, deleted. With this scheme, the data in the database that corresponds to the keyword and needs to be taken down can be found accurately.
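The second and third application scenarios handle the same duplicate set in opposite ways, which the following Python sketch makes explicit. The function names, the `ref_index` parameter and the `is_duplicate` predicate are hypothetical; the predicate stands in for the determination method described above.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def final_search_result(results: Sequence[T], ref_index: int,
                        is_duplicate: Callable[[T, T], bool]) -> List[T]:
    """Scenario 2: keep the reference item and drop every item that duplicates it."""
    reference = results[ref_index]
    return [item for i, item in enumerate(results)
            if i == ref_index or not is_duplicate(reference, item)]


def items_to_take_down(results: Sequence[T], ref_index: int,
                       is_duplicate: Callable[[T, T], bool]) -> List[T]:
    """Scenario 3: select the reference item together with all of its duplicates."""
    reference = results[ref_index]
    return [item for i, item in enumerate(results)
            if i == ref_index or is_duplicate(reference, item)]


# Toy data: items whose names start with the same letter count as duplicates.
results = ["a1", "a2", "b1", "a3"]
same_group = lambda x, y: x[0] == y[0]
shown = final_search_result(results, 0, same_group)
removed = items_to_take_down(results, 0, same_group)
```

In the toy data, the user would be shown "a1" and "b1", while a takedown of "a1" would select "a1", "a2" and "a3" and leave "b1" in place.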
Based on the same principle as the method shown in Fig. 1, embodiments of the disclosure further provide an apparatus 20. As shown in Fig. 2, the apparatus 20 may include a data acquisition module 210, a candidate data determining module 220 and a repeated data determining module 230, where:
The data acquisition module 210 is configured to obtain reference data, the reference data being an image or a video;
The candidate data determining module 220 is configured to determine, based on the first feature vector of the first data type of the reference data and the second feature vector of the first data type of each item of data to be compared, the candidate data in the data to be compared that duplicates the reference data, where the data to be compared includes at least one of an image and a video;
The repeated data determining module 230 is configured to determine, according to the third feature vector of the second data type of the reference data and the fourth feature vector of the second data type of each item in the candidate data, the data that duplicates the reference data from the candidate data, where the precision of data of the second data type is higher than the precision of data of the first data type.
In the scheme of the embodiments of the disclosure, both the feature vectors of the first data type (the first feature vector and the second feature vector) and the feature vectors of the second data type (the third feature vector and the fourth feature vector) can reflect the characteristics of an image. The precision of the feature vectors of the second data type is higher than that of the feature vectors of the first data type, so the feature vectors of the second data type reflect the characteristics of an image in more detail. Therefore, the candidate data that duplicates the reference data can first be determined from the data to be compared based on the first feature vector of the reference data and the second feature vector of each item of data to be compared, which improves data processing efficiency; then, based on the third feature vector of the reference data and the fourth feature vector of each item in the candidate data, the data that duplicates the reference data can be determined accurately from the candidate data. Thus, with the scheme of the disclosure, the data that duplicates the reference data can be determined quickly and accurately from the data to be compared. In addition, the scheme of the disclosure makes it possible to determine duplicate data from the data to be compared (mixed data that includes both videos and images) based on the features of an image or a video (a single kind of feature).
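The coarse-then-fine flow of modules 220 and 230 can be sketched end to end in Python. The sketch makes several assumptions that are not fixed by the disclosure: each item is stored as a pair (integer vector, floating-point vector), the coarse stage scans the items directly instead of using an inverted index, cosine similarity is the fine measure, and the threshold of 0.9, the identifiers and the function names are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple

Item = Tuple[Sequence[int], Sequence[float]]  # (integer vector, floating-point vector)


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """cos(u, v); defined here as 0.0 when either vector is all zeros."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def find_duplicates(reference: Item, database: Dict[str, Item],
                    threshold: float = 0.9) -> List[str]:
    """Stage 1 filters by shared integer values; stage 2 confirms with float vectors."""
    ref_int, ref_float = reference
    # Stage 1 (coarse, low precision): same value in at least one dimension.
    candidates = [data_id for data_id, (int_vec, _) in database.items()
                  if any(a == b for a, b in zip(ref_int, int_vec))]
    # Stage 2 (fine, high precision): similarity of the floating-point vectors.
    return [data_id for data_id in candidates
            if cosine_similarity(ref_float, database[data_id][1]) > threshold]


reference = ([1, 0, 1], [0.9, 0.1, 0.8])
database = {
    "image_1": ([1, 0, 1], [0.88, 0.12, 0.79]),  # passes both stages
    "video_2": ([1, 1, 0], [0.1, 0.9, 0.1]),     # passes stage 1 only
    "image_3": ([0, 1, 0], [0.1, 0.8, 0.2]),     # rejected at stage 1
}
duplicates = find_duplicates(reference, database)
```

Only the items that survive the cheap integer comparison reach the more expensive floating-point comparison, which is where the efficiency gain described above comes from.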
In embodiments of the disclosure, the first data type is an integer type, and the second data type is a floating-point type.
In embodiments of the disclosure, when determining, based on the first feature vector of the first data type of the reference data and the second feature vector of the first data type of each item of data to be compared, the candidate data in the data to be compared that duplicates the reference data, the candidate data determining module 220 is specifically configured to:
Determine, as candidate data, the data corresponding to a second feature vector that has the same value as the first feature vector in at least one dimension.
In embodiments of the disclosure, when determining, as candidate data, the data corresponding to a second feature vector that has the same value as the first feature vector in at least one dimension, the candidate data determining module 220 is specifically configured to:
Obtain an inverted index, the inverted index being established based on the second feature vectors;
Based on the first feature vector and the inverted index, determine, as candidate data, the data corresponding to a second feature vector that has the same value as the first feature vector in at least one dimension.
In embodiments of the disclosure, when determining, according to the third feature vector of the second data type of the reference data and the fourth feature vector of the second data type of each item in the candidate data, the data that duplicates the reference data from the candidate data, the repeated data determining module 230 is specifically configured to:
Determine the feature similarity between the third feature vector and the fourth feature vector;
Determine, as data that duplicates the reference data, the candidate data whose feature similarity is greater than a similarity threshold.
In embodiments of the disclosure, when determining the feature similarity between the third feature vector and the fourth feature vector, the repeated data determining module 230 is specifically configured to:
Determine the cosine distance between the third feature vector and the fourth feature vector;
Determine the feature similarity according to the cosine distance.
In embodiments of the disclosure, the reference data is data in a first database, and the data to be compared is the data in the first database other than the reference data;
Alternatively,
When obtaining the reference data, the data acquisition module 210 is specifically configured to:
Obtain a search keyword;
Obtain a search result from a second database according to the search keyword, where the reference data is data in the search result, and the data to be compared is the data in the search result other than the reference data, or is the data in the second database.
In embodiments of the disclosure, if the reference data is data in the first database, after the data that duplicates the reference data is determined, the apparatus further includes:
A first data processing module, configured to de-duplicate the first database based on the data that duplicates the reference data;
If the data to be compared is the data in the search result other than the reference data, after the data that duplicates the reference data is determined, the apparatus further includes:
A second data processing module, configured to de-duplicate the search result based on the data that duplicates the reference data and use the de-duplicated search result as the final search result, or to delete the reference data and the data that duplicates the reference data;
If the data to be compared is the data in the second database, after the data that duplicates the reference data is determined, the apparatus further includes:
A third data processing module, configured to de-duplicate the second database based on the data that duplicates the reference data.
In embodiments of the disclosure, if the reference data is a video, the apparatus further includes a feature vector determining module, configured to determine the first feature vector of the reference data in the following way:
Extract the image features of the frame images in the video;
Based on the extracted image features of the frame images, determine the first feature vector and the third feature vector of the video.
In embodiments of the disclosure, when determining the first feature vector and the third feature vector of the video based on the extracted image features of the frame images, the feature vector determining module is specifically configured to:
Based on the extracted image features of the frame images, determine the average of the image features of the frame images to obtain an average image feature; based on the average image feature, determine the first feature vector and the third feature vector of the video;
Alternatively,
Based on the extracted image features of the frame images, determine the first feature vector and the third feature vector of each frame image; determine the average of the first feature vectors of the frame images as the first feature vector of the video, and determine the average of the third feature vectors of the frame images as the third feature vector of the video.
The apparatus for determining repeated data in the embodiments of the disclosure can perform the method for determining repeated data shown in Fig. 1, and the implementation principles are similar. The actions performed by the modules of the apparatus in the embodiments of the disclosure correspond to the steps of the method for determining repeated data in the embodiments of the disclosure; for a detailed functional description of each module of the apparatus, refer to the description of the corresponding method above, which is not repeated here.
Based on the same principle as the method in the embodiments of the disclosure, the disclosure provides an electronic device that includes a processor and a memory; the memory is configured to store operation instructions; the processor is configured to perform, by invoking the operation instructions, the method shown in any embodiment of the method of the disclosure.
Based on the same principle as the method in the embodiments of the disclosure, the disclosure provides a computer-readable storage medium that stores at least one instruction, at least one program, a code set or an instruction set, which is loaded and executed by a processor to implement the method shown in any embodiment of the data processing method of the disclosure.
In an embodiment of the disclosure, Fig. 3 shows a schematic structural diagram of an electronic device 50 (for example, a terminal device or server implementing the method shown in Fig. 1) suitable for implementing the embodiments of the disclosure. The electronic device in the embodiments of the disclosure may include, but is not limited to, mobile terminals such as a mobile phone, a laptop computer, a digital broadcast receiver, a PDA (personal digital assistant), a PAD (tablet computer), a PMP (portable media player) and a vehicle-mounted terminal (for example, a vehicle navigation terminal), and fixed terminals such as a digital TV and a desktop computer. The electronic device shown in Fig. 3 is only an example and should not impose any limitation on the functions and scope of use of the embodiments of the disclosure.
As shown in Fig. 3, the electronic device 50 may include a processing unit (for example, a central processing unit or a graphics processor) 501, which can perform various appropriate actions and processing according to a program stored in a read-only memory (ROM) 502 or a program loaded from a storage device 508 into a random access memory (RAM) 503. The RAM 503 also stores various programs and data needed for the operation of the electronic device 50. The processing unit 501, the ROM 502 and the RAM 503 are connected to one another by a bus 504. An input/output (I/O) interface 505 is also connected to the bus 504.
In general, the following devices can be connected to the I/O interface 505: an input device 506 including, for example, a touch screen, a touch pad, a keyboard, a mouse, a camera, a microphone, an accelerometer and a gyroscope; an output device 507 including, for example, a liquid crystal display (LCD), a loudspeaker and a vibrator; a storage device 508 including, for example, a magnetic tape and a hard disk; and a communication device 509. The communication device 509 can allow the electronic device 50 to communicate with other devices wirelessly or by wire to exchange data. Although Fig. 3 shows the electronic device 50 with various devices, it should be understood that it is not required to implement or have all of the devices shown; more or fewer devices may alternatively be implemented or provided.
In particular, according to an embodiment of the present disclosure, the process described above with reference to the flowchart may be implemented as a computer software program. For example, an embodiment of the present disclosure includes a computer program product comprising a computer program carried on a computer-readable medium, the computer program containing program code for performing the method shown in the flowchart. In such an embodiment, the computer program may be downloaded and installed from a network via the communication device 509, installed from the storage device 508, or installed from the ROM 502. When the computer program is executed by the processing unit 501, the functions defined in the method of the embodiments of the present disclosure are performed.
It should be noted that the computer-readable medium of the present disclosure may be a computer-readable signal medium, a computer-readable storage medium, or any combination of the two. A computer-readable storage medium may be, for example, but is not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the above. More specific examples of a computer-readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above. In the present disclosure, a computer-readable storage medium may be any tangible medium that contains or stores a program, and the program may be used by or in connection with an instruction execution system, apparatus, or device. A computer-readable signal medium, by contrast, may include a data signal propagated in baseband or as part of a carrier wave, carrying computer-readable program code. Such a propagated data signal may take various forms, including but not limited to an electromagnetic signal, an optical signal, or any suitable combination of the above. A computer-readable signal medium may also be any computer-readable medium other than a computer-readable storage medium that can send, propagate, or transmit a program for use by or in connection with an instruction execution system, apparatus, or device. Program code contained on a computer-readable medium may be transmitted over any suitable medium, including but not limited to: electrical wire, optical cable, RF (radio frequency), or any suitable combination of the above.
The computer-readable medium may be included in the electronic device described above, or it may exist separately without being assembled into the electronic device.
The computer-readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to perform the method shown in the above method embodiments.
Computer program code for performing the operations of the present disclosure may be written in one or more programming languages or a combination thereof, including object-oriented programming languages such as Java, Smalltalk, and C++, and conventional procedural programming languages such as the "C" language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. Where a remote computer is involved, it may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or it may be connected to an external computer (for example, through the Internet using an Internet service provider).
The flowcharts and block diagrams in the accompanying drawings illustrate the possible architecture, functions, and operations of systems, methods, and computer program products according to various embodiments of the present disclosure. In this regard, each block in a flowchart or block diagram may represent a module, a program segment, or a portion of code that contains one or more executable instructions for implementing the specified logical function. It should also be noted that, in some alternative implementations, the functions noted in the blocks may occur in an order different from that noted in the drawings. For example, two blocks shown in succession may in fact be executed substantially in parallel, or they may sometimes be executed in the reverse order, depending on the functions involved. It should further be noted that each block in the block diagrams and/or flowcharts, and combinations of blocks in the block diagrams and/or flowcharts, may be implemented by a dedicated hardware-based system that performs the specified functions or operations, or by a combination of dedicated hardware and computer instructions.
The units described in the embodiments of the present disclosure may be implemented in software or in hardware. In some cases the name of a unit does not constitute a limitation on the unit itself; for example, a first obtaining unit may also be described as "a unit that obtains at least two Internet Protocol addresses".
The above description is only a preferred embodiment of the present disclosure and an explanation of the technical principles applied. Those skilled in the art should understand that the scope of disclosure involved herein is not limited to technical solutions formed by the specific combination of the above technical features. It should also cover other technical solutions formed by any combination of the above technical features or their equivalents without departing from the above disclosed concept, for example, technical solutions formed by replacing the above features with (but not limited to) technical features having similar functions disclosed in the present disclosure.

Claims (12)

1. A method for determining duplicate data, characterized by comprising:
obtaining reference data, the reference data being an image or a video;
determining, based on a first feature vector of a first data type of the reference data and a second feature vector of the first data type of each piece of data to be compared, candidate data in the data to be compared that duplicates the reference data, wherein the data to be compared includes at least one of an image and a video;
determining, from the candidate data, data that duplicates the reference data according to a third feature vector of a second data type of the reference data and a fourth feature vector of the second data type of each piece of data in the candidate data, wherein the precision of data of the second data type is higher than the precision of data of the first data type.
2. The method according to claim 1, wherein the first data type is an integer type and the second data type is a floating-point type.
3. The method according to claim 1, wherein determining, based on the first feature vector of the first data type of the reference data and the second feature vector of the first data type of each piece of data to be compared, the candidate data in the data to be compared that duplicates the reference data comprises:
determining, as the candidate data, the data corresponding to each second feature vector that has the same value as the first feature vector in at least one dimension.
4. The method according to claim 3, wherein determining, as the candidate data, the data corresponding to each second feature vector that has the same value as the first feature vector in at least one dimension comprises:
obtaining an inverted index, the inverted index being built from the second feature vectors;
determining, based on the first feature vector and the inverted index, the data corresponding to each second feature vector that has the same value as the first feature vector in at least one dimension as the candidate data.
5. The method according to any one of claims 1 to 4, wherein determining, from the candidate data, the data that duplicates the reference data according to the third feature vector of the second data type of the reference data and the fourth feature vector of the second data type of each piece of data in the candidate data comprises:
determining a feature similarity between the third feature vector and each fourth feature vector;
determining the candidate data whose feature similarity is greater than a similarity threshold as the data that duplicates the reference data.
6. The method according to any one of claims 1 to 4, wherein:
the reference data is data in a first database, and the data to be compared are the data in the first database other than the reference data;
or,
obtaining the reference data comprises:
obtaining a search keyword;
obtaining search results from a second database according to the search keyword, wherein the reference data is data in the search results, and the data to be compared are the data in the search results other than the reference data, or are data in the second database.
7. The method according to claim 6, wherein:
if the reference data is data in the first database, after the data that duplicates the reference data is determined, the method further comprises:
deduplicating the first database based on the data that duplicates the reference data;
if the data to be compared are the data in the search results other than the reference data, after the data that duplicates the reference data is determined, the method further comprises:
deduplicating the search results based on the data that duplicates the reference data, and using the deduplicated search results as the final search results; or, deleting the reference data and the data that duplicates the reference data;
if the data to be compared are data in the second database, after the data that duplicates the reference data is determined, the method further comprises:
deduplicating the second database based on the data that duplicates the reference data.
8. The method according to any one of claims 1 to 4, wherein, if the reference data is a video, the first feature vector and the third feature vector of the reference data are determined as follows:
extracting image features of frame images in the video;
determining the first feature vector and the third feature vector of the video based on the image features of the extracted frame images.
9. The method according to claim 8, wherein determining the first feature vector and the third feature vector of the video based on the image features of the extracted frame images comprises:
determining, based on the image features of the extracted frame images, the average of the image features of the frame images to obtain an average image feature, and determining the first feature vector and the third feature vector of the video based on the average image feature;
or,
determining, based on the image features of the extracted frame images, a first feature vector and a third feature vector of each frame image; determining the average of the first feature vectors of the frame images as the first feature vector of the video, and determining the average of the third feature vectors of the frame images as the third feature vector of the video.
10. An apparatus for determining duplicate data, characterized by comprising:
a data obtaining module, configured to obtain reference data, the reference data being an image or a video;
a candidate data determining module, configured to determine, based on a first feature vector of a first data type of the reference data and a second feature vector of the first data type of each piece of data to be compared, candidate data in the data to be compared that duplicates the reference data, wherein the data to be compared includes at least one of an image and a video;
a duplicate data determining module, configured to determine, from the candidate data, data that duplicates the reference data according to a third feature vector of a second data type of the reference data and a fourth feature vector of the second data type of each piece of data in the candidate data, wherein the precision of data of the second data type is higher than the precision of data of the first data type.
11. An electronic device, characterized by comprising:
a processor and a memory;
the memory being configured to store computer operation instructions;
the processor being configured to perform the method according to any one of claims 1 to 9 by invoking the computer operation instructions.
12. A computer-readable storage medium, characterized in that the storage medium stores at least one instruction, at least one program, a code set, or an instruction set, and the at least one instruction, the at least one program, the code set, or the instruction set is loaded and executed by the processor to implement the method according to any one of claims 1 to 9.
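
The two-stage flow recited in claims 1 to 5 and the frame averaging of claim 9 can be illustrated with a minimal Python sketch. It is not the patented implementation. The function names, the `quantize` mapping from floating-point to integer features, the `scale` parameter, the use of cosine similarity as the feature similarity, and the default threshold of 0.95 are all assumptions made for illustration; the claims only require an integer-type vector for coarse recall, an inverted index built from those vectors, and a higher-precision vector compared against a similarity threshold.

```python
from collections import defaultdict
from math import sqrt


def quantize(vec, scale=4):
    """Assumed mapping from a float feature vector to an integer one."""
    return [int(round(x * scale)) for x in vec]


def build_inverted_index(int_vectors):
    """Map each (dimension, value) pair to the ids whose integer vector has it."""
    index = defaultdict(set)
    for item_id, vec in int_vectors.items():
        for dim, value in enumerate(vec):
            index[(dim, value)].add(item_id)
    return index


def find_candidates(ref_int, index):
    """Stage 1: ids sharing the same value with the reference in >= 1 dimension."""
    candidates = set()
    for dim, value in enumerate(ref_int):
        candidates |= index.get((dim, value), set())
    return candidates


def cosine(a, b):
    """Cosine similarity, used here as one possible feature similarity."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def find_duplicates(ref_float, float_vectors, index, threshold=0.95, scale=4):
    """Stage 2: keep candidates whose float similarity exceeds the threshold."""
    candidates = find_candidates(quantize(ref_float, scale), index)
    return sorted(
        item_id for item_id in candidates
        if cosine(ref_float, float_vectors[item_id]) > threshold
    )


def video_feature(frame_features):
    """Average per-frame features into one video-level feature (claim 9, option 1)."""
    count = len(frame_features)
    return [sum(column) / count for column in zip(*frame_features)]


if __name__ == "__main__":
    # Hypothetical float features for three items to be compared.
    library = {"a": [0.9, 0.1, 0.0], "b": [0.0, 1.0, 0.0], "c": [0.88, 0.12, 0.01]}
    index = build_inverted_index({k: quantize(v) for k, v in library.items()})
    print(find_duplicates([0.9, 0.1, 0.0], library, index))
```

The integer stage only narrows the search: any item that matches the reference in a single dimension becomes a candidate, so the floating-point comparison is what decides whether a candidate counts as a duplicate. For a video, `video_feature` would be applied to the per-frame features before quantizing and indexing.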
CN201910723196.XA 2019-08-06 2019-08-06 Method and device for determining repeated data, electronic equipment and computer storage medium Active CN110413603B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910723196.XA CN110413603B (en) 2019-08-06 2019-08-06 Method and device for determining repeated data, electronic equipment and computer storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910723196.XA CN110413603B (en) 2019-08-06 2019-08-06 Method and device for determining repeated data, electronic equipment and computer storage medium

Publications (2)

Publication Number Publication Date
CN110413603A (en) 2019-11-05
CN110413603B (en) 2023-02-24

Family

ID=68366280

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910723196.XA Active CN110413603B (en) 2019-08-06 2019-08-06 Method and device for determining repeated data, electronic equipment and computer storage medium

Country Status (1)

Country Link
CN (1) CN110413603B (en)

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP0990997A1 (en) * 1998-09-29 2000-04-05 Eastman Kodak Company A method for controlling and managing redundancy in an image database by elimination of automatically detected exact duplicate and near duplicate images
CN101034442A (en) * 2006-03-08 2007-09-12 刘欣融 System for judging between identical and proximate goods appearance design based on pattern recognition
WO2008062145A1 (en) * 2006-11-22 2008-05-29 Half Minute Media Limited Creating fingerprints
CN102930537A (en) * 2012-10-23 2013-02-13 深圳市宜搜科技发展有限公司 Image detection method and system
CN106375781A (en) * 2015-07-23 2017-02-01 无锡天脉聚源传媒科技有限公司 Method and device for judging duplicate video
CN108153882A (en) * 2017-12-26 2018-06-12 中兴通讯股份有限公司 A kind of data processing method and device
CN108665441A (en) * 2018-03-30 2018-10-16 北京三快在线科技有限公司 A kind of Near-duplicate image detection method and device, electronic equipment
CN108875062A (en) * 2018-06-26 2018-11-23 北京奇艺世纪科技有限公司 A kind of determination method and device repeating video
CN110019907A (en) * 2017-12-01 2019-07-16 北京搜狗科技发展有限公司 A kind of image search method and device

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
YI-CHEN LEE.ETAL: "To Repeat or Not to Repeat?: Redesigning Repeating Auditory Alarms Based on EEG Analysis", 《PROCEEDINGS OF THE 2019 CHI CONFERENCE ON HUMAN FACTORS IN COMPUTING SYSTEMS》 *
韩逢庆等: "海量图片快速去重技术", 《计算机应用》 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113656393A (en) * 2021-08-24 2021-11-16 北京百度网讯科技有限公司 Data processing method, data processing device, electronic equipment and storage medium
CN113656393B (en) * 2021-08-24 2024-01-12 北京百度网讯科技有限公司 Data processing method, device, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN110413603B (en) 2023-02-24

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant