CN115049962A - Video clothing detection method, device and equipment

Info

Publication number
CN115049962A
CN115049962A
Authority
CN
China
Prior art keywords
clothing
video frame
target
preset
video
Prior art date
Legal status
Pending
Application number
CN202210716242.5A
Other languages
Chinese (zh)
Inventor
于博文
刘思诚
张伟
旷章辉
冯俐铜
王新江
李治中
Current Assignee
Sensetime Group Ltd
Original Assignee
Sensetime Group Ltd
Priority date
Filing date
Publication date
Application filed by Sensetime Group Ltd
Priority to CN202210716242.5A
Publication of CN115049962A
Legal status: Pending

Classifications

    • G06V20/41 Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G06V10/54 Extraction of image or video features relating to texture
    • G06V10/56 Extraction of image or video features relating to colour
    • G06V10/762 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using clustering, e.g. of similar faces in social networks
    • G06V10/764 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G06V20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G06V20/49 Segmenting video sequences, i.e. computational techniques such as parsing or cutting the sequence, low-level clustering or determining units such as shots or scenes
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/171 Local features and components; Facial parts; Occluding parts, e.g. glasses; Geometrical relationships

Abstract

The embodiment of the disclosure discloses a video clothing detection method, device and equipment, wherein the method comprises the following steps: partitioning a plurality of video frames of a video to be processed, and determining partition information corresponding to each video frame; detecting each video frame to obtain a detection result corresponding to a target person in each video frame, the detection result comprising person information of the target person and clothing information corresponding to the target person in the video frame; and correcting the detection results of video frames in the same partition based on the partition information corresponding to each video frame to obtain a target detection result corresponding to each video frame, the target detection results corresponding to the video frames in the same partition being the same with respect to the target person.

Description

Video clothing detection method, device and equipment
Technical Field
The present disclosure relates to, but not limited to, the field of image processing technologies, and in particular, to a method, an apparatus, and a device for detecting video apparel.
Background
With the rapid popularization of the internet and the rise and development of electronic commerce, image analysis technology based on computer vision has developed at an unprecedented pace. For clothing pictures of models or pictures taken by ordinary users, descriptive information about the worn clothing, such as category, color, texture and neckline, needs to be acquired. In the related art, various attribute tags are added to a captured clothing picture manually, so that related clothing can be searched based on these attribute tags. Because different users perceive clothing attributes differently, it is difficult to determine standard attributes for the clothing in a clothing picture, which in turn makes it difficult to search for related clothing.
Disclosure of Invention
In view of this, the disclosed embodiments at least provide a video apparel detection method, apparatus, device, storage medium and program product.
The technical scheme of the embodiment of the disclosure is realized as follows:
in one aspect, an embodiment of the present disclosure provides a video apparel detection method, where the method includes:
partitioning a plurality of video frames of a video to be processed, and determining partition information corresponding to each video frame;
detecting each video frame to obtain a detection result corresponding to a target person in each video frame; the detection result comprises person information of the target person and clothing information corresponding to the target person in the video frame;
correcting the detection results of video frames in the same partition based on the partition information corresponding to each video frame to obtain a target detection result corresponding to each video frame; the target detection results corresponding to the video frames in the same partition are the same with respect to the target person.
In another aspect, an embodiment of the present disclosure provides a video apparel detection device, where the device includes:
the device comprises a partitioning module, a processing module and a processing module, wherein the partitioning module is used for partitioning a plurality of video frames of a video to be processed and determining the partition information corresponding to each video frame;
the detection module is used for detecting each video frame to obtain a detection result corresponding to a target person in each video frame; the detection result comprises person information of the target person and clothing information corresponding to the target person in the video frame;
the correction module is used for correcting the detection results of video frames in the same partition based on the partition information corresponding to each video frame to obtain a target detection result corresponding to each video frame; the target detection results corresponding to the video frames in the same partition are the same with respect to the target person.
In yet another aspect, the present disclosure provides a computer device, including a memory and a processor, where the memory stores a computer program executable on the processor, and the processor implements some or all of the steps of the above method when executing the program.
In yet another aspect, the disclosed embodiments provide a computer-readable storage medium, on which a computer program is stored, which, when executed by a processor, implements some or all of the steps of the above-described method.
In yet another aspect, the disclosed embodiments provide a computer program comprising computer readable code, which when run in a computer device, a processor in the computer device executes some or all of the steps for implementing the above method.
In yet another aspect, the disclosed embodiments provide a computer program product comprising a non-transitory computer readable storage medium storing a computer program, which when read and executed by a computer, implements some or all of the steps of the above method.
In the embodiment of the disclosure, by partitioning a plurality of video frames of a video to be processed, after the detection result of each video frame in a partition is obtained, voting statistics can be performed on the detection results of the video frames in the partition to obtain the target detection result of each video frame. This makes the detection results within a partition stable, avoids inconsistent detection results for the same target person and the same piece of clothing within a segment of video caused by changes in lighting, angle and the like, and improves the overall detection accuracy. Compared with person detection and clothing detection on a single video frame or image, the method and device can receive a video to be processed uploaded by a user and detect a plurality of video frames, achieving a wider recommendation range.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the technical aspects of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and, together with the description, serve to explain the principles of the disclosure.
Fig. 1 is a schematic flow chart illustrating an implementation of a video clothing detection method according to an embodiment of the present disclosure;
fig. 2 is a schematic flow chart illustrating an implementation of a video clothing detection method according to an embodiment of the present disclosure;
fig. 3 is a schematic flow chart illustrating an implementation of a video clothing detection method according to an embodiment of the present disclosure;
fig. 4 is a schematic flow chart illustrating an implementation of a video clothing detection method according to an embodiment of the present disclosure;
fig. 5 is a schematic flow chart illustrating an implementation of a video clothing detection method according to an embodiment of the present disclosure;
fig. 6 is a schematic flow chart illustrating an implementation of a video clothing detection method according to an embodiment of the present disclosure;
FIG. 7 is an alternative interface schematic diagram of an apparel display interface provided by embodiments of the present disclosure;
FIG. 8 is a schematic business flow diagram of a clothing shopping guide system provided by an embodiment of the present disclosure;
fig. 9A is a schematic flow chart of a first retrieval strategy provided by an embodiment of the present disclosure;
fig. 9B is a schematic flow chart of a second retrieval strategy provided by an embodiment of the present disclosure;
FIG. 10 is an interface schematic diagram of a garment shopping guide interface provided by an embodiment of the present disclosure;
fig. 11 is a schematic structural diagram of a video clothing detection apparatus provided in the embodiment of the present disclosure;
fig. 12 is a hardware entity diagram of a computer device according to an embodiment of the present disclosure.
Detailed Description
To make the objectives, technical solutions and advantages of the present disclosure clearer, the technical solutions of the present disclosure are further elaborated below with reference to the drawings and embodiments. The described embodiments should not be construed as limiting the present disclosure, and all other embodiments obtained by a person of ordinary skill in the art without creative effort shall fall within the protection scope of the present disclosure.
In the following description, reference is made to "some embodiments" which describe a subset of all possible embodiments, but it is understood that "some embodiments" may be the same subset or different subsets of all possible embodiments, and may be combined with each other without conflict. Reference to the terms "first/second/third" merely distinguishes similar objects and does not denote a particular ordering for the objects, and it is understood that "first/second/third" may, where permissible, be interchanged in a particular order or sequence so that embodiments of the present disclosure described herein can be practiced other than as specifically illustrated or described herein.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this disclosure belongs. The terminology used herein is for the purpose of describing the disclosure only and is not intended to be limiting of the disclosure.
Embodiments of the present disclosure provide a video apparel detection method, which may be performed by a processor of a computer device. The computer device refers to a device with data processing capability, such as a server, a notebook computer, a tablet computer, a desktop computer, a smart television, a set-top box, a mobile device (e.g., a mobile phone, a portable video player, a personal digital assistant, a dedicated messaging device, and a portable game device).
Fig. 1 is a schematic flow chart of an implementation of a video clothing detection method provided in an embodiment of the present disclosure, as shown in fig. 1, the method includes the following steps S101 to S103:
step S101, partitioning a plurality of video frames of a video to be processed, and determining partition information corresponding to each video frame.
In some embodiments, the plurality of video frames may be all video frames in the video to be processed, or may be a plurality of video frames extracted from the video to be processed based on a preset frame extraction rule. The preset frame extraction rule may extract frames at a preset frame extraction frequency, or extract frames based on the image quality of the video frames. After frame extraction, the person information and clothing information in the whole video to be processed can be represented by a subset of the video frames of the video to be processed, which reduces the amount of computation to a certain extent.
In some embodiments, the partitioning the plurality of video frames of the video to be processed may include: partitioning the video frames based on a plurality of preset partition points corresponding to the video to be processed; wherein the plurality of preset partition points may be a plurality of points that are uniformly distributed based on a time length (number of frames). In other embodiments, the partitioning the multiple video frames of the video to be processed may further include: and carrying out scene identification on the plurality of video frames, determining scene information of each video frame, and taking one video frame or a plurality of continuous video frames with the same scene information as a partition.
The partition information of each video frame is used to determine which partition the video frame is located in, and the number of the video frames in one partition may be one or multiple. In the case where the number of video frames in the partition is plural, the plural video frames are continuous.
Step S102, detecting each video frame to obtain a detection result corresponding to a target person in each video frame; the detection result comprises the character information of the target character and the clothing information corresponding to the target character in the video frame.
In some embodiments, for each of the plurality of video frames, person detection needs to be performed on the video frame separately, so as to obtain the person information of the target person in the video frame. Person features of the persons in the video frame can be extracted, and in a case where a person feature matches one preset person feature in a preset person feature library, the person information corresponding to the matched preset person feature is used as the person information of the person in the video frame. The person information further includes position information of the target person in the current video frame, that is, detection box information of the target person.
In some embodiments, for each of a plurality of video frames, clothing detection needs to be performed on the video frame, so as to obtain corresponding clothing information of a target person in the video frame. Detecting the detection frame information corresponding to the clothing in the video frame, and intercepting the corresponding clothing picture from the video frame based on the detection frame information corresponding to the clothing; and then, detecting the attribute information of the clothing based on the clothing picture of the clothing to obtain the corresponding clothing attribute, wherein the clothing attribute can be the category information of the clothing. The apparel information includes an apparel detection box and a corresponding apparel attribute of the apparel.
Step S103, correcting the detection result of the video frames in the same partition based on the partition information corresponding to each video frame to obtain the target detection result corresponding to each video frame; and the target detection results corresponding to the video frames in the same partition are the same aiming at the detection results of the target person.
In some embodiments, considering that the detection results corresponding to the video frames in the same partition should be uniform, for each partition, the detection results of the video frames in the partition are corrected based on the partition information of each video frame in the partition to obtain the target detection result of each video frame.
Exemplarily, in a case where one partition includes 3 video frames (a first video frame, a second video frame, and a third video frame), if the person information and clothing information of the target person corresponding to the first, second and third video frames are the same, the detection result of each video frame is not changed; if the detection result of the first video frame includes target person A1 and clothing B1, the detection result of the second video frame includes target person A1 and clothing B2, and the detection result of the third video frame includes target person A2 and clothing B1, the detection results of the video frames in the partition are different, and the detection results of the video frames in the partition need to be corrected.
In some embodiments, the number of video frames corresponding to each detection result in the partition may be counted, and the detection result corresponding to the largest number of video frames is determined as the target detection result corresponding to each video frame in the partition. That is, the number of video frames corresponding to each piece of person information in the partition is counted, and the person information with the largest number of video frames is determined as the target person information corresponding to each video frame in the partition; the number of video frames corresponding to each piece of clothing information (clothing category) in the partition is counted, and the clothing information (clothing category) with the largest number of video frames is determined as the target clothing information corresponding to each video frame in the partition.
In other embodiments, the voting weight may be assigned to each video frame within the partition by separately determining the video frame quality of the corresponding video frame, wherein the voting weight is higher for higher video frame quality. And then, obtaining a target detection result based on the detection result of each video frame and the voting weight of each video frame.
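As an illustrative sketch of the two correction variants described above, the following Python snippet shows a plain majority vote over the per-frame detection results and a quality-weighted vote; the representation of a detection result as a (person, clothing category) tuple is an assumption made for the example only.

```python
# Sketch of the two correction variants: plain majority vote and weighted vote.
from collections import Counter, defaultdict

def majority_vote(results):
    """results: list of (person_id, clothing_category) per video frame."""
    return Counter(results).most_common(1)[0][0]

def weighted_vote(results, weights):
    """Weight each frame's detection result by its video-frame quality weight."""
    scores = defaultdict(float)
    for result, weight in zip(results, weights):
        scores[result] += weight
    return max(scores, key=scores.get)
```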
In the embodiment of the disclosure, by partitioning a plurality of video frames of a video to be processed, after the detection result of each video frame in a partition is obtained, voting statistics can be performed on the detection results of the video frames in the partition to obtain the target detection result of each video frame, so that the detection results within the partition are stable, inconsistent detection results for the same target person and the same piece of clothing within a segment of video caused by changes in lighting, angle and the like are avoided, and the overall detection accuracy is improved. Compared with person detection and clothing detection on a single video frame or image, the method and device can receive a video to be processed uploaded by a user and detect a plurality of video frames, achieving a wider recommendation range.
Fig. 2 is an alternative flow chart diagram of a video apparel detection method provided by an embodiment of the disclosure, which may be executed by a processor of a computer device. Based on fig. 1, S101 in fig. 1 may be updated to S201 to S203, which will be described in conjunction with the steps shown in fig. 2.
Step S201, obtaining a plurality of video frames corresponding to the video to be processed.
In some embodiments, the video to be processed may include a plurality of original video frames having a time sequence relationship. Taking a frame rate of 30 as an example, a 1-second video to be processed includes 30 original video frames, and the plurality of video frames may be these original video frames of the video to be processed in their time sequence.
In other embodiments, to reduce the amount of data calculation, the video to be processed may be subjected to frame extraction processing based on a preset frequency to obtain a plurality of video frames; namely, frame extraction processing is carried out from a plurality of original video frames according to a preset frequency, and a plurality of video frames are obtained.
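As an illustrative (non-limiting) sketch of this frame extraction step, the following Python snippet samples one frame per preset interval; the use of OpenCV and the interval value are assumptions made for the example only.

```python
# Minimal frame-extraction sketch (assumption: the "preset frequency" is
# expressed as a fixed sampling interval in frames).
import cv2

def extract_frames(video_path: str, interval: int = 5):
    """Return every `interval`-th frame of the video as a list of BGR arrays."""
    cap = cv2.VideoCapture(video_path)
    frames, index = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:                    # end of stream
            break
        if index % interval == 0:     # keep one frame per preset interval
            frames.append(frame)
        index += 1
    cap.release()
    return frames
```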
Step S202, determining a change category corresponding to each video frame based on an adjacent video frame set corresponding to each video frame; the change category is used for representing the change degree of the video frame relative to the corresponding adjacent video frame.
In some embodiments, the determining of the change category corresponding to each of the video frames based on the neighboring video frame set corresponding to each of the video frames can be implemented through steps S2021 to S2022.
Step S2021, regarding each of the video frames, taking at least one video frame adjacent to the video frame as an adjacent video frame set corresponding to the video frame.
In some embodiments, for each video frame, in the determination of the video frame changing category, at least one video frame adjacent to the video frame may be acquired as an adjacent video frame set corresponding to the video frame. At least one video frame adjacent to the video frame may be uniformly distributed before the video frame, may be uniformly distributed after the video frame, and may be respectively distributed before and after the video frame.
Step S2022, inputting the adjacent video frame set corresponding to the video frame into the trained video segmentation network to obtain the change category corresponding to the video frame.
In some embodiments, a set of neighboring video frames including the video frame may be input to a video segmentation network, and the video segmentation network may predict a change category of the video frame based on a change between the video frames in the input set of neighboring video frames.
Wherein the variation classes may include a first class characterizing a first degree of variation and a second class characterizing a second degree of variation, the second degree of variation being greater than the first degree of variation. For example, the first category may be a category that characterizes the video frame as not significantly changing in the set of neighboring video frames; the second category may be a category characterizing the video frame as significantly changing in the set of neighboring video frames.
In some embodiments, the training process of the video segmentation model may include: obtaining a plurality of sample video frames and a standard class label corresponding to each sample video frame, where the standard class label may be the first class or the second class; sequentially acquiring sample video frame sets from the plurality of sample video frames with a preset sliding window, and inputting each sample video frame set into an initial video segmentation model to obtain a predicted class label corresponding to the sample video frame set, where the predicted class label represents the change category of the middle video frame of the sample video frame set; and determining a loss value based on the predicted class label and the standard class label, adjusting model parameters of the initial video segmentation model based on the loss value until the model converges, and determining the trained initial video segmentation model as the trained video segmentation network.
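As an illustrative sketch of how an adjacent video frame set may be formed for each video frame and fed to the video segmentation network, the following Python snippet uses a window clamped at the sequence boundaries; `segmentation_net`, the window size and the class constants are placeholders rather than part of the disclosed method.

```python
# Illustrative sliding-window batching for the video segmentation network.
FIRST_CLASS, SECOND_CLASS = 0, 1   # "no significant change" / "significant change"

def predict_change_categories(frames, segmentation_net, window=5):
    """Predict a change category for every frame from its neighbouring frames."""
    half = window // 2
    categories = []
    for i in range(len(frames)):
        # Clamp the adjacent-frame window at the sequence boundaries.
        lo, hi = max(0, i - half), min(len(frames), i + half + 1)
        neighbour_set = frames[lo:hi]
        categories.append(segmentation_net(neighbour_set))  # returns 0 or 1
    return categories
```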
step S203, dividing the plurality of video frames based on the change category corresponding to each video frame to obtain the partition information corresponding to each video frame.
In some embodiments, the categories of variation include a first category characterizing a first degree of variation and a second category characterizing a second degree of variation. Accordingly, the above-mentioned dividing the plurality of video frames based on the change category corresponding to each video frame can be implemented through steps S2031 to S2032, so as to obtain the partition information corresponding to each video frame.
Step S2031, clustering the video frames corresponding to each change category based on the time sequence relation corresponding to the plurality of video frames to obtain at least one video frame set; the video frame set comprises one video frame or at least two continuous video frames, and the at least two continuous video frames have the same change category.
In some embodiments, the clustered set of video frames may include one video frame, and may also include at least two video frames. Wherein, in case the set of video frames comprises at least two video frames, the at least two video frames are consecutive and have the same change category.
Illustratively, if there are 20 video frames, where the 1st to 8th video frames are of the first category, the 9th to 11th video frames are of the second category, the 12th to 14th video frames are of the first category, the 15th video frame is of the second category, and the 16th to 20th video frames are of the first category, then the first video frame set comprises the 1st to 8th video frames, the second video frame set comprises the 9th to 11th video frames, the third video frame set comprises the 12th to 14th video frames, the fourth video frame set comprises the 15th video frame, and the fifth video frame set comprises the 16th to 20th video frames.
Step S2032, dividing the plurality of video frames by taking the video frame positioned at the center of the target video frame set as a dividing reference to obtain the partition information corresponding to each video frame; the target video frame set is the video frame set of the second category.
Based on the above example, if there are 20 video frames, wherein the 1 st to 8 th video frames are in the first category, the 9 th to 11 th video frames are in the second category, the 12 th to 14 th video frames are in the first category, the 15 th video frame is in the second category, and the 16 th to 20 th video frames are in the first category; then the 10 th video frame in the 9 th to 11 th video frames in the second category can be used as the first division reference, and the 15 th video frame in the second category can be used as the second division reference; further, the 20 video frames may be divided into 1 st to 9 th video frames as a first partition, 11 th to 14 th video frames as a second partition, and 16 th to 20 th video frames as a third partition.
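As an illustrative sketch of this partitioning step, the following Python snippet clusters consecutive frames by change category, takes the centre frame of each second-category set as a division reference, and reproduces the worked example above; it is a simplified example rather than the exact disclosed implementation.

```python
# Sketch: consecutive frames with the same change category form a set, and the
# centre frame of every second-category set is used as a division reference.
from itertools import groupby

def partition_by_change(categories, second_class=1):
    """Return a list of partitions, each a list of frame indices (0-based)."""
    # 1. Cluster consecutive frames sharing the same change category.
    runs, start = [], 0
    for cat, grp in groupby(categories):
        length = len(list(grp))
        runs.append((cat, list(range(start, start + length))))
        start += length
    # 2. Centre frames of second-category runs are the division references.
    refs = {run[len(run) // 2] for cat, run in runs if cat == second_class}
    # 3. Split the frame sequence at the references, dropping the references.
    partitions, current = [], []
    for idx in range(len(categories)):
        if idx in refs:
            if current:
                partitions.append(current)
            current = []
        else:
            current.append(idx)
    if current:
        partitions.append(current)
    return partitions

# Worked example from the text: frames 9-11 and 15 (1-based) are second
# category, so the references are frames 10 and 15, giving partitions
# 1-9, 11-14 and 16-20.
cats = [0] * 8 + [1] * 3 + [0] * 3 + [1] + [0] * 5
print([[i + 1 for i in p] for p in partition_by_change(cats)])
```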
Based on the embodiment, because the plurality of video frames of the video to be processed are partitioned, after the detection result of each video frame in the partition is obtained, voting statistics can be carried out based on the detection result of each video frame in the partition to obtain the target detection result of each video frame, so that the detection result in the partition tends to be stable, the inconsistency of the detection results of the same target person and the same type of clothes in a section of video due to the transformation of light, angles and the like is avoided, and the overall detection accuracy is improved.
Fig. 3 is an alternative flow chart diagram of a video apparel detection method provided by an embodiment of the disclosure, which may be executed by a processor of a computer device. Based on fig. 1, S102 in fig. 1 may be updated to S301 to S302, which will be described in conjunction with the steps shown in fig. 3.
Step S301, aiming at each video frame, carrying out person detection on the video frame, and determining the person information corresponding to the target person in the video frame.
In some embodiments, the above-mentioned person detection on the video frame may be implemented through steps S3011 to S3014, and the person information corresponding to the target person in the video frame is determined.
Step S3011, performing face detection on the video frame, and determining a face image corresponding to a face in the video frame.
In some embodiments, the step S3011 may be implemented as follows: performing face detection on the video frame, and determining a partial image where a face is located in the video frame and a key point position corresponding to the face; and calibrating the partial image of the face based on the positions of the key points corresponding to the face to obtain a face image corresponding to the face.
And S3012, extracting the face features of the face image to obtain the face features to be matched corresponding to the face.
In some embodiments, feature extraction may be performed on the face image based on a preset face feature extraction network to obtain a face feature to be matched corresponding to the face.
Step S3013, a preset face feature library is obtained, where the face feature library includes at least one preset person, and person information and a plurality of face features corresponding to each preset person.
In some embodiments, the preset persons may be persons to be identified. Taking a celebrity clothing recommendation scene as an example, the preset persons in the face feature library may be preset celebrity objects; taking a teacher clothing recommendation scene as an example, the preset persons in the face feature library may be preset teacher objects of the current school.
In some embodiments, for each preset person in the face feature library, in order to improve the accuracy of detecting persons in a video frame, multiple face features may be set for each preset person, and in the process of determining whether a person (face) in the video frame is a preset person in the face feature library, feature distances between the face features to be matched corresponding to the face in the video frame and each face feature corresponding to the preset person may be respectively calculated, so as to obtain whether the face in the video frame is the preset person.
Step S3014, determining a target preset person corresponding to the face among the preset persons based on feature distances between the face features to be matched and a plurality of face features corresponding to each preset person, and determining person information corresponding to the target preset person as person information corresponding to the target person.
In some embodiments, the determining, based on the feature distance between the facial feature to be matched and the plurality of facial features corresponding to each of the preset persons, a target preset person corresponding to the face among the preset persons includes: respectively determining the feature distance between the human face features to be matched and each human face feature corresponding to each preset figure; comparing the characteristic distance of each face characteristic with a preset distance threshold value, and determining a similar result corresponding to each face characteristic; the similarity result is used for representing whether the human face corresponds to a preset figure corresponding to the human face feature; and determining a target preset figure corresponding to the face in the plurality of preset figures based on the similarity result corresponding to each face feature.
In a case where N preset persons each correspond to 7 face features and each face feature is a 2048-dimensional feature vector, a 7N x 2048 feature matrix can be obtained. The face feature to be matched is also a 2048-dimensional feature vector, and 7N distance calculations between the face feature to be matched and the 7N x 2048 feature matrix yield a 1 x 7N distance matrix. Here, 1 means that 1 face, corresponding to 1 face feature to be matched, is detected in the current video frame; if M faces are detected, an M x 7N distance matrix is obtained. Based on a preset distance threshold, the 1 x 7N distance matrix is binarized, that is, elements smaller than the distance threshold (indicating similarity) are set to 1 and elements larger than the distance threshold are set to 0, so that a to-be-verified binary matrix (1 x 7N) can be obtained, where the binary matrix contains the similarity result of each face feature.
And under the condition that the elements in the binary matrix are all 0, the fact that the person in the video frame is not any preset person in the face feature library is represented.
In some embodiments, when the first similarity results all correspond to one matched preset person, determining the matched preset person as the target preset person; the first similarity result represents a preset figure corresponding to the face characteristic and corresponding to the face.
That is, when the elements with the value of 1 all fall within the interval of the same matched preset person, that matched preset person is taken as the target preset person corresponding to the face feature to be matched in the current video frame.
In some embodiments, in a case that the first similarity result corresponds to at least two matched preset persons, the target preset person is determined among the at least two matched preset persons based on a sum of feature distances between the facial features to be matched and a plurality of facial features corresponding to each of the matched preset persons.
In a case where elements with the value of 1 fall within the intervals of at least two matched preset persons, one target matched person needs to be further determined from the at least two matched preset persons as the target preset person corresponding to the face in the current video frame. Since the feature distances corresponding to the same person are small and the feature distances corresponding to different persons are large, for each of the at least two matched preset persons, the sum of the feature distances between the face feature to be matched and the face features of that matched preset person can be calculated, yielding at least two feature distance sums; the matched preset person corresponding to the smallest feature distance sum is taken as the target preset person corresponding to the face feature to be matched in the current video frame.
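As an illustrative numpy sketch of the matching logic above (7 face features per preset person and 2048-dimensional vectors follow the example in the text; the Euclidean distance metric is an assumption):

```python
# Hedged sketch of face matching against the preset face feature library.
import numpy as np

def match_face(query, library, threshold):
    """
    query:    (2048,) face feature to be matched.
    library:  (N, 7, 2048) - 7 face features for each of N preset persons.
    Returns the index of the target preset person, or None if no match.
    """
    n, k, dim = library.shape
    flat = library.reshape(n * k, dim)               # 7N x 2048 feature matrix
    dists = np.linalg.norm(flat - query, axis=1)     # the 7N distances (1 x 7N)
    binary = (dists < threshold).astype(int)         # 1 = similar, 0 = not similar
    if binary.sum() == 0:                            # no preset person matched
        return None
    matched = {i // k for i, v in enumerate(binary) if v == 1}
    if len(matched) == 1:                            # all hits in one person's interval
        return matched.pop()
    # Several candidates: pick the one with the smallest sum of feature distances.
    sums = {p: dists[p * k:(p + 1) * k].sum() for p in matched}
    return min(sums, key=sums.get)
```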
Step S302, clothing detection is carried out on the video frame, and clothing information of clothing corresponding to the target person is determined.
In some embodiments, the clothing information includes a clothing category of clothing, and the clothing detection on the video frame may be implemented through steps S3021 to S3024 to determine the clothing information corresponding to the target person.
And S3021, extracting a clothing feature map corresponding to the video frame.
And S3022, determining detection frame information corresponding to the clothing in the video frame based on the clothing feature map.
In some embodiments, the detection box information corresponding to the clothing includes the position of the detection box corresponding to the clothing in the video frame. Based on the clothing feature map corresponding to the video frame, a plurality of anchor boxes are placed at each feature point in the clothing feature map (anchor boxes of different sizes centered on the feature point), and each anchor box can predict the position of a candidate box. In the training process, anchor boxes close to the ground-truth clothing box are used as positive samples and the other anchor boxes as negative samples, so that at test time the prediction boxes corresponding to the positive-sample anchor boxes are accurate, and the 100 prediction boxes with the highest scores are selected as candidate boxes. Region pooling is performed on the obtained candidate boxes on the feature map, and the pooled features are used for classifying the candidate boxes and further refining them (determining position offsets and adjusting the positions of the candidate boxes), finally obtaining the category and final position of the clothing detection box.
And S3023, determining clothing characteristics corresponding to the clothing based on the detection frame information corresponding to the clothing.
In some embodiments, the step S3023 may be implemented by: based on the detection frame information corresponding to the clothing, a clothing picture corresponding to the clothing is intercepted from the video frame; and extracting the features of the clothing pictures to obtain clothing features corresponding to the clothing.
In some embodiments, the apparel feature includes a plurality of apparel sub-features, and step S3023 above may also be implemented by: based on the detection frame information corresponding to the clothing, a clothing picture corresponding to the clothing is intercepted from the video frame; performing data enhancement processing on the clothing picture to obtain at least one similar clothing picture corresponding to the clothing; and extracting the features of the clothing picture and the at least one similar clothing picture to obtain a plurality of clothing sub-features corresponding to the clothing.
For the clothing, based on the detection box information corresponding to the clothing, the clothing picture corresponding to the clothing can be cropped from the video frame. Because the clothing picture is a single image and clothing exhibits wrinkles, deformation, uneven aspect ratios and the like, features captured from a single clothing picture may suffer from contingency, randomness, background noise and so on. Therefore, intelligent data enhancement needs to be performed on the clothing picture to strengthen the network's perception of clothing details, such as texture patterns. The data enhancement processing includes at least one of various affine transformations such as translation, scaling and flipping. The clothing picture corresponding to the clothing in the video frame and at least one similar clothing picture obtained by data enhancement can then be obtained.
The same feature extraction method can be adopted for extracting the features of each picture aiming at the clothing picture and the at least one similar clothing picture to obtain a plurality of clothing sub-features corresponding to the clothing.
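As an illustrative sketch of the data enhancement and sub-feature extraction, the following Python snippet applies translation, scaling and flipping with OpenCV; the exact transform parameters and the `extract_feature` backbone are placeholders rather than part of the disclosed method.

```python
# Sketch of the data-enhancement step for a single cropped clothing picture.
import cv2
import numpy as np

def augment_clothing(picture: np.ndarray):
    """Return the original clothing picture plus several similar pictures."""
    h, w = picture.shape[:2]
    shift = cv2.warpAffine(picture, np.float32([[1, 0, 10], [0, 1, 10]]), (w, h))
    scale = cv2.resize(picture, None, fx=1.2, fy=1.2)
    flip = cv2.flip(picture, 1)                      # horizontal flip
    return [picture, shift, scale, flip]

def clothing_sub_features(picture, extract_feature):
    """One clothing sub-feature per original or augmented clothing picture."""
    return [extract_feature(img) for img in augment_clothing(picture)]
```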
And S3024, determining the clothing category of the clothing based on the clothing characteristics corresponding to the clothing.
In some embodiments, the apparel categories may include a clothing category, a texture category, and a dominant hue category. In order to improve the classification accuracy of the clothing category, a first classifier corresponding to the clothing category, a second classifier corresponding to the texture category and a third classifier corresponding to the dominant hue category may be constructed in advance, and clothing features (a plurality of clothing sub-features) corresponding to the clothing are input to the first classifier, the second classifier and the third classifier respectively to obtain the clothing category, the texture category and the dominant hue category of the clothing.
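As an illustrative PyTorch sketch of the three classification heads (the backbone feature size, the class counts per head and the averaging of sub-features are assumptions made for the example only):

```python
# Minimal multi-head classifier sketch for clothing, texture and dominant hue.
import torch
import torch.nn as nn

class ApparelHeads(nn.Module):
    def __init__(self, feat_dim=2048, n_clothing=50, n_texture=20, n_hue=12):
        super().__init__()
        self.clothing = nn.Linear(feat_dim, n_clothing)  # first classifier
        self.texture = nn.Linear(feat_dim, n_texture)    # second classifier
        self.hue = nn.Linear(feat_dim, n_hue)            # third classifier

    def forward(self, feats):
        # `feats` stacks several clothing sub-features; average them first.
        pooled = feats.mean(dim=0, keepdim=True)
        return (self.clothing(pooled).argmax(-1),
                self.texture(pooled).argmax(-1),
                self.hue(pooled).argmax(-1))
```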
Based on the embodiment, because the person detection and the decoration detection are respectively carried out on each video frame in the plurality of video frames, the clothing information and the person information corresponding to each video frame can be obtained, and the detection accuracy is improved.
Fig. 4 is an alternative flow chart diagram of a video apparel detection method provided by an embodiment of the disclosure, which may be executed by a processor of a computer device. The clothing information comprises data information of the clothing; based on fig. 3, S302 in fig. 3 may further include S401 to S403, which will be described in conjunction with the steps shown in fig. 4.
S401, determining a target retrieval strategy in a first retrieval strategy and a second retrieval strategy based on the data volume of a preset clothing library; the data volume corresponding to the first retrieval strategy is higher than the data volume corresponding to the second retrieval strategy; the preset clothing library comprises a plurality of preset clothing and data information corresponding to each preset clothing.
In some embodiments, a data amount threshold may be set, and in the case that the data amount of the preset clothing library is greater than or equal to the data amount threshold, the first retrieval policy is taken as the target retrieval policy; and taking the second retrieval strategy as the target retrieval strategy when the data volume of the preset clothing library is smaller than the data volume threshold value.
In some embodiments, the data information corresponding to the preset clothes may include a clothes picture, a clothes name, a purchase link, and the like of the preset clothes.
Step S402, determining at least one preset clothing matched with the clothing in the plurality of preset clothing by using the target retrieval strategy.
In some embodiments, in the case that the target retrieval policy is the first retrieval policy, the above-mentioned determining, by using the target retrieval policy, at least one preset clothing matching with the clothing among the plurality of preset clothing may be implemented by steps S4021 to S4024.
S4021, determining clothing feature vectors corresponding to the clothing pictures based on the clothing pictures.
Step S4022, determining a target first central feature matched with the clothing feature vector in a plurality of first central features corresponding to the plurality of preset clothing; the plurality of first central features are determined after clustering a plurality of preset clothing vectors in the preset clothing library, and each first clustering result obtained by clustering corresponds to one first central feature.
S4023, based on the target first central feature, performing quantitative coding on the clothing feature vector to obtain clothing codes.
In some embodiments, the above-mentioned quantization encoding of the clothing feature vector based on the target first central feature may be implemented by: based on different characteristic positions, performing intra-characteristic splitting on the clothing characteristic vector to obtain clothing sub-vectors corresponding to each characteristic position corresponding to the clothing characteristic vector; based on the second central feature of each feature position corresponding to the target first central feature, carrying out quantization coding on the clothing sub-vector corresponding to each feature position to obtain a quantization value of the clothing sub-vector corresponding to each feature position; the second central feature is determined after clustering preset subvectors of a plurality of feature positions corresponding to the target first central feature, and each second clustering result obtained by clustering corresponds to one second central feature; and determining the clothing code based on the quantized value of the clothing sub-vector corresponding to each characteristic position.
S4024, determining the preset clothes corresponding to the preset clothes code corresponding to the target coding result as at least one preset clothes matched with the clothes; the target coding result is at least one quantization coding result matched with the clothing code in a plurality of quantization coding results corresponding to the target first central feature; the method for carrying out quantitative coding on the preset clothing vector corresponding to the target first central feature is the same as the method for carrying out quantitative coding on the clothing feature vector.
In some embodiments, the encoding method includes:
clustering a plurality of preset clothing vectors in the preset clothing library to obtain a plurality of first clustering results; each first clustering result corresponds to a first central feature;
for each first clustering result, performing intra-feature splitting on each preset clothing vector corresponding to the first clustering result based on different feature positions to obtain clothing sub-features corresponding to each feature position corresponding to each preset clothing vector;
clustering clothes sub-features corresponding to the feature positions in each first clustering result aiming at each feature position to obtain a plurality of second clustering results; each second clustering result corresponds to a second central feature; based on each second clustering result corresponding to a second central feature, carrying out quantization coding on each clothing sub-feature corresponding to the feature position to obtain a quantization value of each clothing sub-feature corresponding to the feature position;
and for each preset clothing vector, determining a quantized coding result of the preset clothing vector based on the quantized value of each clothing sub-feature in the preset clothing vector.
In some embodiments, in the case that the target retrieval policy is the second retrieval policy, the above-mentioned determining, by using the target retrieval policy, at least one preset clothing matching with the clothing among the plurality of preset clothing may be implemented by steps S4025 to S4027.
S4025, determining a whole feature vector and at least one local feature vector corresponding to the clothing picture based on the clothing picture of the clothing.
Step S4026, aiming at each preset clothing in the preset clothing, determining a first similarity between a preset overall vector and the overall characteristic vector corresponding to the preset clothing, and determining a second similarity between each preset local vector and each local characteristic vector corresponding to the preset clothing; determining a preset similarity between the preset clothing and the clothing based on the first similarity and at least one second similarity.
In some embodiments, a corresponding graph inference network may be constructed based on the first similarity and the at least one second similarity, where one node represents the first similarity, the other nodes respectively represent the second similarities, and each edge represents the relationship between two similarities (between the first similarity and a second similarity, or between two second similarities). Finally, the network judges the node of the global feature using a cross-entropy classification loss function to determine the preset similarity between the preset clothing and the clothing.
Step S4027, determining at least one preset clothing matched with the clothing in the plurality of preset clothing based on the preset similarity corresponding to each preset clothing.
In some embodiments, the at least one preset apparel with the highest preset similarity is used as the at least one preset apparel matched with the apparel.
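As an illustrative, simplified sketch of the second retrieval strategy: the graph inference network described above is not reproduced here and is replaced by a plain weighted combination of the global (first) similarity and the mean local (second) similarity; the weights and the cosine metric are assumptions made for the example only.

```python
# Simplified ranking sketch combining global and local clothing similarities.
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def rank_library(query_global, query_locals, library, top_k=5, w_global=0.6):
    """library: list of (item_id, global_vec, [local_vec, ...])."""
    scored = []
    for item_id, g_vec, l_vecs in library:
        first = cosine(query_global, g_vec)                        # first similarity
        seconds = [cosine(q, l) for q, l in zip(query_locals, l_vecs)]
        second = sum(seconds) / len(seconds) if seconds else 0.0   # second similarities
        scored.append((item_id, w_global * first + (1 - w_global) * second))
    scored.sort(key=lambda x: x[1], reverse=True)
    return scored[:top_k]
```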
Step S403, determining data information corresponding to each preset garment matched with the garment as data information of the garment.
Based on the embodiment, the target retrieval strategy is determined in the first retrieval strategy and the second retrieval strategy based on the data volume of the preset clothing library, the preset clothing matched with the clothing is retrieved from the preset clothing library based on the target retrieval strategy, and the data information corresponding to the preset clothing is determined as the data information of the clothing, so that different retrieval strategies can be flexibly selected, and the retrieval efficiency is improved.
Fig. 5 is an alternative flow chart diagram of a video apparel detection method provided by an embodiment of the disclosure, which may be executed by a processor of a computer device. Based on fig. 1, the clothing information of the target person includes the detection box information and the clothing category corresponding to each clothing in the video frame, S103 in fig. 1 may be updated to S501 to S502, which will be described with reference to the steps shown in fig. 5.
S501, classifying each clothing in each video frame in each partition to obtain at least one detection category; and the information of the detection frame of the clothes corresponding to the detection category in the corresponding at least one video frame to be corrected meets a preset overlapping condition.
In some embodiments, all the detection boxes in the partition may be classified based on the detection box information corresponding to each piece of clothing in each video frame, and the number of pieces of clothing appearing in the video frames of the current partition is determined, where each piece of clothing corresponds to one detection category. The detection box information of the clothing corresponding to a detection category in the corresponding at least one video frame to be corrected satisfies a preset overlap condition. For ease of understanding, take the case where each video frame includes one piece of clothing as an example: the detection box information of the piece of clothing in each of the N video frames of the partition is acquired, and based on a preset classification algorithm, if the intersection-over-union of every two of the N pieces of detection box information corresponding to the N video frames is greater than a preset threshold, the partition corresponds to one detection category; that is, only one piece of clothing in the real scene appears in the partition, and this piece of clothing has corresponding clothing information (detection box information and clothing category) in each video frame of the partition.
For example, in a case where one partition includes 3 video frames (a first video frame, a second video frame, and a third video frame), suppose the detection result of the first video frame includes jacket A1 and trousers B1, the detection result of the second video frame includes jacket A1 and trousers B2, and the detection result of the third video frame includes jacket A2 and trousers B1. In the process of classifying the partition to obtain at least one detection category, the 6 pieces of detection box information are classified: if the three jacket detection boxes of jacket A1 of the first video frame, jacket A1 of the second video frame and jacket A2 of the third video frame satisfy the preset overlap condition, the three jacket detection boxes are divided into a first detection category, and the video frames to be corrected corresponding to the first detection category are the first to third video frames; correspondingly, if the two trousers detection boxes of trousers B2 of the second video frame and trousers B1 of the third video frame satisfy the preset overlap condition, the two trousers detection boxes are divided into a second detection category, and the video frames to be corrected corresponding to the second detection category are the second and third video frames; the trousers detection box of trousers B1 of the first video frame is divided into a third detection category, and the video frame to be corrected corresponding to the third detection category is the first video frame.
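As an illustrative sketch of this grouping step, the following Python snippet greedily merges detection boxes whose intersection-over-union exceeds a preset threshold into one detection category; the (x1, y1, x2, y2) box format is an assumption made for the example only.

```python
# Sketch: group per-frame clothing detection boxes into detection categories.
def iou(a, b):
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda box: (box[2] - box[0]) * (box[3] - box[1])
    return inter / (area(a) + area(b) - inter + 1e-12)

def group_detections(detections, threshold=0.5):
    """detections: list of (frame_idx, box, clothing_category). Greedy grouping."""
    groups = []                                   # each group = one detection category
    for det in detections:
        for group in groups:
            if iou(det[1], group[0][1]) > threshold:
                group.append(det)
                break
        else:
            groups.append([det])
    return groups
```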
Step S502, aiming at each dress, determining a target dress category corresponding to the dress based on the dress category corresponding to the dress in each video frame to be corrected.
Based on the above example, suppose a first detection category (the three jacket detection boxes of jacket A1 of the first video frame, jacket A1 of the second video frame, and jacket A2 of the third video frame), a second detection category (the two trousers detection boxes of trousers B2 of the second video frame and trousers B1 of the third video frame), and a third detection category (the trousers detection box of trousers B1 of the first video frame) are obtained. Because the third detection category contains only one video frame, the trousers detection box of the first video frame is retained and its clothing category is trousers B1. For the first detection category, the target clothing category corresponding to the three jacket detection boxes is determined based on jacket A1 of the first video frame, jacket A1 of the second video frame and jacket A2 of the third video frame, that is, whether the clothing corresponding to the first detection category is jacket A1 or jacket A2 is determined; for the second detection category, the target clothing category corresponding to the two trousers detection boxes needs to be determined based on trousers B2 of the second video frame and trousers B1 of the third video frame, that is, whether the clothing corresponding to the second detection category is trousers B1 or trousers B2.
In some embodiments, the determining of the target clothing category corresponding to the clothing based on the clothing category corresponding to the clothing in each of the video frames to be corrected may be implemented through steps S5021 to S5023.
Step S5021, obtaining the quality of the video frame corresponding to each video frame to be corrected.
In some embodiments, the obtaining the video frame quality corresponding to each of the video frames to be corrected includes: for each video frame to be corrected, determining a clothing region corresponding to the video frame to be corrected based on the detection frame information corresponding to the clothing; and determining the quality of the video frame corresponding to the video frame to be corrected based on the video frame to be corrected and the clothing area.
Wherein the video frame quality comprises at least one of: the shielding degree of the clothes, the definition of the clothes area corresponding to the clothes and the brightness of the clothes area corresponding to the clothes.
Step S5022, determining voting weight corresponding to each video frame to be corrected based on the quality of the video frame corresponding to each video frame to be corrected; and the voting weight corresponding to the video frame to be corrected is positively correlated with the quality of the video frame corresponding to the video frame to be corrected.
Step S5023, based on the voting weight corresponding to each video frame to be corrected and the clothing category corresponding to the clothing in each video frame to be corrected, the target clothing category corresponding to the clothing is determined.
In some embodiments, the video frame quality of each video frame may be estimated based on the degree of occlusion of the clothing corresponding to the video frame, the sharpness of the clothing region corresponding to the clothing, and the brightness of the clothing region corresponding to the clothing, so as to obtain a video frame quality quantization value; and determining the voting weight corresponding to each video frame to be corrected based on the video frame quality quantization value corresponding to each video frame to be corrected.
For example, taking the first detection category as an example, the target clothing category corresponding to the three jacket detection frames needs to be determined based on jacket A1 in the first video frame, jacket A1 in the second video frame, and jacket A2 in the third video frame. If the video frame quality quantization values of the first to third video frames are 2, 2 and 6 respectively, the voting weights corresponding to the first to third video frames can be determined to be 0.2, 0.2 and 0.6 respectively. Combining the clothing category of the clothing in each video frame, the voting result is 0.4 for jacket A1 and 0.6 for jacket A2, so the clothing categories of the three jacket detection frames corresponding to the first to third video frames are all set to the target clothing category (jacket A2).
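For ease of understanding, a short Python sketch of this quality-weighted vote is given below; the quantized quality values (2, 2, 6) and category labels are taken from the example above, while the function name and the use of normalized quality as the voting weight follow directly from those numbers.

```python
# Sketch of the quality-weighted vote in the example above.
from collections import defaultdict

def vote_category(frames):
    """frames: list of (quality_value, clothing_category)."""
    total = sum(q for q, _ in frames)
    scores = defaultdict(float)
    for quality, category in frames:
        scores[category] += quality / total   # voting weight = normalized quality
    return max(scores, key=scores.get)

print(vote_category([(2, "jacket A1"), (2, "jacket A1"), (6, "jacket A2")]))
# -> "jacket A2" (0.4 vs 0.6), matching the example
```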
In other embodiments, the video frame quality includes quality information for at least one quality dimension, and the apparel category includes at least one apparel subcategory.
Accordingly, the determining the voting weight corresponding to each video frame to be corrected based on the quality of the video frame corresponding to each video frame to be corrected includes: for each of the clothing subcategories, determining a voting sub-weight corresponding to the clothing subcategory based on the degree of association between the clothing subcategory and each of the quality dimensions and the quality information of each of the quality dimensions.
Correspondingly, the determining a target clothing category corresponding to the clothing based on the voting weight corresponding to each video frame to be corrected and the clothing category corresponding to the clothing in each video frame to be corrected includes: for each of the clothing subcategories, determining a target clothing subcategory corresponding to the clothing based on the voting sub-weight corresponding to the clothing subcategory and the clothing subcategory corresponding to the clothing in each of the video frames to be corrected.
In some embodiments, apparel categories may include apparel sub-categories of different dimensions, and for each apparel sub-category, a voting sub-weight corresponding to the apparel sub-category may be determined based on a degree of association between the apparel sub-category and each of the quality dimensions to which it corresponds, and quality information for each of the quality dimensions.
In some embodiments, take the case where the clothing category includes a dominant hue category: when the video frame quality includes the degree of occlusion of the clothing, the sharpness of the clothing region corresponding to the clothing, and the brightness of the clothing region corresponding to the clothing, the dominant hue category is highly associated with the brightness of the clothing region and weakly associated with the degree of occlusion of the clothing and the sharpness of the clothing region. Take the case where the clothing category includes a texture category: the texture category is highly associated with the degree of occlusion and the sharpness of the clothing region, and weakly associated with the brightness of the clothing region.
Illustratively, taking the second detection category as an example, the target clothing categories corresponding to the two trousers detection frames are determined based on the trousers of the second video frame (hue C1, texture D1) and the trousers of the third video frame (hue C2, texture D2), that is, whether the hue of the trousers belongs to C1 or C2, and whether the texture belongs to D1 or D2. Suppose the brightness quantization value of the second video frame is 3, its occlusion-degree quantization value is 6, and its sharpness is 6, while the brightness quantization value of the third video frame is 6, its occlusion-degree quantization value is 3, and its sharpness is 3. When determining the hue of the trousers, association degrees of 6, 2 and 2 can be set for brightness, occlusion degree and sharpness respectively, giving a voting score of 42 for hue C1 and 48 for hue C2, so the hue of the trousers belongs to C2. When determining the texture of the trousers, association degrees of 2, 4 and 4 can be set for brightness, occlusion degree and sharpness respectively, giving a voting score of 54 for texture D1 and 36 for texture D2, so the texture of the trousers belongs to D1. It can be seen that, for one detection category, i.e., for the same clothing, the target clothing subcategories of that clothing may come from different video frames to be corrected.
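The per-subcategory vote can be sketched in Python as follows, reproducing the trousers example above; all numeric values come from the example, and the function and variable names are illustrative.

```python
# Sketch of per-subcategory voting with quality-dimension association weights.

def vote_subcategory(frames, association):
    """frames: list of (quality_dict, subcategory_value);
    association: weight per quality dimension for this subcategory."""
    scores = {}
    for quality, value in frames:
        weight = sum(association[dim] * quality[dim] for dim in association)
        scores[value] = scores.get(value, 0) + weight
    return max(scores, key=scores.get), scores

q2 = {"brightness": 3, "occlusion": 6, "sharpness": 6}   # second video frame
q3 = {"brightness": 6, "occlusion": 3, "sharpness": 3}   # third video frame

hue, hue_scores = vote_subcategory(
    [(q2, "C1"), (q3, "C2")],
    {"brightness": 6, "occlusion": 2, "sharpness": 2})
texture, tex_scores = vote_subcategory(
    [(q2, "D1"), (q3, "D2")],
    {"brightness": 2, "occlusion": 4, "sharpness": 4})
print(hue, hue_scores)       # C2 {'C1': 42, 'C2': 48}
print(texture, tex_scores)   # D1 {'D1': 54, 'D2': 36}
```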
Based on the above embodiments, voting statistics are performed on the detection results of the video frames within a partition to obtain the target detection result of each video frame, so that the detection results within the partition tend to be stable. This avoids inconsistent detection results for the same target person and the same clothing within a video segment caused by changes in light, angle and the like, and improves the overall detection accuracy.
Fig. 6 is an alternative flow chart diagram of a video apparel detection method provided by an embodiment of the disclosure, which may be executed by a processor of a computer device. Based on the above embodiment, taking fig. 1 as an example, the method in fig. 1 may further include S601 to S603, which will be described with reference to the steps shown in fig. 6.
And S601, playing the video to be processed through a clothing display interface.
In some embodiments, the apparel presentation interface may include a video playback area within which the to-be-processed video is played.
Step S602, in the process of displaying the target video frame in the video to be processed, displaying the character information of the target character and the clothing information corresponding to the target character in the target video frame.
In some embodiments, the target video frame is any one of a plurality of original video frames included in the video to be processed. Step S602 describes, taking the display of the target video frame as an example, what is displayed in the other areas of the clothing display interface. That is, as the video to be processed plays, the target video frame in the video playing area changes, and the content displayed in the other areas changes with the target video frame accordingly.
In some embodiments, in the process of presenting the target video frame in the video to be processed, the character information of the target character and the clothing information corresponding to the target character are displayed in the target video frame. The character information and the clothing information may indicate the positions/ranges of the characters and the clothing in the form of rectangular frames, and display the identity information of the characters and the classification information corresponding to the clothing in the form of text.
In some embodiments, the displaying of the person information of the target person and the clothing information corresponding to the target person in the target video frame may be implemented through steps S6021 to S6022.
Step S6021, displaying the character information of the target character in the target video frame through the character display area in the clothing display interface.
In some embodiments, the person information of the target person includes a local face image of the target person in the target video frame. Correspondingly, the character display area may include a plurality of character sub-areas, and in a case that there are N target characters in the target video frame, N character sub-areas may be displayed in the character display area, and at the same time, each character sub-area displays a local face image of the corresponding target character in the current target video frame.
In some embodiments, the personal information of the target person includes the identity information of the target person.
Step S6022, displaying the clothing information corresponding to the target person in the target video frame through the clothing display area in the clothing display interface; the clothing information comprises a local clothing picture and a clothing category corresponding to each clothing in the target video frame.
In some embodiments, the clothing information corresponding to the target person includes the local clothing picture corresponding to the target person in the target video frame. Accordingly, the clothing display area may include a plurality of clothing sub-areas; in the case that N clothing items exist in the target video frame, N clothing sub-areas may be displayed in the clothing display area, and each clothing sub-area displays the local clothing picture of the corresponding clothing item in the current target video frame.
In some embodiments, the apparel information corresponding to the target person includes an apparel category corresponding to each apparel. Wherein the apparel category may include, but is not limited to, at least one of: shirts, T-shirts, shorts, pants, dresses, jumpsuits, and the like.
In some embodiments, in order to facilitate the user to purchase the desired apparel, the method may further include steps S603 to S604.
Step S603, receiving a trigger operation for a target clothing picture in the local clothing picture corresponding to each clothing.
In some embodiments, a triggering operation of a user for a target dress picture in a local dress picture corresponding to at least one dress displayed in a current dress display area may be received through the dress display interface; in response to the trigger operation, step S604 is performed. The trigger operation may include a long-press operation, a click operation, a voice selection operation, or the like.
Step S604, responding to the triggering operation, displaying at least one relevant dress and a purchasing link corresponding to each relevant dress in a dress purchasing area in the dress displaying interface.
In some embodiments, while the video to be processed is being presented, the clothing display interface may not display the clothing purchase area if the trigger operation has not been received. In response to a trigger operation for a target clothing picture corresponding to the target video frame, the clothing purchase area can be displayed, showing at least one related clothing corresponding to the target clothing picture and the purchase link corresponding to each related clothing.
In some embodiments, the method may further include receiving, through a clothing purchase area of the clothing presentation interface, a selection operation of a user for a target related clothing of the at least one related clothing, and jumping to a purchase interface corresponding to a purchase link of the target related clothing in response to the selection operation.
Referring to fig. 7 by way of example, fig. 7 is an alternative interface schematic diagram of an apparel display interface provided by an embodiment of the present disclosure, and apparel display interface 710 may include a character display area 711, an apparel display area 712, and a video playing area 714. In response to a trigger operation for a target apparel picture in the apparel display area, at least one related apparel corresponding to the target apparel picture and a purchase link corresponding to each of the related apparel may be displayed in apparel purchase area 713.
Based on the embodiment, the character information of the target character and the clothing information of the clothing existing in the currently displayed target video frame can be displayed in real time in the process of displaying the video to be processed through the visual interface, so that the correlation degree of the retrieval result and the original video is improved; meanwhile, after the click operation of the user on a certain clothing picture is received, at least one piece of relevant clothing corresponding to the clothing picture and each purchase link corresponding to the relevant clothing are further displayed, interaction with the user is promoted, and meanwhile convenience is provided for the user to purchase the same clothing.
The following describes an application of the video clothing detection method provided by the embodiment of the present disclosure in an actual scene.
With the rapid development of the online economy in recent years, the online shopping market reached a transaction scale of 13 trillion yuan in 2021 and continues to grow strongly, with clothing transactions accounting for as much as 57.5%. In online clothing shopping, celebrity-driven purchases such as buying the same items worn by stars are very popular, but they are limited by objective factors such as image quality, unusual shooting angles and object size, and the image recognition capability of existing clothing retrieval algorithms under such special conditions still falls short of expectations.
Based on the wide market prospect and the technical pain points faced currently, the embodiment of the disclosure provides an AI star same style clothing shopping guide system based on video streaming, which integrates the functions of star identification, clothing attribute analysis, clothing recommendation and the like into a whole, makes up the defects of a single-frame image in a clothing retrieval scene through interval information in the form of video streaming, improves the accuracy of clothing retrieval and recommendation, and provides a more accurate and easy-to-use clothing shopping guide system for users.
Please refer to fig. 8, which illustrates a business process of the apparel shopping guide system in the embodiment of the present disclosure. As shown in fig. 8, the process includes:
step S801, acquiring a video to be processed, and performing frame extraction on the video to be processed to obtain a plurality of video frames.
The video uploading interface provided by the clothing shopping guide system can receive a section of video uploaded by a user, and the video is used as the video to be processed.
In some embodiments, the frame extraction processing may be performed on the video to be processed based on a preset frequency, so as to obtain a plurality of video frames corresponding to the video to be processed. Thereafter, the subsequent steps S802, S803, and S804 may be performed based on the plurality of video frames, respectively. Step S802, step S803, and step S804 may be performed simultaneously, and obtain a corresponding character (star) recognition result, a video slicing result, and a clothing attribute result, respectively.
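A minimal frame-extraction sketch in Python is given below for illustration; it assumes OpenCV is available, and the sampling frequency of 2 frames per second is an illustrative value rather than one specified by the embodiments.

```python
# Sketch: extract frames from the video to be processed at a preset frequency.
import cv2

def extract_frames(video_path, frames_per_second=2):
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 25        # fall back if FPS is unknown
    step = max(1, int(round(fps / frames_per_second)))
    frames, index = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if index % step == 0:
            frames.append(frame)
        index += 1
    cap.release()
    return frames
```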
And S802, identifying the target person in each video frame to obtain a person identification result.
In some embodiments, the identification of the person within each video frame may be accomplished by extracting facial features of the person in the video frame.
A face image in the video frame is obtained using a face detection technique, and then 5 key-point positions corresponding to the face image are obtained using a face key-point detection technique, namely the left eye position, the right eye position, the nose position, the left mouth corner position and the right mouth corner position. A manually set standard face is taken as the correction reference for face alignment, and the detected face is aligned with the standard face to obtain an aligned face image. Local texture features of the aligned face image are then extracted by a face feature extractor and reduced in dimension to obtain the face features of the person in the video frame.
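By way of illustration, the alignment step alone might be sketched in Python as follows; the face detector and feature extractor are placeholders, and the 5-point template coordinates for a 112×112 crop are an assumption introduced for this example.

```python
# Sketch: warp a detected face so its 5 key points match a standard face.
import cv2
import numpy as np

# Assumed template (left eye, right eye, nose, left/right mouth corner) for a
# 112x112 crop; the exact coordinates are illustrative, not from the disclosure.
STANDARD_FACE = np.float32([[38.3, 51.7], [73.5, 51.5], [56.0, 71.7],
                            [41.5, 92.4], [70.7, 92.2]])

def align_face(image, keypoints, size=(112, 112)):
    """keypoints: 5x2 array produced by a face key-point detector."""
    matrix, _ = cv2.estimateAffinePartial2D(np.float32(keypoints), STANDARD_FACE)
    return cv2.warpAffine(image, matrix, size)
```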
In some embodiments, in order to identify a target person in a video to be processed, for example a star appearing in the video, a preset face library corresponding to the target person (star) may be established in advance. The preset face library is constructed as follows: a face sample image of the target person is received through a preset face import interface, and the face detection and face alignment processes described above are performed on the face sample image to obtain a partial face image corresponding to the target person.
In order to improve the recognition accuracy for the target person, the obtained partial face images may be subjected to data cleaning, so as to obtain a preset number of standard face images corresponding to the target person. The preset number of standard face images is used to verify the face image to be recognized multiple times. Illustratively, the preset number may be set to 7.
In some embodiments, the process of identifying the person in the current video frame is accomplished as follows: the standard face images corresponding to N target persons are extracted; with 7 standard face images per target person, 7N standard face images are obtained. Feature extraction is performed on each standard face image to obtain 7N standard face features; with each standard face feature being a 2048-dimensional feature vector, a 7N × 2048 feature matrix is obtained. Meanwhile, feature extraction is performed on the face image corresponding to the person in the current video frame to obtain a 2048-dimensional feature vector, and 7N distance calculations between this feature vector and the 7N × 2048 feature matrix yield a 1 × 7N distance matrix. Here, 1 means that 1 face is detected in the current video frame; if M faces are detected, an M × 7N distance matrix is obtained.
For ease of understanding, the case where 1 face is detected in the current video frame, i.e. a 1 × 7N distance matrix is obtained, is taken as an example. Based on a preset distance threshold, the 1 × 7N distance matrix is binarized: elements larger than the distance threshold are set to a first value and elements smaller than the distance threshold to a second value, giving a binary matrix (1 × 7N) to be verified. If all elements in the binary matrix are 0, the person in the video frame is not a target person in the preset face library. If exactly one element in the binary matrix is 1, the target person corresponding to that element is taken as the target person corresponding to the face in the current video frame. If at least two elements are 1, it is judged whether the elements with value 1 all correspond to a single target person: if they fall within the interval of the same target person, that target person is taken as the target person corresponding to the face in the current video frame; if they fall within the intervals of at least two target persons, one target person must be further determined from them as the target person corresponding to the face in the current video frame. Because feature vectors of the same person are close to each other while those of different persons are far apart, for each of the at least two candidate target persons, the sum of the feature distances between the 2048-dimensional feature vector of the video frame and the 7 standard face features of that target person can be calculated, yielding at least two feature-distance sums, and the target person corresponding to the smallest feature-distance sum is taken as the target person corresponding to the face in the current video frame.
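A compact NumPy sketch of this verification step for one detected face might look as follows; the threshold value, the use of Euclidean distance and the direction of the comparison are assumptions made for illustration.

```python
# Sketch: identify one detected face against the 7N standard face features.
import numpy as np

def identify(face_feature, gallery, person_ids, threshold=1.0):
    """gallery: (7N, 2048) standard face features; person_ids: length-7N labels."""
    ids = np.array(person_ids)
    dists = np.linalg.norm(gallery - face_feature, axis=1)   # 1 x 7N distances
    hits = dists < threshold                                  # binarized matrix
    if not hits.any():
        return None                                           # not in the face library
    candidates = set(ids[hits])
    if len(candidates) == 1:
        return candidates.pop()
    # Tie-break: smallest sum of distances to each candidate's 7 standard features
    return min(candidates, key=lambda p: dists[ids == p].sum())
```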
And step S803, partitioning the plurality of video frames to obtain video partitioning results.
The video partitioning result includes partition information corresponding to each video frame.
In some embodiments, for each video frame, the change category of the video frame may be determined based on the adjacent video frame corresponding to the video frame, and then the plurality of video frames are divided based on the change category corresponding to each video frame, so as to obtain the partition information corresponding to each video frame.
Specifically, according to the temporal order of the plurality of video frames, a preset number of video frames are extracted through a sliding window of preset length and input into a video segmentation network, which determines the change category of the middle video frame of that window. The video segmentation network determines the change category of the middle video frame based on the content changes among the input video frames; the change category can include a first category (no change) and a second category (abrupt change or gradual change). After the change category of each video frame is obtained, the centre frame of each run of N consecutive second-category video frames is taken as a division reference, and the plurality of video frames are partitioned using the obtained division references to obtain the partition information corresponding to each video frame.
Illustratively, suppose there are 20 video frames, where the 1st to 8th video frames are of the first category, the 9th to 11th are of the second category, the 12th to 14th are of the first category, the 15th is of the second category, and the 16th to 20th are of the first category. The 10th video frame, the centre of the second-category 9th to 11th video frames, may be taken as the first division reference, and the second-category 15th video frame as the second division reference; the 20 video frames may then be divided so that the 1st to 9th video frames form a first partition, the 11th to 14th a second partition, and the 16th to 20th a third partition.
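The partitioning rule in this example can be sketched in Python as follows; the convention that division-reference frames themselves belong to no partition follows the example above, and the function name is illustrative.

```python
# Sketch: collapse each run of second-category frames to its centre frame and
# use those centres as division references.
def partition_frames(change_categories):
    """change_categories: list of 1 (no change) / 2 (abrupt or gradual) per frame.
    Returns a partition id per frame; division-reference frames get None."""
    refs, i, n = set(), 0, len(change_categories)
    while i < n:
        if change_categories[i] == 2:
            j = i
            while j < n and change_categories[j] == 2:
                j += 1
            refs.add((i + j - 1) // 2)      # centre index of the run
            i = j
        else:
            i += 1
    partitions, current = [], 0
    for idx in range(n):
        if idx in refs:
            partitions.append(None)
            current += 1
        else:
            partitions.append(current)
    return partitions

# 20-frame example above: frames 10 and 15 become references,
# yielding partitions 1-9, 11-14 and 16-20.
cats = [1]*8 + [2]*3 + [1]*3 + [2] + [1]*5
print(partition_frames(cats))
```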
In some embodiments, the video segmentation network is a three-dimensional convolution network, which may have 5 or 6 layers. The input data of the three-dimensional convolution network is a 5-dimensional matrix in the format batch-size × channels (number of channels of the video frame) × temporal-length (number of video frames) × height (height of the video frame) × width (width of the video frame), and the output is a 1 × 3 vector of logits. The loss function is a multi-class cross-entropy over these logits.
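A hedged PyTorch sketch of such a network is given below; the layer widths, kernel sizes and depth are illustrative choices, not the disclosed architecture, and only the input/output shapes follow the description above.

```python
# Sketch: small 3D-convolution network mapping a clip to 3 change-category logits
# (no change / abrupt change / gradual change) for the middle frame.
import torch
import torch.nn as nn

class ChangeClassifier3D(nn.Module):
    def __init__(self, in_channels=3, num_classes=3):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(in_channels, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool3d((1, 2, 2)),
            nn.Conv3d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool3d((2, 2, 2)),
            nn.Conv3d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1),
        )
        self.classifier = nn.Linear(64, num_classes)

    def forward(self, clip):
        # clip: (batch, channels, temporal, height, width)
        x = self.features(clip).flatten(1)
        return self.classifier(x)            # trained with cross-entropy loss

logits = ChangeClassifier3D()(torch.randn(1, 3, 8, 112, 112))   # shape (1, 3)
```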
And S804, identifying the clothes in each video frame to obtain a clothes attribute result corresponding to each video frame.
In some embodiments, the process of identifying apparel for each video frame may include an apparel detection step and an apparel attribute classification step.
In some embodiments, the apparel detection step comprises a two-stage detection method.
In the first stage, a candidate-frame detection network is used to detect candidate positions of clothing in the video frame: based on the feature map corresponding to the video frame, a plurality of anchors (anchors of different sizes centred on the feature point) are placed at each feature point of the feature map, and each anchor predicts the position of one candidate frame. During training, anchor frames close to the ground-truth clothing frame are taken as positive samples and the remaining anchor frames as negative samples, so that at test time the prediction frames corresponding to positive-sample anchor frames are accurate, and the 100 prediction frames with the highest scores are selected as candidate frames.
In the second stage, regional pooling is performed on the feature map for the candidate frames obtained in the first stage, and the pooled features are used to classify the candidate frames and further refine them (determining position offsets to adjust the candidate frame positions), finally yielding the categories and final positions of the clothing frames.
In some embodiments, based on the final position of the clothing frame obtained in the clothing detection step, a clothing picture corresponding to the clothing may be cropped from the video frame. To extract clothing features effectively, the clothing picture may first be scaled to 256 × 256 and then input into a deep convolutional neural network to obtain a 2048-dimensional clothing feature vector. Three classifiers corresponding to the category of the clothing frame obtained in the clothing detection step are then selected, and the clothing category, texture and dominant hue of the clothing are classified by the three classifiers respectively. The three classifiers share the previously extracted 2048-dimensional clothing feature vector; within each attribute classifier, a nonlinear transformation converts the 2048-dimensional clothing feature vector into a 128-dimensional attribute feature vector, and a softmax operation is then performed on the 128-dimensional attribute feature vector to obtain the corresponding attribute category.
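The shared-feature attribute heads can be sketched in PyTorch as follows; the backbone producing the 2048-dimensional clothing feature is omitted, and the numbers of classes per attribute are illustrative assumptions.

```python
# Sketch: three attribute heads sharing one 2048-d clothing feature vector.
import torch
import torch.nn as nn

class AttributeHead(nn.Module):
    def __init__(self, num_classes, feat_dim=2048, attr_dim=128):
        super().__init__()
        self.transform = nn.Sequential(nn.Linear(feat_dim, attr_dim), nn.ReLU())
        self.classifier = nn.Linear(attr_dim, num_classes)

    def forward(self, feature):
        # 2048-d feature -> 128-d attribute feature -> softmax over classes
        return torch.softmax(self.classifier(self.transform(feature)), dim=-1)

feature = torch.randn(1, 2048)                    # shared clothing feature vector
category_head = AttributeHead(num_classes=13)     # clothing category (class counts
texture_head = AttributeHead(num_classes=8)       #  are illustrative)
hue_head = AttributeHead(num_classes=12)          # dominant hue
probs = category_head(feature)
```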
In some embodiments, the process of identifying apparel for each video frame may further include a homogeneous apparel retrieval step. Wherein, the step of searching the same-style clothes can comprise a data enhancement process and a data searching process.
In some embodiments, the data enhancement process is used for intelligent data enhancement of the currently obtained clothing pictures. Because few same-style clothing samples are input, and clothing exhibits wrinkles, deformation and uneven size ratios, feature capture that relies on a single clothing picture can be accidental, random and affected by background noise. Intelligent data enhancement of the clothing pictures is therefore needed to strengthen the network's perception of clothing details, such as texture patterns. The data enhancement methods include various affine transformations such as translation, scaling and flipping. The clothing picture corresponding to the clothing in the video frame and at least one similar clothing picture obtained by data enhancement can then be obtained.
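A minimal sketch of such data enhancement, assuming OpenCV and NumPy, is shown below; the specific shift and scale factors are illustrative values.

```python
# Sketch: produce a few affine variants (flip, translation, scaling) of one
# clothing picture for same-style retrieval.
import cv2
import numpy as np

def enhance(clothing_picture):
    h, w = clothing_picture.shape[:2]
    variants = [cv2.flip(clothing_picture, 1)]                  # horizontal flip
    shift = np.float32([[1, 0, 0.05 * w], [0, 1, 0.05 * h]])    # small translation
    variants.append(cv2.warpAffine(clothing_picture, shift, (w, h)))
    scale = cv2.getRotationMatrix2D((w / 2, h / 2), 0, 1.1)     # 10% zoom-in
    variants.append(cv2.warpAffine(clothing_picture, scale, (w, h)))
    return variants
```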
In some embodiments, for the clothing in the video frame, performing the data retrieval process on both the clothing picture and the at least one similar clothing picture obtained by the data enhancement process improves retrieval accuracy compared with retrieving the single clothing picture. The data retrieval process can adopt different retrieval strategies for different data volumes, including a first retrieval strategy corresponding to million- to hundred-million-level data volumes and a second retrieval strategy corresponding to million-level data volumes.
Referring to fig. 9A, a flow chart of a first search strategy is shown, the first search strategy includes:
step one, carrying out quantization coding on each preset clothing vector in a preset clothing library to obtain a corresponding quantization coding result.
Referring to fig. 9A, first, coarse clustering is performed on all preset clothing vectors 911 to obtain K coarse clustering results 912, and a clustering center corresponding to each clustering result is stored. For each clustering result, uniformly splitting all preset clothing vectors 913 of the clustering result into L feature segments 914, and clustering each feature segment 914 to obtain a clustering center of each feature segment. And performing quantization coding on each feature segment based on the clustering center of each feature segment to obtain L quantization values corresponding to each preset clothing vector, and taking the L quantization values as the quantization coding result 915 corresponding to the preset clothing vector. As shown in fig. 9A, the first predetermined apparel vector may be quantized to obtain L quantized values "14, 201, 34, 67".
And step two, using the quantization coding method of step one, the clothing feature vector to be retrieved is coded to obtain a clothing code; the clothing code is compared with the quantization coding result corresponding to each preset clothing vector, and the same-style clothing data corresponding to the matched preset clothing vectors are taken as the retrieval result.
In the second step, for the clothing feature vector needing to be retrieved, comparing the clothing feature vector with the clustering center corresponding to each clustering result, and determining the most similar clustering center; uniformly splitting the clothing feature vector into L feature segments by adopting the same method as the step one, and carrying out quantitative coding on each feature segment based on the clustering center of each feature segment corresponding to the most similar clustering center to obtain L quantized values corresponding to the clothing feature vector; and determining a plurality of preset clothing vectors with the L quantization values most similar to the clothing characteristic vector from the quantization coding results corresponding to the plurality of preset clothing vectors corresponding to the most similar clustering centers as the matched preset clothing vectors.
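A NumPy sketch of the quantization coding shared by step one and step two might read as follows; the k-means clustering that produces the coarse centres and per-segment centres is assumed to have been run beforehand, and all names and shapes are illustrative.

```python
# Sketch: quantization-encode a clothing feature vector against precomputed
# coarse cluster centres and per-cluster segment centres (cf. "14, 201, 34, 67").
import numpy as np

def encode(vector, coarse_centres, segment_centres_per_cluster):
    """vector: (D,) clothing feature; coarse_centres: (K, D);
    segment_centres_per_cluster[k]: list of L arrays, each (C, D // L)."""
    coarse = int(np.argmin(np.linalg.norm(coarse_centres - vector, axis=1)))
    seg_centres = segment_centres_per_cluster[coarse]
    segments = np.split(vector, len(seg_centres))     # D must be divisible by L
    codes = [int(np.argmin(np.linalg.norm(c - seg, axis=1)))
             for seg, c in zip(segments, seg_centres)]
    return coarse, codes                              # L quantized values
```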
Referring to fig. 9B, a flow chart of a second search strategy is shown, wherein the second search strategy includes:
and taking the clothing picture and at least one similar clothing picture obtained by data enhancement as a picture to be inquired, and extracting the whole characteristic vector and the local characteristic vector corresponding to the picture to be inquired aiming at the picture to be inquired and the picture to be inquired. Meanwhile, for each preset clothing picture in the preset clothing library, extracting the corresponding overall characteristic vector and the corresponding local characteristic vector based on the same characteristic extraction method.
Referring to fig. 9B, in the process of extracting features of the to-be-queried picture 921 and the preset clothing picture 931, feature extraction may be performed on the to-be-queried picture 921 and the preset clothing picture 931 respectively by using a backbone network sharing weights, where the backbone network may be a pyramid network, and accordingly, multi-layer features of an input picture may be extracted. Then, performing Adaptive Window Pooling (Adaptive Window Pooling) on the multilayer features corresponding to the picture 921 to be queried to obtain an overall feature vector 922 and a local feature vector 923 corresponding to the picture 921 to be queried; maximum Pooling (Max Pooling) is performed on the multilayer features corresponding to the preset clothing picture 931 to obtain a whole feature vector 932 and a local feature vector 933 corresponding to the corresponding preset clothing picture 931. Then, determining local feature similarity 941 between every two local feature vectors 923 corresponding to the picture 921 to be queried and the local feature vectors 933 corresponding to the preset clothing picture 931 respectively; determining an overall similarity 942 between an overall feature vector 922 corresponding to the picture 921 to be queried and an overall feature vector 932 corresponding to the preset clothing picture 931; and constructing a corresponding graph inference network based on the local similarity 941 and the overall feature similarity 942, wherein in the graph inference network, one node represents the overall feature similarity, other nodes represent the local feature similarity of the picture 921 to be queried and the preset clothing picture 931 respectively, and each edge represents the relationship between the two similarities. Finally, the network judges the nodes of the global features by using the cross entropy classification loss function to determine whether the picture 921 to be queried and the preset clothing picture 931 belong to the same picture, and an output result 951 is obtained.
In some embodiments, in the process of extracting the multilayer features with the pyramid network, a triplet loss function can assist the network in discovering hard samples: for each sample, the maximum positive-sample distance (i.e., the hardest positive sample) and the minimum negative-sample distance (i.e., the hardest negative sample) are taken as the optimization targets of the loss function, so that distances between samples of the same class keep shrinking and distances between samples of different classes keep growing, yielding a better feature space and ensuring effective learning.
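A batch-hard variant of this triplet loss can be sketched in PyTorch as follows; the margin value is illustrative, and this is one common formulation rather than the disclosed one.

```python
# Sketch: triplet loss over the hardest positive and hardest negative per sample.
import torch

def batch_hard_triplet_loss(features, labels, margin=0.3):
    dists = torch.cdist(features, features)                   # pairwise distances
    same = labels.unsqueeze(0) == labels.unsqueeze(1)
    eye = torch.eye(len(labels), dtype=torch.bool)
    hardest_pos = (dists * (same & ~eye)).max(dim=1).values   # farthest same-class
    neg_dists = dists.masked_fill(same, float("inf"))
    hardest_neg = neg_dists.min(dim=1).values                 # closest other-class
    return torch.relu(hardest_pos - hardest_neg + margin).mean()
```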
Step S805, comparing and voting the clothing attribute results of the video frames in the same subarea based on the clothing attribute result corresponding to each video frame and the subarea information corresponding to each video frame to obtain the same clothing attribute result corresponding to the video frames in the same subarea.
In some embodiments, the comparison voting process is used to cluster the clothing attribute results of each video frame in a partition, determine the performance of the same item at different times, and then vote based on the clothing attribute results of the same object at different time dimensions, to finally obtain the final clothing attribute result of the object. Because the change of the same object in adjacent frames in the same scene in the video is not too large, all the recognition objects in different frames in the same subarea can be aggregated into the same object to be represented on the time axis.
In some embodiments, in the process of performing comparison voting, different voting weights may be set for each video frame in the same partition based on the quality of the video frames corresponding to different video frames, and in the process of performing comparison voting, a final clothing attribute result of the object may be obtained based on the voting weights corresponding to different video frames. The video frame quality may include, among other things, the brightness, sharpness, and degree of occlusion of the apparel picture.
Step S806, in the process of displaying the target video frame in the video to be processed, displaying the character information of the target character in the target video frame and the clothing attribute result corresponding to the target character.
In some embodiments, in the process of presenting the target video frame in the video to be processed, the character information of the target character and the result of the apparel attribute corresponding to the target character are displayed in the target video frame. The character information and clothing attribute result can display the corresponding positions/ranges of the characters and clothing in a rectangular frame mode, and display the identity information of the characters and the classification information corresponding to the clothing in a character mode.
In some embodiments, referring to fig. 10, fig. 10 shows an interface schematic diagram of a clothing shopping guide interface, and it can be seen that the clothing shopping guide interface includes a video playing area 1001, the video playing area 1001 is used for playing the video to be processed, taking the target video frame in the video to be processed as an example for displaying, and the clothing shopping guide interface further includes a person displaying area 1002, which displays the person information of the target person in the target video frame through the person displaying area 1002; the clothing shopping guide interface further comprises a clothing display area 1003, and clothing attribute results corresponding to the target characters in the target video frames are displayed through the clothing display area 1003.
In some embodiments, in response to a click operation on any clothing picture in clothing display area 1003, a clothing picture corresponding to the click operation may be displayed in clothing selection area 1004, and at the same time, a plurality of same-style clothing corresponding to the clothing picture and a purchase link corresponding to each same-style clothing are displayed in clothing purchase area 1005, and in response to a trigger operation on a target same-style clothing in the plurality of same-style clothing, a purchase interface corresponding to the target same-style clothing is jumped to.
In some embodiments, after interval-consistent voting results are obtained, the video frames are smoothed to prevent the identification information (the character information and the clothing attribute results) from flickering continuously under frame extraction. The smoothed character information and clothing attribute results are displayed on the video frames in the form of rectangular frames and classification information, a video is finally synthesized from the video frames, and the video together with the relevant SDK information is returned to the front-end page (the clothing shopping guide interface), where the output of each functional step is displayed accordingly.
Based on the above embodiments, compared with image retrieval and clothing recommendation for a single-frame image, the clothing shopping guide system can identify and recommend all clothing features in an uploaded video; it is not limited to a single-frame image, so the recommendation range is wider. Image enhancement is added before feature retrieval, and recognition of clothing details is strengthened through multiple data enhancement modes such as translation, rotation and scaling, which increases the diversity of the retrieval input and improves the retrieval effect. Voting over the algorithm results of a video interval is also provided, so that the clothing retrieval results within the interval tend to be stable, avoiding inconsistent retrieval results for the same clothing in a video segment caused by light, angle and the like, and giving a better visual perception. The embodiments of the present disclosure further provide single-frame image feature retrieval modes at two data scales, million-level and hundred-million-level, and a corresponding implementation can be selected based on the specific service scale and data scale.
Based on the foregoing embodiments, the embodiments of the present disclosure provide a video apparel detection apparatus, which includes each included unit and each module included in each unit, and may be implemented by a processor in a computer device; of course, the implementation can also be realized through a specific logic circuit; in the implementation process, the Processor may be a Central Processing Unit (CPU), a Microprocessor Unit (MPU), a Digital Signal Processor (DSP), a Field Programmable Gate Array (FPGA), or the like.
Fig. 11 is a schematic structural diagram of a video clothing detection apparatus provided in an embodiment of the present disclosure, and as shown in fig. 11, the video clothing detection apparatus 1100 includes: a partitioning module 1101, a detection module 1102, and a correction module 1103, wherein:
a partitioning module 1101, configured to partition a plurality of video frames of a video to be processed, and determine partition information corresponding to each of the video frames;
the detection module 1102 is configured to detect each video frame to obtain a detection result corresponding to a target person in each video frame; the detection result comprises the character information of the target character and the clothing information corresponding to the target character in the video frame;
a correcting module 1103, configured to correct detection results of video frames in the same partition based on partition information corresponding to each video frame, to obtain a target detection result corresponding to each video frame; and the target detection results corresponding to the video frames in the same partition are the same aiming at the detection results of the target person.
In some embodiments, the partition module 1101 is further configured to:
acquiring a plurality of video frames corresponding to the video to be processed;
determining a change category corresponding to each video frame based on a set of adjacent video frames corresponding to each video frame; the change category is used for representing the change degree of the video frame relative to the corresponding adjacent video frame;
and dividing the plurality of video frames based on the change category corresponding to each video frame to obtain the partition information corresponding to each video frame.
In some embodiments, the partition module 1101 is further configured to:
for each video frame, taking at least one video frame adjacent to the video frame as an adjacent video frame set corresponding to the video frame;
and inputting the adjacent video frame set corresponding to the video frame into the trained video segmentation network to obtain the change category corresponding to the video frame.
In some embodiments, the categories of variation include a first category characterizing a first degree of variation and a second category characterizing a second degree of variation; the second degree of variation is higher than the first degree of variation; the partitioning module 1101 is further configured to:
clustering the video frames corresponding to each change category based on the time sequence relation corresponding to the video frames to obtain at least one video frame set; the video frame set comprises one video frame or at least two continuous video frames, and the at least two continuous video frames have the same change category;
dividing the plurality of video frames by taking a video frame positioned in the center of a target video frame set as a dividing reference to obtain partition information corresponding to each video frame; the target video frame set is the video frame set of the second category.
In some embodiments, the detecting module 1102 is further configured to:
for each video frame, carrying out character detection on the video frame, and determining character information corresponding to the target character in the video frame;
and detecting the clothing of the video frame, and determining clothing information of clothing corresponding to the target character.
In some embodiments, the detecting module 1102 is further configured to:
performing face detection on the video frame, and determining a face image corresponding to a face in the video frame;
extracting the face features of the face image to obtain the face features to be matched corresponding to the face;
acquiring a preset human face feature library, wherein the human face feature library comprises at least one preset person, person information corresponding to each preset person and a plurality of human face features;
and determining a target preset figure corresponding to the face in the preset figures based on the feature distance between the face features to be matched and the plurality of face features corresponding to each preset figure, and determining figure information corresponding to the target preset figure as figure information corresponding to the target figure.
In some embodiments, the detecting module 1102 is further configured to:
respectively determining the feature distance between the human face features to be matched and each human face feature corresponding to each preset figure;
comparing the characteristic distance of each face characteristic with a preset distance threshold value, and determining a similar result corresponding to each face characteristic; the similarity result is used for representing whether the human face corresponds to a preset figure corresponding to the human face feature;
and determining a target preset figure corresponding to the face in the plurality of preset figures based on the similarity result corresponding to each face feature.
In some embodiments, the detecting module 1102 is further configured to:
under the condition that the first similarity results correspond to one matched preset person, determining the matched preset person as the target preset person; the first similarity result represents a preset figure corresponding to the face characteristic corresponding to the face;
and under the condition that the first similarity result corresponds to at least two matched preset persons, determining the target preset person in the at least two matched preset persons based on the sum of the feature distances between the facial features to be matched and the plurality of facial features corresponding to each matched preset person.
In some embodiments, the clothing information includes a clothing category of clothing, and the detecting module 1102 is further configured to:
extracting a clothing feature map corresponding to the video frame;
determining detection frame information corresponding to the clothing in the video frame based on the clothing feature map;
determining clothing characteristics corresponding to the clothing based on the detection frame information corresponding to the clothing;
and determining the clothing category of the clothing based on the clothing characteristics corresponding to the clothing.
In some embodiments, the detecting module 1102 is further configured to:
based on the detection frame information corresponding to the clothing, a clothing picture corresponding to the clothing is intercepted from the video frame;
and extracting the features of the clothing pictures to obtain clothing features corresponding to the clothing.
In some embodiments, the apparel feature comprises a plurality of apparel sub-features; the detecting module 1102 is further configured to:
based on the detection frame information corresponding to the clothing, a clothing picture corresponding to the clothing is intercepted from the video frame;
performing data enhancement processing on the clothing picture to obtain at least one similar clothing picture corresponding to the clothing;
and extracting the features of the clothing picture and the at least one similar clothing picture to obtain a plurality of clothing sub-features corresponding to the clothing.
In some embodiments, the apparel information includes data information for the apparel; the detecting module 1102 is further configured to:
determining a target retrieval strategy in a first retrieval strategy and a second retrieval strategy based on the data volume of a preset clothing library; the data volume corresponding to the first retrieval strategy is higher than the data volume corresponding to the second retrieval strategy; the preset clothing library comprises a plurality of preset clothing and data information corresponding to each preset clothing;
determining at least one preset clothing matched with the clothing in the plurality of preset clothing by using the target retrieval strategy;
and determining the data information corresponding to each preset dress matched with the dress as the data information of the dress.
In some embodiments, in the case that the target retrieval policy is the first retrieval policy, the detecting module 1102 is further configured to:
determining a clothing feature vector corresponding to the clothing picture based on the clothing picture of the clothing;
determining a target first central feature matched with the clothing feature vector in a plurality of first central features corresponding to the plurality of preset clothing; the plurality of first central features are determined after clustering a plurality of preset clothing vectors in the preset clothing library, and each first clustering result obtained by clustering corresponds to one first central feature;
based on the target first central feature, carrying out quantitative coding on the clothing feature vector to obtain clothing codes;
determining a preset clothing corresponding to a preset clothing code corresponding to the target coding result as at least one preset clothing matched with the clothing; the target coding result is at least one quantization coding result matched with the clothing code in a plurality of quantization coding results corresponding to the target first central feature; the method for carrying out quantitative coding on the preset clothing vector corresponding to the target first central feature is the same as the method for carrying out quantitative coding on the clothing feature vector.
In some embodiments, the detecting module 1102 is further configured to:
based on different feature positions, performing intra-feature splitting on the clothing feature vector to obtain clothing sub-vectors corresponding to each feature position corresponding to the clothing feature vector;
based on the second central feature of each feature position corresponding to the target first central feature, carrying out quantization coding on the clothing sub-vector corresponding to each feature position to obtain a quantization value of the clothing sub-vector corresponding to each feature position; the second central feature is determined after clustering preset subvectors of a plurality of feature positions corresponding to the target first central feature, and each second clustering result obtained by clustering corresponds to one second central feature;
and determining the clothing code based on the quantized value of the clothing sub-vector corresponding to each characteristic position.
In some embodiments, in the case that the target retrieval policy is the second retrieval policy, the detecting module 1102 is further configured to:
determining a whole characteristic vector and at least one local characteristic vector corresponding to the clothing picture based on the clothing picture of the clothing;
for each preset clothing in the plurality of preset clothing, determining a first similarity between the preset overall vector corresponding to the preset clothing and the overall feature vector, and determining a second similarity between each preset local vector corresponding to the preset clothing and each local feature vector; determining a preset similarity between the preset clothing and the clothing based on the first similarity and the at least one second similarity;
and determining at least one preset clothing matched with the clothing in the plurality of preset clothing based on the preset similarity corresponding to each preset clothing.
In some embodiments, the clothing information of the target person includes detection box information and clothing category corresponding to each clothing in the video frame; the detecting module 1102 is further configured to:
for each partition, classifying each clothing in each video frame in the partition to obtain at least one detection category; detecting frame information of the clothes corresponding to the detection category in the corresponding at least one video frame to be corrected meets a preset overlapping condition;
and for each clothing, determining a target clothing category corresponding to the clothing based on the clothing category corresponding to the clothing in each video frame to be corrected.
In some embodiments, the detecting module 1102 is further configured to:
acquiring the quality of a video frame corresponding to each video frame to be corrected;
determining a voting weight corresponding to each video frame to be corrected based on the quality of the video frame corresponding to each video frame to be corrected; the voting weight corresponding to the video frame to be corrected is positively correlated with the quality of the video frame corresponding to the video frame to be corrected;
and determining a target clothing category corresponding to the clothing based on the voting weight corresponding to each video frame to be corrected and the clothing category corresponding to the clothing in each video frame to be corrected.
In some embodiments, the detecting module 1102 is further configured to:
for each video frame to be corrected, determining a clothing region corresponding to the video frame to be corrected based on the detection frame information corresponding to the clothing;
determining the quality of a video frame corresponding to the video frame to be corrected based on the video frame to be corrected and the clothing area;
wherein the video frame quality comprises at least one of: the shielding degree of the clothes, the definition of the clothes area corresponding to the clothes and the brightness of the clothes area corresponding to the clothes.
In some embodiments, the video frame quality comprises quality information for at least one quality dimension, the apparel category comprises at least one apparel subcategory; the detecting module 1102 is further configured to:
for each clothing subcategory, determining a voting sub-weight corresponding to the clothing subcategory based on the degree of association between the clothing subcategory and each quality dimension and the quality information of each quality dimension;
the determining a target clothing category corresponding to the clothing based on the voting weight corresponding to each video frame to be corrected and the clothing category corresponding to the clothing in each video frame to be corrected comprises:
for each of the clothing subcategories, determining a target clothing subcategory corresponding to the clothing based on the voting sub-weight corresponding to the clothing subcategory and the clothing subcategory corresponding to the clothing in each of the video frames to be corrected.
In some embodiments, the video apparel detection apparatus 1100 further comprises a display module.
The display module is used for playing the video to be processed through a clothing display interface; and in the process of displaying a target video frame in the video to be processed, displaying the character information of the target character and the clothing information corresponding to the target character in the target video frame.
In some embodiments, the display module is further configured to:
displaying the character information of the target character in the target video frame through a character display area in the clothing display interface;
displaying the clothing information corresponding to the target character in the target video frame through a clothing display area in the clothing display interface; the clothing information comprises a local clothing picture and a clothing category corresponding to each clothing in the target video frame.
In some embodiments, the display module is further configured to:
receiving a trigger operation for a target clothing picture among the local clothing pictures corresponding to each clothing;
and in response to the trigger operation, displaying at least one related clothing and a purchase link corresponding to each related clothing in a clothing purchase area in the clothing display interface.
The above description of the apparatus embodiments is similar to the above description of the method embodiments, and the apparatus embodiments have beneficial effects similar to those of the method embodiments. In some embodiments, the functions of or modules included in the apparatuses provided in the embodiments of the present disclosure may be used to perform the methods described in the above method embodiments; for technical details not disclosed in the apparatus embodiments of the present disclosure, please refer to the description of the method embodiments of the present disclosure.
It should be noted that, in the embodiments of the present disclosure, if the video clothing detection method is implemented in the form of a software functional module and is sold or used as a standalone product, it may also be stored in a computer-readable storage medium. Based on such understanding, the technical solutions of the embodiments of the present disclosure may be embodied in the form of a software product, which is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the methods described in the embodiments of the present disclosure. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a Read Only Memory (ROM), a magnetic disk, or an optical disc. Thus, embodiments of the present disclosure are not limited to any specific hardware, software, or firmware, or any combination thereof.
The embodiment of the present disclosure provides a computer device, which includes a memory and a processor, where the memory stores a computer program that can be run on the processor, and the processor implements some or all of the steps of the above method when executing the program.
The disclosed embodiments provide a computer-readable storage medium having stored thereon a computer program that, when executed by a processor, performs some or all of the steps of the above-described method. The computer readable storage medium may be transitory or non-transitory.
The disclosed embodiments provide a computer program comprising computer-readable code; when the computer-readable code runs in a computer device, a processor in the computer device executes some or all of the steps for implementing the above method.
The disclosed embodiments provide a computer program product comprising a non-transitory computer-readable storage medium storing a computer program that, when read and executed by a computer, performs some or all of the steps of the above method. The computer program product may be embodied in hardware, software, or a combination thereof. In some embodiments, the computer program product is embodied in a computer storage medium, and in other embodiments, the computer program product is embodied in a software product, such as a Software Development Kit (SDK).
Here, it should be noted that the foregoing description of the various embodiments tends to emphasize the differences between the embodiments; for the parts that are the same or similar, the embodiments may be referred to one another. The above description of the apparatus, storage medium, computer program, and computer program product embodiments is similar to the description of the method embodiments above, with beneficial effects similar to those of the method embodiments. For technical details not disclosed in the embodiments of the disclosed apparatus, storage medium, computer program, and computer program product, reference is made to the description of the method embodiments of the present disclosure.
Fig. 12 is a schematic diagram of a hardware entity of a video clothing detection apparatus provided in an embodiment of the present disclosure, as shown in fig. 12, the hardware entity of the video clothing detection apparatus 1200 includes: a processor 1201 and a memory 1202, wherein the memory 1202 stores a computer program operable on the processor 1201, and the processor 1201 implements the steps of the method of any of the above embodiments when executing the program.
The memory 1202 stores a computer program executable on the processor 1201 and is configured to store instructions and applications executable by the processor 1201; it may also buffer data (e.g., image data, audio data, voice communication data, and video communication data) to be processed or already processed by the modules in the processor 1201 and the video apparel detection apparatus 1200, and may be implemented by a flash memory (FLASH) or a Random Access Memory (RAM).
The processor 1201 implements the steps of any of the above-described video apparel detection methods when executing the program. The processor 1201 generally controls the overall operation of the video apparel detection device 1200.
The disclosed embodiments provide a computer storage medium storing one or more programs, which are executable by one or more processors to implement the steps of the video apparel detection method of any of the above embodiments.
Here, it should be noted that: the above description of the storage medium and device embodiments is similar to the description of the method embodiments above, with similar advantageous effects as the method embodiments. For technical details not disclosed in the embodiments of the storage medium and apparatus of the present disclosure, reference is made to the description of the embodiments of the method of the present disclosure.
The processor may be at least one of an Application Specific Integrated Circuit (ASIC), a Digital Signal Processor (DSP), a Digital Signal Processing Device (DSPD), a Programmable Logic Device (PLD), a Field Programmable Gate Array (FPGA), a Central Processing Unit (CPU), a controller, a microcontroller, and a microprocessor. It is understood that the electronic device implementing the above processor functions may also be another device, and the embodiments of the present disclosure are not particularly limited in this regard.
The computer storage medium/memory may be a Read Only Memory (ROM), a Programmable Read-Only Memory (PROM), an Erasable Programmable Read-Only Memory (EPROM), an Electrically Erasable Programmable Read-Only Memory (EEPROM), a Ferromagnetic Random Access Memory (FRAM), a Flash Memory, a magnetic surface memory, an optical disc, or a Compact Disc Read-Only Memory (CD-ROM); it may also be included in various terminals such as mobile phones, computers, tablet devices, and personal digital assistants that contain one or any combination of the above memories.
It should be appreciated that reference throughout this specification to "one embodiment" or "an embodiment" means that a particular feature, structure or characteristic described in connection with the embodiment is included in at least one embodiment of the present disclosure. Thus, the appearances of the phrases "in one embodiment" or "in an embodiment" in various places throughout this specification are not necessarily all referring to the same embodiment. Furthermore, the particular features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. It should be understood that, in various embodiments of the present disclosure, the sequence numbers of the above steps/processes do not mean the execution sequence, and the execution sequence of each step/process should be determined by the function and the inherent logic thereof, and should not constitute any limitation on the implementation process of the embodiments of the present disclosure. The above-mentioned serial numbers of the embodiments of the present disclosure are merely for description and do not represent the merits of the embodiments.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
In the several embodiments provided in the present disclosure, it should be understood that the disclosed apparatus and method may be implemented in other ways. The above-described device embodiments are merely illustrative, for example, the division of the unit is only a logical functional division, and there may be other division ways in actual implementation, such as: multiple units or components may be combined, or may be integrated into another system, or some features may be omitted, or not implemented. In addition, the coupling, direct coupling or communication connection between the components shown or discussed may be through some interfaces, and the indirect coupling or communication connection between the devices or units may be electrical, mechanical or other forms.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units; can be located in one place or distributed on a plurality of network units; some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, all the functional units in the embodiments of the present disclosure may be integrated into one processing unit, or each unit may be separately regarded as one unit, or two or more units may be integrated into one unit; the integrated unit can be realized in a form of hardware, or in a form of hardware plus a software functional unit.
Those of ordinary skill in the art will understand that all or part of the steps for implementing the above method embodiments may be completed by hardware related to program instructions; the program may be stored in a computer-readable storage medium and, when executed, performs the steps of the above method embodiments. The aforementioned storage medium includes various media that can store program code, such as a removable storage device, a Read Only Memory (ROM), a magnetic disk, or an optical disk.
Alternatively, the integrated unit of the present disclosure may be stored in a computer-readable storage medium if it is implemented in the form of a software functional module and sold or used as a separate product. Based on such understanding, the technical solutions of the present disclosure may be embodied in the form of a software product, which is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the methods described in the embodiments of the present disclosure. And the aforementioned storage medium includes: a removable storage device, a ROM, a magnetic or optical disk, or other various media that can store program code.
The above description is only an embodiment of the present disclosure, but the scope of the present disclosure is not limited thereto, and any person skilled in the art can easily conceive of changes or substitutions within the technical scope of the present disclosure, and all the changes or substitutions should be covered by the scope of the present disclosure.

Claims (20)

1. A video apparel detection method, characterized in that the method comprises:
partitioning a plurality of video frames of a video to be processed, and determining partition information corresponding to each video frame;
detecting each video frame to obtain a detection result corresponding to a target person in each video frame; the detection result comprises the character information of the target character and the clothing information corresponding to the target character in the video frame;
correcting the detection result of the video frames in the same partition based on the partition information corresponding to each video frame to obtain the target detection result corresponding to each video frame; and the target detection results corresponding to the video frames in the same partition are the same aiming at the detection results of the target person.
2. The method according to claim 1, wherein the partitioning a plurality of video frames of the video to be processed and determining the partition information corresponding to each of the video frames comprises:
acquiring a plurality of video frames corresponding to the video to be processed;
determining a change category corresponding to each video frame based on a set of adjacent video frames corresponding to each video frame; the change category is used for representing the change degree of the video frame relative to the corresponding adjacent video frame;
and dividing the plurality of video frames based on the change category corresponding to each video frame to obtain the partition information corresponding to each video frame.
3. The method of claim 2, wherein the determining the change category corresponding to each of the video frames based on the neighboring set of video frames corresponding to each of the video frames comprises: for each video frame, taking at least one video frame adjacent to the video frame as an adjacent video frame set corresponding to the video frame; inputting the adjacent video frame set corresponding to the video frame into the trained video segmentation network to obtain the change category corresponding to the video frame;
and/or,
the variation classes comprise a first class characterizing a first degree of variation and a second class characterizing a second degree of variation; the second degree of variation is higher than the first degree of variation; the dividing the plurality of video frames based on the change category corresponding to each video frame to obtain the partition information corresponding to each video frame includes: clustering the video frames corresponding to each change category based on the time sequence relation corresponding to the video frames to obtain at least one video frame set; the video frame set comprises one video frame or at least two continuous video frames, and the at least two continuous video frames have the same change category; dividing the plurality of video frames by taking a video frame positioned in the center of a target video frame set as a dividing reference to obtain partition information corresponding to each video frame; the target video frame set is the video frame set of the second category.
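Claims 2 and 3 amount to run-length grouping of frames by change category and cutting partitions at the centre frame of each second-category (high-change) run; the sketch below assumes integer labels (0 for the first category, 1 for the second category), and the boundary convention itself is an assumption.

```python
from itertools import groupby

def partition_frames(change_categories, second_category=1):
    """Assign a partition index to every frame from its per-frame change category.

    change_categories: per-frame labels, e.g. 0 for the first (low-change)
    category and 1 for the second (high-change) category. Consecutive frames
    with the same label form a video frame set; the centre frame of every
    second-category set is used as the dividing reference.
    """
    # Build runs (sets) of consecutive frames sharing the same change category.
    runs, start = [], 0
    for label, group in groupby(change_categories):
        length = len(list(group))
        runs.append((label, start, start + length - 1))
        start += length

    # Centre frame of each target (second-category) set becomes a boundary.
    boundaries = [(s + e) // 2 for label, s, e in runs if label == second_category]

    partitions, part = [], 0
    for idx in range(len(change_categories)):
        while part < len(boundaries) and idx > boundaries[part]:
            part += 1
        partitions.append(part)
    return partitions
```

For example, `partition_frames([0, 0, 1, 1, 0, 0, 1, 0])` returns `[0, 0, 0, 1, 1, 1, 1, 2]`.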
4. The method according to any one of claims 1 to 3, wherein the detecting each of the video frames to obtain a detection result corresponding to a target person in each of the video frames comprises:
for each video frame, carrying out character detection on the video frame, and determining character information corresponding to the target character in the video frame;
and detecting the clothing of the video frame, and determining clothing information of clothing corresponding to the target character.
5. The method of claim 4, wherein the performing character detection on the video frame and determining the character information corresponding to the target character in the video frame comprises:
performing face detection on the video frame, and determining a face image corresponding to a face in the video frame;
extracting the face features of the face image to obtain the face features to be matched corresponding to the face;
acquiring a preset face feature library, wherein the face feature library comprises at least one preset person, and person information and a plurality of face features corresponding to each preset person;
and determining a target preset person corresponding to the face among the preset persons based on the feature distances between the face feature to be matched and the plurality of face features corresponding to each preset person, and determining the person information corresponding to the target preset person as the character information corresponding to the target character.
6. The method of claim 5, wherein the determining a target preset person corresponding to the face among the preset persons based on the feature distance between the facial feature to be matched and the plurality of facial features corresponding to each preset person comprises:
respectively determining the feature distance between the face feature to be matched and each face feature corresponding to each preset person;
comparing the feature distance of each face feature with a preset distance threshold, and determining a similarity result corresponding to each face feature; the similarity result is used for representing whether the face corresponds to the preset person corresponding to the face feature;
and determining a target preset person corresponding to the face among the plurality of preset persons based on the similarity result corresponding to each face feature.
7. The method of claim 6, wherein the determining a target preset person corresponding to the face among the plurality of preset persons based on the similarity result corresponding to each of the face features comprises:
determining the matched preset person as the target preset person under the condition that the first similarity results correspond to only one matched preset person; wherein a first similarity result represents that the face corresponds to the preset person corresponding to the face feature;
and under the condition that the first similarity results correspond to at least two matched preset persons, determining the target preset person among the at least two matched preset persons based on the sum of the feature distances between the face feature to be matched and the plurality of face features corresponding to each matched preset person.
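A sketch of the matching logic in claims 5 to 7 follows: each stored feature is compared with the face feature to be matched, a distance threshold yields the similarity results, and ties between several matched preset persons are broken by the sum of feature distances. The Euclidean distance, the threshold value, and "smallest sum wins" are assumptions.

```python
import numpy as np

def match_face(query_feat, face_library, dist_thresh=1.0):
    """Match a face feature to be matched against a preset face feature library.

    face_library: maps person_id -> list of 1-D face feature arrays.
    Euclidean distance, the threshold value, and breaking ties by the smallest
    sum of feature distances are assumptions; the claims only require a
    distance threshold and a tie-break based on the sum of feature distances.
    """
    matched, distance_sums = [], {}
    for person_id, feats in face_library.items():
        dists = [float(np.linalg.norm(query_feat - f)) for f in feats]
        distance_sums[person_id] = sum(dists)
        # A first similarity result: some stored feature is within the threshold.
        if any(d <= dist_thresh for d in dists):
            matched.append(person_id)

    if len(matched) == 1:
        return matched[0]                           # exactly one matched preset person
    if len(matched) >= 2:
        return min(matched, key=distance_sums.get)  # tie-break by summed distance
    return None                                     # no preset person matched the face
```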
8. The method according to any one of claims 4 to 7, wherein the clothing information includes clothing category of clothing, and the detecting clothing of the video frame and determining the clothing information of the clothing corresponding to the target person comprise:
extracting a clothing feature map corresponding to the video frame;
determining detection frame information corresponding to the clothing in the video frame based on the clothing feature map;
determining clothing characteristics corresponding to the clothing based on the detection frame information corresponding to the clothing;
and determining the clothing category of the clothing based on the clothing characteristics corresponding to the clothing.
9. The method of claim 8, wherein the determining the clothing feature corresponding to the clothing based on the detection box information corresponding to the clothing comprises: based on the detection frame information corresponding to the clothing, a clothing picture corresponding to the clothing is intercepted from the video frame; performing feature extraction on the clothing picture to obtain clothing features corresponding to the clothing;
and/or,
the apparel feature comprises a plurality of apparel sub-features; the determining the clothing characteristics corresponding to the clothing based on the detection frame information corresponding to the clothing comprises: based on the detection frame information corresponding to the clothing, a clothing picture corresponding to the clothing is intercepted from the video frame; performing data enhancement processing on the clothing picture to obtain at least one similar clothing picture corresponding to the clothing; performing feature extraction on the clothing picture and the at least one similar clothing picture to obtain a plurality of clothing sub-features corresponding to the clothing;
and/or,
the clothing information comprises data information of the clothing; the detecting the clothing of the video frame and determining the clothing information of the clothing corresponding to the target character further comprises: determining a target retrieval strategy in a first retrieval strategy and a second retrieval strategy based on the data volume of a preset clothing library; the data volume corresponding to the first retrieval strategy is higher than the data volume corresponding to the second retrieval strategy; the preset clothing library comprises a plurality of preset clothing and data information corresponding to each preset clothing; determining at least one preset clothing matched with the clothing in the plurality of preset clothing by using the target retrieval strategy; and determining the data information corresponding to each preset dress matched with the dress as the data information of the dress.
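Claim 9 leaves both the enhancement operations and the data-volume cut-off open; the sketch below uses a horizontal flip plus a brightness shift as illustrative enhancements and a purely hypothetical library-size threshold for choosing between the two retrieval strategies.

```python
import cv2

def augment_clothing_picture(picture_bgr):
    """Produce similar clothing pictures by simple data enhancement.

    A horizontal flip and a brightness shift are illustrative enhancement
    operations; the claim does not name which enhancements are used.
    """
    flipped = cv2.flip(picture_bgr, 1)                               # mirror image
    brighter = cv2.convertScaleAbs(picture_bgr, alpha=1.0, beta=20)  # +20 brightness
    return [flipped, brighter]

def choose_retrieval_strategy(preset_library_size, threshold=100_000):
    """Pick the retrieval path from the size of the preset clothing library.

    'first' = quantization-coding retrieval (larger data volume),
    'second' = overall + local feature comparison (smaller data volume).
    The 100 000-item threshold is purely illustrative.
    """
    return "first" if preset_library_size >= threshold else "second"
```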
10. The method of claim 9, wherein in the case that the target retrieval policy is the first retrieval policy, the determining, with the target retrieval policy, at least one preset garment matching the garment among the plurality of preset garments comprises:
determining a clothing feature vector corresponding to the clothing picture based on the clothing picture of the clothing;
determining a target first central feature matched with the clothing feature vector in a plurality of first central features corresponding to the plurality of preset clothing; the plurality of first central features are determined after clustering a plurality of preset clothing vectors in the preset clothing library, and each first clustering result obtained by clustering corresponds to one first central feature;
based on the target first central feature, carrying out quantitative coding on the clothing feature vector to obtain clothing codes;
determining a preset clothing corresponding to a preset clothing code corresponding to the target coding result as at least one preset clothing matched with the clothing; the target coding result is at least one quantization coding result matched with the clothing code in a plurality of quantization coding results corresponding to the target first central feature; the method for carrying out quantitative coding on the preset clothing vector corresponding to the target first central feature is the same as the method for carrying out quantitative coding on the clothing feature vector.
11. The method of claim 10, wherein the quantization encoding the clothing feature vector based on the target first center feature to obtain clothing encoding comprises:
based on different feature positions, performing intra-feature splitting on the clothing feature vector to obtain clothing sub-vectors corresponding to each feature position corresponding to the clothing feature vector;
based on the second central feature of each feature position corresponding to the target first central feature, carrying out quantization coding on the clothing sub-vector corresponding to each feature position to obtain a quantization value of the clothing sub-vector corresponding to each feature position; the second central feature is determined after clustering preset subvectors of a plurality of feature positions corresponding to the target first central feature, and each second clustering result obtained by clustering corresponds to one second central feature;
and determining the clothing code based on the quantized value of the clothing sub-vector corresponding to each characteristic position.
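Claims 10 and 11 describe a two-level scheme close to inverted-file product quantization: a coarse match against the first central features, then per-position coding of the clothing sub-vectors against the second central features attached to the selected coarse centre. The sketch below assumes L2 distance and a feature dimension divisible by the number of splits.

```python
import numpy as np

def quantize_clothing_vector(feat, coarse_centers, sub_codebooks, num_splits=4):
    """Two-level quantization coding of a clothing feature vector.

    coarse_centers: (K, D) array of first central features.
    sub_codebooks: for each coarse centre k, a list of num_splits arrays of
    shape (M, D // num_splits) holding the second central features for each
    feature position. The values of K, M and num_splits, the use of L2
    distance, and D being divisible by num_splits are assumptions.
    """
    # Coarse step: find the target first central feature for the query vector.
    coarse_id = int(np.argmin(np.linalg.norm(coarse_centers - feat, axis=1)))

    # Fine step: split by feature position and encode each clothing sub-vector
    # against the second central features attached to the selected coarse centre.
    code = []
    for pos, sub_vec in enumerate(np.split(feat, num_splits)):
        codebook = sub_codebooks[coarse_id][pos]          # (M, D // num_splits)
        code.append(int(np.argmin(np.linalg.norm(codebook - sub_vec, axis=1))))
    return coarse_id, tuple(code)                          # the clothing code
```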
12. The method according to any one of claims 9 to 11, wherein in a case that the target retrieval policy is the second retrieval policy, the determining, by using the target retrieval policy, at least one preset clothing matching the clothing among the plurality of preset clothing includes:
determining a whole characteristic vector and at least one local characteristic vector corresponding to the clothing picture based on the clothing picture of the clothing;
for each preset clothing among the plurality of preset clothing, determining a first similarity between the preset overall vector corresponding to the preset clothing and the whole characteristic vector, and determining a second similarity between each preset local vector corresponding to the preset clothing and each local characteristic vector; and determining a preset similarity between the preset clothing and the clothing based on the first similarity and the at least one second similarity;
and determining at least one preset clothing matched with the clothing in the plurality of preset clothing based on the preset similarity corresponding to each preset clothing.
13. The method according to any one of claims 1 to 12, wherein the clothing information of the target person includes detection box information and clothing category corresponding to each clothing in the video frame; the correcting the detection results of the video frames in the same partition based on the partition information corresponding to each video frame to obtain the target detection result corresponding to each video frame includes:
for each partition, classifying each clothing in each video frame within the partition to obtain at least one detection category; wherein the detection box information of the clothing corresponding to the same detection category in the corresponding at least one video frame to be corrected satisfies a preset overlapping condition;
and for each clothing, determining a target clothing category corresponding to the clothing based on the clothing category corresponding to the clothing in each video frame to be corrected.
14. The method of claim 13, wherein the determining a target clothing category corresponding to the clothing based on the clothing category corresponding to the clothing in each of the video frames to be corrected comprises:
acquiring the quality of a video frame corresponding to each video frame to be corrected;
determining a voting weight corresponding to each video frame to be corrected based on the quality of the video frame corresponding to each video frame to be corrected; the voting weight corresponding to the video frame to be corrected is positively correlated with the quality of the video frame corresponding to the video frame to be corrected;
and determining a target clothing category corresponding to the clothing based on the voting weight corresponding to each video frame to be corrected and the clothing category corresponding to the clothing in each video frame to be corrected.
15. The method according to claim 14, wherein said obtaining the video frame quality corresponding to each of the video frames to be corrected comprises:
for each video frame to be corrected, determining a clothing region corresponding to the video frame to be corrected based on the detection frame information corresponding to the clothing;
determining the quality of a video frame corresponding to the video frame to be corrected based on the video frame to be corrected and the clothing area;
wherein the video frame quality comprises at least one of: the occlusion degree of the clothing, the sharpness of the clothing region corresponding to the clothing, and the brightness of the clothing region corresponding to the clothing.
16. The method of claim 14 or 15, wherein the video frame quality comprises quality information for at least one quality dimension, and the clothing category comprises at least one clothing subcategory; the determining the voting weight corresponding to each video frame to be corrected based on the video frame quality corresponding to each video frame to be corrected comprises:
for each clothing subcategory, determining a voting sub-weight corresponding to the clothing subcategory based on the degree of association between the clothing subcategory and each quality dimension and the quality information of each quality dimension;
the determining a target clothing category corresponding to the clothing based on the voting weight corresponding to each video frame to be corrected and the clothing category corresponding to the clothing in each video frame to be corrected comprises:
for each of the clothing subcategories, determining a target clothing subcategory corresponding to the clothing based on the voting sub-weight corresponding to the clothing subcategory and the clothing subcategory corresponding to the clothing in each of the video frames to be corrected.
17. The method according to any one of claims 1 to 16, further comprising:
playing the video to be processed through a clothing display interface;
and in the process of displaying a target video frame in the video to be processed, displaying the character information of the target character and the clothing information corresponding to the target character in the target video frame.
18. The method of claim 17, wherein said presenting the person information of the target person and the apparel information corresponding to the target person in the target video frame comprises:
displaying the character information of the target character in the target video frame through a character display area in the clothing display interface;
displaying the clothing information corresponding to the target character in the target video frame through a clothing display area in the clothing display interface; the clothing information comprises a local clothing picture and a clothing category corresponding to each clothing in the target video frame.
19. A video apparel detection device, comprising:
the device comprises a partitioning module, a processing module and a processing module, wherein the partitioning module is used for partitioning a plurality of video frames of a video to be processed and determining the partition information corresponding to each video frame;
the detection module is used for detecting each video frame to obtain a detection result corresponding to a target person in each video frame; the detection result comprises the character information of the target character and the clothing information corresponding to the target character in the video frame;
the correction module is used for correcting the detection results of the video frames in the same partition based on the partition information corresponding to each video frame to obtain the target detection result corresponding to each video frame; and the target detection results corresponding to the video frames in the same partition are the same aiming at the detection results of the target person.
20. A computer device comprising a memory and a processor, the memory storing a computer program operable on the processor, wherein the processor implements the steps of the method of any one of claims 1 to 18 when executing the program.
CN202210716242.5A 2022-06-22 2022-06-22 Video clothing detection method, device and equipment Pending CN115049962A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210716242.5A CN115049962A (en) 2022-06-22 2022-06-22 Video clothing detection method, device and equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210716242.5A CN115049962A (en) 2022-06-22 2022-06-22 Video clothing detection method, device and equipment

Publications (1)

Publication Number Publication Date
CN115049962A true CN115049962A (en) 2022-09-13

Family

ID=83163596

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210716242.5A Pending CN115049962A (en) 2022-06-22 2022-06-22 Video clothing detection method, device and equipment

Country Status (1)

Country Link
CN (1) CN115049962A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116911954A (en) * 2023-09-12 2023-10-20 深圳须弥云图空间科技有限公司 Method and device for recommending items based on interests and popularity
CN116911954B (en) * 2023-09-12 2024-01-05 深圳须弥云图空间科技有限公司 Method and device for recommending items based on interests and popularity

Similar Documents

Publication Publication Date Title
US11288823B2 (en) Logo recognition in images and videos
EP3267362B1 (en) Machine learning image processing
US9336459B2 (en) Interactive content generation
EP3028184B1 (en) Method and system for searching images
US9116925B2 (en) Hierarchical ranking of facial attributes
WO2019001481A1 (en) Vehicle appearance feature identification and vehicle search method and apparatus, storage medium, and electronic device
CN105518678B (en) Searching method, searcher and user equipment
CN111696080B (en) Face fraud detection method, system and storage medium based on static texture
CN109582813B (en) Retrieval method, device, equipment and storage medium for cultural relic exhibit
CN111814817A (en) Video classification method and device, storage medium and electronic equipment
CN107977948B (en) Salient map fusion method facing community image
CN113766330A (en) Method and device for generating recommendation information based on video
CN115443490A (en) Image auditing method and device, equipment and storage medium
TWI781554B (en) Method of determining item name of object, device, computer equipment and storage medium
US20210182566A1 (en) Image pre-processing method, apparatus, and computer program
CN111581423A (en) Target retrieval method and device
CN111738120B (en) Character recognition method, character recognition device, electronic equipment and storage medium
CN111767420A (en) Method and device for generating clothing matching data
CN113963303A (en) Image processing method, video recognition method, device, equipment and storage medium
CN110413825B (en) Street-clapping recommendation system oriented to fashion electronic commerce
CN107622071B (en) Clothes image retrieval system and method under non-source-retrieval condition through indirect correlation feedback
CN115049962A (en) Video clothing detection method, device and equipment
CN110196917A (en) Personalized LOGO format method for customizing, system and storage medium
CN113705310A (en) Feature learning method, target object identification method and corresponding device
CN113495987A (en) Data searching method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination