CN117523625A - Video character recognition method, device, equipment and storage medium - Google Patents

Video character recognition method, device, equipment and storage medium

Info

Publication number
CN117523625A
Authority
CN
China
Prior art keywords
face image
face
face images
image sequence
voting
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210908478.9A
Other languages
Chinese (zh)
Inventor
曾颖森
沈招益
杨思庆
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202210908478.9A
Publication of CN117523625A

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/172Classification, e.g. identification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/817Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level by voting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/98Detection or correction of errors, e.g. by rescanning the pattern or by human intervention; Evaluation of the quality of the acquired patterns
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/41Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/49Segmenting video sequences, i.e. computational techniques such as parsing or cutting the sequence, low-level clustering or determining units such as shots or scenes
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/161Detection; Localisation; Normalisation

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computational Linguistics (AREA)
  • Quality & Reliability (AREA)
  • Artificial Intelligence (AREA)
  • Databases & Information Systems (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Image Analysis (AREA)

Abstract

The application discloses a video person recognition method, device, equipment and storage medium, relating to the technical field of artificial intelligence. The method comprises the following steps: extracting a face image sequence from a video, the face image sequence comprising a plurality of face images; acquiring a preliminary recognition result and a quality score corresponding to each of the plurality of face images included in the face image sequence, where the preliminary recognition result indicates the candidate persons corresponding to the face image and the quality score indicates the image quality of the face image; and voting on the candidate persons corresponding to the face images according to the quality scores corresponding to the face images, thereby determining the target person corresponding to the face image sequence. Because the quality score of a face image influences the voting result of the candidate persons corresponding to that face image, both the accuracy of the identified target person and the recall rate of persons in the video are improved.

Description

Video character recognition method, device, equipment and storage medium
Technical Field
The embodiment of the application relates to the technical field of artificial intelligence, in particular to a method, a device, equipment and a storage medium for identifying video characters.
Background
Video person recognition is an important task in computer vision and artificial intelligence, involving multiple image processing techniques.
In the related art, a plurality of face images obtained from a video are generally used to identify a person in the video: a face track is formed from the extracted face images, and the recognition result of the face track is determined from the recognition results of the individual face images.
However, the accuracy of the target person determined for a track from the recognition results of the individual face images in the related art is low.
Disclosure of Invention
The embodiments of the present application provide a video person recognition method, device, equipment and storage medium, in which different face images are given different quality scores, and the quality score of a face image influences the voting weight of the preliminary recognition result corresponding to that face image when determining the target person corresponding to a face image sequence, so that the identified target person is more accurate. The technical scheme is as follows:
according to an aspect of an embodiment of the present application, there is provided a video person identification method, the method including:
extracting a face image sequence from a video, wherein the face image sequence comprises a plurality of face images;
acquiring a preliminary recognition result and a quality score corresponding to each of a plurality of face images included in the face image sequence, where the preliminary recognition result indicates the candidate persons corresponding to the face image, and the quality score indicates the image quality of the face image;
and voting on the candidate persons corresponding to the face images according to the quality scores corresponding to the face images, and determining the target person corresponding to the face image sequence.
According to an aspect of an embodiment of the present application, there is provided a video person identification apparatus, the apparatus including:
the image extraction module is used for extracting a face image sequence from the video, wherein the face image sequence comprises a plurality of face images;
the result acquisition module is used for acquiring a preliminary recognition result and a quality score corresponding to each of a plurality of face images included in the face image sequence, where the preliminary recognition result indicates the candidate persons corresponding to the face image, and the quality score indicates the image quality of the face image;
and the person determination module is used for voting on the candidate persons corresponding to the face images according to the quality scores corresponding to the face images, and determining the target person corresponding to the face image sequence.
According to an aspect of the embodiments of the present application, there is provided a computer device including a processor and a memory, the memory having stored therein a computer program that is loaded and executed by the processor to implement the above-described method.
According to an aspect of the embodiments of the present application, there is provided a computer readable storage medium having stored therein a computer program loaded and executed by a processor to implement the above-described method.
According to one aspect of embodiments of the present application, there is provided a computer program product comprising a computer program stored in a computer readable storage medium. The processor of the computer device reads the computer program from the computer-readable storage medium, and the processor executes the computer program so that the computer device performs the above-described method.
The technical scheme provided by the embodiments of the present application can bring the following beneficial effects: a face image sequence is extracted from the video, each face image sequence comprising a plurality of face images; the candidate persons corresponding to the face images are voted on according to the quality scores of the face images belonging to the same face image sequence, and the target person corresponding to the face image sequence is determined from the voting result. In other words, the recognition results of the face images are voted on according to the image quality of each face image, so that the image quality of a face image influences the voting result of its candidate persons. By taking image quality as an important factor affecting the voting result on top of the preliminary recognition results, the finally determined target person is more accurate, and the recall rate of persons in the video is higher.
Drawings
FIG. 1 is a schematic diagram of an implementation environment for an embodiment provided herein;
FIG. 2 is a schematic diagram of video character recognition results provided in one embodiment of the present application;
FIG. 3 is a flow chart of a method for video character recognition provided in one embodiment of the present application;
FIG. 4 is a schematic diagram of quality scores for face images provided in one embodiment of the present application;
FIG. 5 is a flow chart of a method for video character recognition according to another embodiment of the present application;
FIG. 6 is a schematic diagram of intra-track voting provided by one embodiment of the present application;
FIG. 7 is a flow chart of a method for video character recognition according to another embodiment of the present application;
FIG. 8 is a schematic diagram of track merging provided in one embodiment of the present application;
FIG. 9 is a schematic diagram of inter-track voting provided by one embodiment of the present application;
FIG. 10 is a block diagram of a video person identification method provided by one embodiment of the present application;
FIG. 11 is a block diagram of a video person identification apparatus provided in one embodiment of the present application;
FIG. 12 is a block diagram of a video person identification apparatus provided in another embodiment of the present application;
fig. 13 is a block diagram of a computer device according to an embodiment of the present application.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the present application more apparent, the embodiments of the present application will be described in further detail below with reference to the accompanying drawings.
Artificial Intelligence (AI) is a theory, method, technique, and application system that uses a digital computer or a machine controlled by a digital computer to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge, and use the knowledge to obtain optimal results. In other words, artificial intelligence is a comprehensive technology of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can react in a way similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, enabling machines to perceive, reason and make decisions.
Artificial intelligence is a comprehensive discipline covering a wide range of fields, involving both hardware-level and software-level technologies. Basic artificial intelligence technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operation/interaction systems, mechatronics, and the like. Artificial intelligence software technologies mainly include computer vision, speech processing, natural language processing, and machine learning/deep learning.
Computer Vision (CV) is a science that studies how to make machines "see"; more specifically, it replaces human eyes with cameras and computers to identify and measure targets and perform other machine vision tasks, and further performs graphic processing so that the computer produces images more suitable for human observation or for transmission to instruments for detection. As a scientific discipline, computer vision studies related theories and technologies in an attempt to build artificial intelligence systems that can acquire information from images or multidimensional data. Computer vision techniques typically include image processing, image recognition, image semantic understanding, image retrieval, OCR (Optical Character Recognition), video processing, video semantic understanding, video content/behavior recognition, three-dimensional object reconstruction, 3D techniques, virtual reality, augmented reality, simultaneous localization and mapping, and the like, as well as common biometric recognition techniques such as face recognition and fingerprint recognition.
Machine Learning (ML) is a multi-field interdiscipline involving probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory, and other disciplines. It studies how a computer can simulate or implement human learning behavior to acquire new knowledge or skills and reorganize existing knowledge structures to continuously improve its own performance. Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent; it is applied throughout all areas of artificial intelligence. Machine learning and deep learning typically include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and learning from demonstration.
With the research and advancement of artificial intelligence technology, it is being researched and applied in various fields, such as smart homes, smart wearable devices, virtual assistants, smart speakers, smart marketing, unmanned driving, autonomous driving, drones, robots, smart healthcare and smart customer service; it is believed that with the development of technology, artificial intelligence will be applied in more fields and play an increasingly important role.
The technical scheme provided by the embodiments of the present application relates to artificial intelligence technologies such as computer vision, and is specifically described through the following embodiments.
Before describing embodiments of the present application, in order to facilitate understanding of the present solution, terms appearing in the present solution are explained below.
MOT (Multiple Object Tracking) technique: a single video is acquired and split into discrete frames at a specific frame rate (fps); which objects are present in each frame is detected, the position of each object in each frame is recorded, and object images in different frames are associated as belonging to the same target object or to different target objects. In the face recognition field, the general workflow of an MOT algorithm is: (1) given the original frames of a video; (2) run an object detector to obtain bounding boxes of the face images; (3) for each detected face image, compute different features, typically visual and motion features; (4) next, a similarity calculation step computes the probability that two face images belong to the same target person; (5) finally, an association step assigns a numeric identifier to each target person.
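As an illustration of the association idea, the following is a minimal sketch, not the patent's implementation (which may also use appearance features): it greedily extends each existing face track with the new detection whose bounding box overlaps the track's last box the most. The function names and the IoU threshold are assumptions for this example.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    x1 = max(box_a[0], box_b[0]); y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2]); y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def associate(tracks, detections, iou_threshold=0.3):
    """tracks: {track_id: last_box}; detections: list of boxes in the new frame."""
    assignments, used = {}, set()
    for track_id, last_box in tracks.items():
        best_i, best_score = None, iou_threshold
        for i, det in enumerate(detections):
            score = iou(last_box, det)
            if i not in used and score > best_score:
                best_i, best_score = i, score
        if best_i is not None:
            assignments[track_id] = best_i   # detection continues this track
            used.add(best_i)
    return assignments  # unmatched detections would start new tracks
```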
Clustering: dividing a data set into different classes or clusters according to a specific criterion (such as distance), so that data objects within the same cluster are as similar as possible and data objects in different clusters differ as much as possible; after clustering, data of the same class are gathered together as much as possible, and data of different classes are separated as much as possible. Common clustering methods include K-Means clustering, mean-shift clustering, clustering with a Gaussian mixture model, and the like; the clustering method is not limited in this application.
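A minimal K-Means example with scikit-learn, assuming 512-dimensional face features have already been extracted; the random stand-in features, the cluster count and the library choice are all assumptions, since the patent does not prescribe any of them:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
features = rng.random((20, 512))   # stand-in for 20 extracted face feature vectors

# partition the 20 faces into 3 clusters; faces sharing a label are treated
# as belonging to the same class (e.g. the same person)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(features)
```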
Referring to fig. 1, a schematic diagram of an implementation environment of an embodiment of the present application is shown. The implementation environment of the scheme can comprise: a terminal device 10 and a server 20.
The terminal device 10 includes, but is not limited to, a mobile phone, a tablet computer, an intelligent voice interaction device, a game console, a wearable device, a multimedia playing device, a PC (Personal Computer), a vehicle-mounted terminal, an intelligent home appliance, and the like. A client of a target application can be installed in the terminal device 10.
In the embodiment of the present application, the target application may be any application that can provide a video streaming content service. Typically, the application is a video-type application. Of course, streaming content services may be provided in other types of applications besides video-type applications, for example, news applications, social applications, interactive entertainment applications, browser applications, shopping applications, content sharing applications, Virtual Reality (VR) applications, Augmented Reality (AR) applications, and the like, which are not limited in the embodiments of the present application. In addition, for different applications, the pushed videos may differ and the corresponding functions may differ, which may be configured in advance according to actual requirements; this is not limited in the embodiments of the present application. Optionally, the terminal device 10 runs a client of the above application. In some embodiments, the streaming content service covers various vertical contents such as variety, movies, news, finance, sports, entertainment and games, and the user can enjoy content services in various forms such as articles, pictures, small videos, short videos, live broadcasts, topics and columns through the streaming content service.
The server 20 is used to provide background services for clients of the target application in the terminal device 10. For example, the server 20 may be an independent physical server, a server cluster or distributed system formed by a plurality of physical servers, or a cloud server providing cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDNs (Content Delivery Networks), and basic cloud computing services such as big data and artificial intelligence platforms, but is not limited thereto.
The terminal device 10 and the server 20 can communicate with each other via a network. The network may be a wired network or a wireless network.
In the method provided by the embodiment of the application, the execution subject of each step may be a computer device. The computer device may be any electronic device having the capability of storing and processing data. For example, the computer device may be the server 20 in fig. 1, the terminal device 10 in fig. 1, or another device other than the terminal device 10 and the server 20.
Referring to fig. 2, a schematic diagram of a video character recognition result provided in an embodiment of the present application is shown. In the figure, 200 is one of image frames of a video, 201 is one face image obtained from the image frame 200, and 202 is a recognition result "star X" obtained based on the face image 201.
In the related art, the method generally comprises the steps of face detection, face feature extraction, face indexing, data association and result synthesis. First, all faces in a video frame are detected by face detection, a fixed-length face feature is extracted from each detected face, and face indexing is performed in a pre-built face feature library to obtain a preliminary recognition result; then the faces are associated using face appearance information or motion information to form motion tracks of the faces in the video; finally, all face recognition results within each motion track are synthesized as the final recognition result of the track. The data association step is generally implemented with the MOT technique or a clustering method. In video face recognition, the MOT technique and the clustering method associate multiple faces in a video sequence, and the recognition results of the multiple faces are finally synthesized by voting, improving the accuracy and recall rate of face recognition; nevertheless, these methods still have several problems.

On the one hand, the related art does not distinguish face images by quality. Because of differences in sharpness, resolution, motion blur, illumination and the like, faces of different quality exist in a video, and faces of different quality pose different challenges to the accuracy and robustness of the recognition method. High-quality faces are relatively easy for the recognition method, and the accuracy and precision of the results output by the model are high; low-quality faces are the opposite. When the related art synthesizes the results of a track, the voting weight of every face is kept the same, ignoring the effect of face quality differences on accuracy, so the index results of low-quality faces corrupt the synthesized voting result.

On the other hand, picture cuts, target occlusion and the like often occur in video, especially in videos of types such as movies and dramas; these cases cause the motion track of the same person to be broken into a plurality of scattered tracks in the data association step of the related art. The sample diversity available when synthesizing the votes of a single track is thereby reduced, which is unfavorable for the voting synthesis step.
In contrast, the technical scheme provided by the embodiments of the present application can distinguish high-quality faces from low-quality faces, improving the accuracy of the voting synthesis. First, the face quality of all faces in a track is estimated, and a threshold is set to distinguish high-quality faces from low-quality faces; subsequently, the voting weight of high-quality faces is increased and the voting weight of low-quality faces is reduced, thereby alleviating the adverse effect of low-quality faces and finally improving the accuracy of the voting synthesis.
In addition, the technical scheme provided by the embodiments of the present application can associate a plurality of tracks and vote again across them, increasing the sample diversity of result synthesis and improving the recognition recall rate: the scattered tracks are re-clustered to form more complete motion tracks. Meanwhile, a cascade voting mechanism is provided, in which the voting synthesis is divided into two steps, intra-track voting and inter-track voting; this increases the number of samples referenced by the voting synthesis, so that more faces are recognized and recalled.
The technical solutions provided in the present application will be described in detail through several embodiments. The present application provides a video face recognition method that builds star recognition capability for video types such as movies, television dramas, variety shows and cartoons, outputting the face positions and star information of each video frame. The face recognition data is finally stored as video structured data, which is applied to the editing of a target star's personal videos, the editing of videos of similar stars, and the like, so that videos of a target star can be clipped independently from a full video, or the face images appearing in video frames can be labeled in time to prompt viewers.
Referring to fig. 3, a flowchart of a video person identification method according to an embodiment of the present application is shown. The subject of execution of the steps of the method may be a computer device. The method may comprise at least one of the following steps (310-330):
step 310, a sequence of face images is extracted from the video, the sequence of face images including a plurality of face images.
The video type is not limited: it can be post-processed video such as movies, television shows and cartoons, road footage shot by road cameras, face footage shot by home cameras, and the like; any video consisting of continuous frames falls within the protection scope of the present application.
In some embodiments, the video is decoded to obtain video frames with a timing relationship. Optionally, a portion of the video frames is extracted, and face images are extracted from those frames. Optionally, video frames are extracted one after another (the first frame, the second frame, the third frame, and so on). Alternatively, to reduce the processing amount and save processing cost, one frame may be extracted every n frames (n is an integer greater than 1); of course, reducing the number of extracted video frames may reduce the accuracy of the recognition result to some extent. Therefore, the frame extraction interval can be determined by weighing recognition accuracy against processing cost.
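A minimal sketch of such interval frame sampling with OpenCV (an assumed tooling choice; the patent only requires that frames be extracted at some interval n):

```python
import cv2

def sample_frames(video_path, n=5):
    """Decode the video and yield every n-th frame with its frame index."""
    cap = cv2.VideoCapture(video_path)
    index = 0
    while True:
        ok, frame = cap.read()
        if not ok:            # end of video (or decode failure)
            break
        if index % n == 0:
            yield index, frame
        index += 1
    cap.release()
```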
In some embodiments, the extracted video frames are detected to obtain the face bounding box coordinates and key point recognition results in each video frame. In some embodiments, the face bounding box is rectangular, and its coordinates may be the coordinates of the four vertices of the rectangle, or the coordinates of one vertex together with the length and width. In some embodiments, the key point recognition result may be key point coordinates, where a key point is a point capable of characterizing a face, such as the facial features. Optionally, the key point recognition result is the coordinates of at least 10 points extracted from the facial features. The key points are not limited in this application; any point capable of characterizing the face can be called a key point.
In some embodiments, the face images in each video frame may be determined from the face bounding box coordinates in that video frame; optionally, the number of face images in one video frame may be one or more. In some embodiments, each identified face image may be assigned to one of a plurality of face image sequences corresponding to previous video frames. Optionally, m (m is a positive integer) face image sequences may have been obtained from the previous i (i is a positive integer) video frames; a face image in the (i+1)-th video frame may then either be assigned to at least one of the m face image sequences, or become the first face image of an (m+1)-th face image sequence. In some embodiments, the face image sequence to which a face image belongs is determined according to the appearance information of the face image; in other embodiments, it may also be determined according to the position information of the face bounding box in the current video frame and the face bounding boxes in the previous frame. The method for judging the face image sequence to which a face image belongs is not limited in this application; it may be based on appearance information or on motion information (face bounding box information).
In some embodiments, the face boundary box coordinates and the face key point coordinates corresponding to each video frame may be obtained through a face detection model. In some embodiments, the face detection model includes, but is not limited to, at least one of RetinaFace, MTCNN, and specific detection principles are not described herein.
In the embodiments of the present application, a face image sequence may also be referred to as a face track, or simply a track, which is not limited in this application. In some embodiments, different tracks correspond to different numbers, optionally starting from 1, with each track corresponding to a number that uniquely identifies it.
Step 320, obtaining preliminary recognition results and quality scores respectively corresponding to a plurality of face images included in the face image sequence, wherein the preliminary recognition results are used for indicating candidate characters corresponding to the face images, and the quality scores are used for indicating the image quality of the face images.
Preliminary recognition result: the recognition result preliminarily estimated from the face image, comprising candidate persons and the confidence corresponding to each candidate person, where the confidence may be understood as the probability value or similarity corresponding to the candidate person. Optionally, the preliminary recognition result includes a plurality of candidate persons and the confidences respectively corresponding to them. Optionally, the preliminary recognition result of a face image is "star a, 0.99; star b, 0.88; star c, 0.76; …", where star a, star b and star c are candidate persons, 0.99 means the confidence/similarity that the face image is star a is 0.99, 0.88 means the confidence/similarity that it is star b is 0.88, and 0.76 means the confidence/similarity that it is star c is 0.76. In some embodiments, the preliminary recognition result of the face image is obtained through a face recognition model.
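One possible in-memory representation of such a result (a sketch; the field names and types are illustrative, not taken from the patent):

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class PreliminaryResult:
    # (candidate person, confidence/similarity), sorted by descending confidence
    candidates: List[Tuple[str, float]]

result = PreliminaryResult(candidates=[("star a", 0.99),
                                       ("star b", 0.88),
                                       ("star c", 0.76)])
best_person, best_confidence = result.candidates[0]   # ("star a", 0.99)
```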
Quality score: a score for the image quality of a face image. In some embodiments, the quality score of a face image is determined from information such as the brightness and sharpness of the face image, and different image-quality levels are assigned according to the quality score. In other embodiments, the quality score of a face image is determined from whether the face image can be correctly classified, and the image-quality level is assigned according to the level of the quality score. In some embodiments, image quality is proportional to the quality score: the higher the quality score, the higher the image quality and the corresponding image-quality level. In some embodiments, the image-quality level is a piecewise function of the score; e.g., face images with a quality score of 80 or more are judged high quality, and face images with a quality score below 80 are judged low quality. In some embodiments, the quality score of a face image is determined by a quality assessment model, and the quality level of the face image is determined from that score. In some embodiments, the quality assessment model is trained on a plurality of face images labeled with whether they are correctly classified: the predicted quality score is adjusted according to whether a face image is correctly classified, and the model parameters are adjusted by gradient descent, raising the quality score when the face image is correctly classified and lowering it when the face image is misclassified. A quality assessment model trained in this way can then evaluate unlabeled face images, and is used to produce the quality scores of face images in the embodiments of the present application.
Referring to fig. 4, a schematic diagram of quality scores of face images provided in one embodiment of the present application is shown, in which 41, 42, 43 and 44 are face images; the quality score corresponding to each face image, and the face quality level determined from that score, can be obtained through a quality assessment model. The face quality level reflects the image quality of the face image and is optionally divided into several levels according to image quality, for example high quality, medium quality and low quality. As shown in fig. 4, the quality scores of face image 41 and face image 42 are 80 and 82, respectively, so their image quality is high; the quality scores of face image 43 and face image 44 are 18 and 0, respectively, so their image quality is low.
In some embodiments, face quality may be evaluated using a quality assessment model as shown in fig. 4 to obtain a quality score for each face image. The quality assessment model takes aligned face images as input and, after evaluating their quality, outputs the quality score corresponding to each face image, an integer in the range 0 to 100. In some embodiments, all detected faces undergo quality estimation with the quality assessment model, and the resulting per-face quality scores are used in the subsequent track clustering and cascade voting steps; see the following embodiments. Second, high-quality faces are distinguished from low-quality faces: a certain threshold is set, faces scoring above the threshold are defined as high quality, and faces scoring below it as low quality.
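A minimal sketch of this thresholding step, using the scores from fig. 4 and a threshold of 80; treating the boundary score as high quality is an assumption of this sketch, since (as discussed later) the boundary case is a design choice:

```python
def quality_level(score, threshold=80):
    # score == threshold is counted as high quality here;
    # either choice is permitted by the description
    return "high" if score >= threshold else "low"

scores = {"face_41": 80, "face_42": 82, "face_43": 18, "face_44": 0}
levels = {name: quality_level(s) for name, s in scores.items()}
# {'face_41': 'high', 'face_42': 'high', 'face_43': 'low', 'face_44': 'low'}
```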
In some embodiments, step 320 includes at least one of the following steps (320-2-320-6, not shown).
Step 320-2: performing key point alignment according to the key point recognition result of the face image, to obtain an aligned face image.
In some embodiments, a key point recognition result of the face image is obtained according to the face detection model, wherein the key point recognition result includes coordinate information of a plurality of key points.
Key point alignment refers to aligning the key points obtained from a face image with the key points of a standard frontal face shape, i.e., converting a possibly non-frontal face image (such as a profile) into a frontal face image. In some embodiments, key point alignment converts non-frontal face images into frontal face images, which improves the accuracy of face recognition and the recognition rate of the target person.
Step 320-4: obtaining feature information corresponding to the face image from the aligned face image through a face recognition model, and determining the preliminary recognition result corresponding to the face image according to the feature information of the face image and the feature information of each object contained in a feature library.
In some embodiments, feature information of the aligned face images may be extracted according to a face recognition model. In some embodiments, the feature information is a feature vector, optionally, a 512-dimensional feature vector corresponding to the face image may be obtained from the aligned face image through a face recognition model, where the 512-dimensional feature vector may represent the face image.
In some embodiments, the feature information of the face image is compared with the feature information of each object contained in the feature library, and a preliminary recognition result of the face image is determined according to the comparison result. In some embodiments, the feature library includes a plurality of objects and feature information corresponding to the plurality of objects, respectively, wherein the plurality of objects correspond to the correct recognition result or the target person, respectively. Optionally, the feature information of the face image is compared with the feature information of each object contained in the feature library, and the object closest to the face image is determined according to the similarity between the feature information. Optionally, the similarity between the feature information includes, but is not limited to, cosine similarity, euclidean distance. In some embodiments, the preliminary recognition result of the face image is also referred to as a face index result.
In some embodiments, the 512-dimensional feature vector corresponding to the face image is normalized, its similarity with the vectors in the feature library is calculated (for example, cosine similarity), the calculated similarity results are sorted in descending order, and the objects corresponding to the top-N (N is an integer greater than 1) similarity results determine the preliminary recognition result of the face image.
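A sketch of this indexing step in NumPy (the function and variable names are illustrative; the shapes follow the 512-dimensional features described above):

```python
import numpy as np

def top_n_matches(query, library_features, library_names, n=5):
    """query: (512,) face feature; library_features: (M, 512), one row per object."""
    q = query / np.linalg.norm(query)
    lib = library_features / np.linalg.norm(library_features, axis=1, keepdims=True)
    sims = lib @ q                        # cosine similarity against every object
    order = np.argsort(-sims)[:n]         # indices of the top-N similarities
    return [(library_names[i], float(sims[i])) for i in order]
```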
Step 320-6: processing the aligned face image through a quality assessment model, to obtain the quality score corresponding to the face image.
In some embodiments, the aligned face image is input into the quality assessment model, through which the quality score corresponding to the face image can be obtained; optionally, the quality score may range from 0 to 100, or alternatively from 0 to 10. In some embodiments, the quality level corresponding to the face image may also be obtained through the quality assessment model; for example, an aligned face image is input, and the quality level of the face image is output as high quality.
In the technical scheme provided by this embodiment, the key point recognition result is extracted from the face image, the feature information of the face image is obtained based on the key point recognition result, and the preliminary recognition result of the face image is determined based on the feature information; this makes the preliminary recognition result more accurate, so that the target person determined for the face image sequence based on the preliminary recognition results is closer to the actual person in the picture.
Step 330: voting on the candidate persons corresponding to the face images according to the quality scores corresponding to the face images, and determining the target person corresponding to the face image sequence.
The number of the target characters is not limited, and may be only one or at least two.
In some embodiments, there are a plurality of face images in a face image sequence, and each face image corresponds to a preliminary recognition result, i.e., each image corresponds to a plurality of candidate persons. If the quality score of a face image is higher, its image quality is higher, and correspondingly the voting weight of the preliminary recognition result corresponding to that face image is also higher; that is, the voting weight of a face image's preliminary recognition result is proportional to the image quality of the face image. In some embodiments, the face image sequence includes four face images whose voting weights are q1, q2, q3 and q4. Taking the first face image as an example, its preliminary recognition result includes two candidate persons with confidences z1 and z2; voting on this preliminary recognition result multiplies q1 by z1 and by z2, giving the voting result for the first face image. The voting results for the recognition results of all face images in the face image sequence are then synthesized to determine the target person corresponding to the face image sequence.
According to the technical scheme provided by the embodiments of the present application, the face position information corresponding to each video frame and the face recognition information recognized from the face images can finally be obtained, where the face recognition information can be understood as track information and target person information.
The recall rate mentioned in the embodiments of the present application refers to, for the same target person, the ratio of the number of video frames recalled from a video to the number of video frames in which that target person actually appears in the video. The recall rate can also be understood as, for the same video, the ratio of the number of distinct target persons recalled from the video to the number of distinct target persons actually appearing in it.
According to the technical scheme provided by the embodiments of the present application, face image sequences are extracted from the video, each face image sequence comprising a plurality of face images; the candidate persons corresponding to the face images are voted on according to the quality scores of the face images belonging to the same face image sequence, and the target person corresponding to the face image sequence is determined from the voting result. In other words, the recognition results of the face images are voted on according to the image quality of each face image, so that the image quality of a face image influences the voting result of its candidate persons. By taking image quality as an important factor affecting the voting result on top of the preliminary recognition results, the finally determined target person is more accurate, and the recall rate of persons in the video is higher.
Referring to fig. 5, a flowchart of a video person recognition method according to another embodiment of the present application is shown. The subject of execution of the steps of the method may be a computer device. The method may comprise at least one of the following steps (310-336):
step 310, a sequence of face images is extracted from the video, the sequence of face images including a plurality of face images.
Step 322, obtaining a preliminary recognition result and a quality score respectively corresponding to a plurality of face images included in the face image sequence, where the preliminary recognition result is used to indicate candidate characters corresponding to the face images, the quality score is used to indicate image quality of the face images, and the preliminary recognition result includes: at least one candidate character corresponding to the face image, and a confidence level corresponding to the at least one candidate character respectively.
Step 332, determining voting weights corresponding to the face images according to the quality scores corresponding to the face images.
Optionally, the voting weight corresponding to a face image whose quality score is smaller than a threshold is set to a first value, and the voting weight corresponding to a face image whose quality score is greater than the threshold is set to a second value, where the second value is greater than the first value.
In some embodiments, the voting weights are divided into two tiers: the voting weight of a face image whose quality score is smaller than the threshold is set to the lower first value, and the voting weight of a face image whose quality score is greater than the threshold is set to the higher second value. Optionally, the threshold is 80, the first value is 0, and the second value is the reciprocal of the number of face images whose quality score is greater than the threshold. That is, the voting weight of face images scoring below 80 is set to the first value 0, and the voting weight of face images scoring above 80 is set to the higher second value. Optionally, there are four face images in the face image sequence, three of which have quality scores greater than 80 and one of which has a quality score less than 80; the voting weight of the face image scoring below 80 is set to 0, and the voting weights of the other three face images are each set to 1/3, where 3 is the number of face images scoring above the threshold.
In some embodiments, when the quality score is equal to the threshold, the voting weight of the face image may be set either to the first value or to the second value. When the accuracy of the voting result is prioritized, the voting weight of such a face image may be set to the first value; when the diversity of the samples participating in the voting is prioritized, it may be set to the second value.
In some embodiments, the quality score may also be used directly as the voting weight, so that the voting weight is proportional to the quality score. The voting weights are not limited to a first value and a second value; three or more voting weight levels can be determined according to differences in quality scores.
In the technical scheme provided by this embodiment, the voting weight of face images below the threshold is reduced and that of face images above the threshold is increased, so that face images with low quality scores participate in the voting as little as possible; that is, the voting on preliminary recognition results is performed on high-quality face images as far as possible, and voting based on high-quality face images improves the accuracy of the final face recognition result. Meanwhile, distinguishing voting weights by face quality makes the voting scheme novel and varied.
Step 334: determining the target confidence corresponding to each candidate person according to the voting weight corresponding to each face image, the at least one candidate person corresponding to each face image, and the confidence corresponding to each candidate person.
In some embodiments, step 334 includes at least one of the following steps (334-2-334-4, not shown).
Step 334-2: for each of the plurality of face images, multiplying the confidence of each candidate person corresponding to the face image by the voting weight corresponding to the face image, to obtain the intermediate confidence corresponding to each candidate person.
In some embodiments, a face image has candidate persons r1, r2 and r3 with confidences d1, d2 and d3 respectively (d1, d2 and d3 are all positive numbers), and the voting weight of the face image is q5 (q5 is a positive number). The confidence d1 of candidate r1 is multiplied by q5, the confidence d2 of candidate r2 is multiplied by q5, and the confidence d3 of candidate r3 is multiplied by q5, so that the intermediate confidence of candidate r1 is d1×q5, that of candidate r2 is d2×q5, and that of candidate r3 is d3×q5.
Step 334-4: adding up, for each candidate person, the intermediate confidences corresponding to that candidate person across the face images, to obtain the target confidence corresponding to each candidate person.
In some embodiments, person P is a candidate of two face images in the face image sequence, with intermediate confidence p1 in the first face image and intermediate confidence p2 in the second face image; the target confidence of candidate person P is then p1+p2 (p1 and p2 are both positive numbers).
Step 336: determining the candidate person whose target confidence meets a first condition as the target person corresponding to the face image sequence.
In some embodiments, the target confidence of each candidate person is obtained from the preliminary recognition results and the voting weights of the face images, and the candidate person whose target confidence meets the first condition is determined as the target person corresponding to the face image sequence.
In some embodiments, the first condition is having the maximum target confidence. Optionally, the candidate person corresponding to the maximum target confidence is determined as the target person corresponding to the face image sequence.
In some embodiments, the first condition is that, when the target confidences are sorted in descending order, the target confidence ranks within the top X positions (X is an integer greater than 1). Optionally, the candidate persons corresponding to the confidences in the first 5 positions are determined as the target persons corresponding to the face image sequence. The technical scheme provided by this embodiment enriches the ways of determining the target person and improves the accuracy of the determined target person.
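A consolidated sketch of steps 332 through 336 under the example scheme above (threshold 80, weight 0 for low-quality faces, equal weight 1/k shared by the k high-quality faces, and the maximum target confidence as the first condition); the data layout and function name are illustrative, not from the patent:

```python
from collections import defaultdict

def vote_track(faces, threshold=80):
    """faces: list of (quality_score, [(candidate_person, confidence), ...])."""
    high = [candidates for score, candidates in faces if score > threshold]
    if not high:
        return None, 0.0
    weight = 1.0 / len(high)                  # second value: 1/k, shared equally
    totals = defaultdict(float)
    for candidates in high:
        for person, conf in candidates:
            totals[person] += weight * conf   # sum of intermediate confidences
    target = max(totals, key=totals.get)      # first condition: the maximum
    return target, totals[target]

track = [
    (85, [("star a1", 0.92), ("star a2", 0.87)]),
    (90, [("star a1", 0.95), ("star a3", 0.84)]),
    (10, [("star a9", 0.70)]),                # low quality: weight 0, ignored
]
person, confidence = vote_track(track)        # -> ("star a1", 0.935)
```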
Referring to fig. 6, a schematic diagram of intra-track voting provided by one embodiment of the present application is shown. According to fig. 6, the method comprises at least one of the following steps (61-63).
Step 61, obtaining a preliminary recognition result and a face quality score, and distinguishing high-quality faces from low-quality faces according to a threshold value.
First, the preliminary recognition results (face index results) corresponding to the four face images are obtained; each result includes a plurality of candidate persons and the confidences corresponding to them. For example, the top-five face index results of the first face image in the figure are "star a1, 0.92; star a2, 0.87; star a3, 0.84; star a4, 0.83; star a5, 0.80", where star a1 through star a5 are the first five candidate persons and their corresponding confidences decrease in order. In the embodiments of the present application, the top-N (N is a positive integer) recognition results may be selected, or all recognition results may be selected, as the preliminary recognition result, which is not limited in this application.
Step 62: giving different weights to faces of different quality according to a certain strategy.
The quality scores of the first three of the four face images in fig. 6 are all 60 or above, so their image quality is judged high, whereas the quality score of the fourth face image is only 10, so it is judged low quality. The voting weight of each of the three high-quality face images is set to 33.3% (i.e., 1/3) and that of the low-quality face image to 0; in other words, the high-quality face images vote with equal weight, and the low-quality face image is not considered.
Step 63: performing a weighted average of the top-five recognition results of all the faces according to the weights, taking the star with the highest confidence as the recognized star of the track and that confidence as the recognition confidence of the track.
From fig. 6, it can be seen that the result of voting over the three face images is: star a1, 0.93; star a2, 0.88; star a3, 0.85; star a4, 0.82; star a6, 0.84; star a7, 0.80; star a5, 0.80; star a8, 0.80. Star a1, having the highest confidence, is taken as the recognized star of the track (also referred to as the target person), and its confidence 0.93 is taken as the recognition confidence of the track (also referred to as the target confidence).
In the technical scheme provided by this embodiment, the voting weights corresponding to the face images are differentiated by setting a threshold, and the target person of the face image sequence is determined from the products of the confidences in the preliminary recognition results and the voting weights. The voting weight of low-quality faces can thus be reduced so that they do not participate in the voting, or participate as little as possible, thereby improving the accuracy of the recognition result.
Referring to fig. 7, a flowchart of a video person recognition method according to another embodiment of the present application is shown. The subject of execution of the steps of the method may be a computer device. The method may comprise at least one of the following steps (310-370):
Step 310, a sequence of face images is extracted from the video, the sequence of face images including a plurality of face images.
Step 320, obtaining preliminary recognition results and quality scores respectively corresponding to a plurality of face images included in the face image sequence, wherein the preliminary recognition results are used for indicating candidate characters corresponding to the face images, and the quality scores are used for indicating the image quality of the face images.
And 330, voting candidate characters corresponding to the face images respectively according to the quality scores corresponding to the face images respectively, and determining target characters corresponding to the face image sequence.
Step 340, determining the representative face image corresponding to each face image sequence according to the quality scores corresponding to the face images included in that sequence, wherein a plurality of face image sequences are extracted from the video.
In some embodiments, for each face image sequence, the face image with the highest quality score in the face image sequence is determined as the representative face image corresponding to that sequence. Because the face image with the highest quality score represents the face image sequence most accurately, using it as the representative face image for clustering makes the final clustering result more accurate.
In some embodiments, the top X face images by quality score in each face image sequence may be determined as the representative face images corresponding to that face image sequence; the number of representative face images is not limited in this application.
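A minimal sketch of this selection, assuming each track is held as a list of (face_image, quality_score) pairs; the names are hypothetical.

```python
def representative_faces(track, top_x=1):
    """Return the top X faces of a track by quality score (step 340)."""
    # Sort by quality score, highest first, and keep the first top_x faces.
    ranked = sorted(track, key=lambda face: face[1], reverse=True)
    return ranked[:top_x]
```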
And 350, clustering the representative face images respectively corresponding to the face image sequences to obtain at least one cluster, wherein each cluster comprises at least one representative face image.
Clustering is carried out according to the feature information corresponding to each representative face image, based on the similarity between the pieces of feature information, to obtain at least one cluster; representative face images whose quality score is greater than a threshold value participate in the clustering, and representative face images whose quality score is less than the threshold value do not.
In some embodiments, the representative face images of some tracks may still have relatively low image quality. For accuracy, a threshold is set: representative face images below the threshold do not participate in the clustering, while those above it do, which prevents representative face images with excessively low quality scores from making the clustering result inaccurate. In some embodiments, the threshold is set to 80; that is, representative face images with a quality score greater than 80 participate in the clustering, while those with a quality score less than 80 do not. A representative face image whose quality score is exactly equal to the threshold may or may not participate in the clustering, which is not limited in this application: in some embodiments it does not participate, for the accuracy of the clustering; in other embodiments it does, for the diversity of the samples participating in the clustering.
According to the technical scheme provided by this embodiment of the application, only the representative face images are clustered. Compared with having all face images participate in the clustering, this effectively reduces the number of face images to be clustered, lowering the processing cost and overhead.
In some embodiments, the feature information is a feature vector, and the clustering is performed according to the similarity between the feature vectors of the representative face images. Optionally, the similarity measure includes, but is not limited to, cosine similarity and Euclidean distance.
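A minimal sketch of this clustering using scikit-learn's DBSCAN, assuming the feature information is one fixed-length vector per representative face; the eps value, min_samples and the default threshold of 80 are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import DBSCAN

def cluster_representatives(rep_features, rep_scores, score_threshold=80.0,
                            eps=0.3, min_samples=1):
    """Cluster the representative-face feature vectors of all sequences.

    rep_features: (n, d) array-like of feature vectors, one per sequence.
    rep_scores:   (n,) array-like of quality scores of the representatives.
    Returns one cluster label per sequence; sequences whose representative
    face is not above the quality threshold are labelled -1 and are left out.
    """
    feats = np.asarray(rep_features, dtype=float)
    scores = np.asarray(rep_scores, dtype=float)
    labels = np.full(len(feats), -1)
    keep = scores > score_threshold          # only high-quality representatives
    if keep.any():
        # DBSCAN groups features whose cosine distance is within eps.
        model = DBSCAN(eps=eps, min_samples=min_samples, metric="cosine")
        labels[keep] = model.fit_predict(feats[keep])
    return labels
```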
Step 360, merging the face image sequences to which at least one representative face image belonging to the same cluster belongs respectively to obtain a face image sequence set.
Referring to fig. 8, a schematic diagram of track merging provided by one embodiment of the present application is shown. The method shown in fig. 8 includes at least one of the following steps (81 to 83).
And step 81, screening out the representative face image of each track according to the quality scores.
The face quality scores within each face track are sorted, and the one or more faces with the highest scores are screened out as the representative face images of the corresponding track. When the highest quality score is below the predetermined quality score, no further aggregation is performed for that track. For example, in fig. 8, one representative face image is selected for each of tracks 1, 2 and 3.
And step 82, clustering the features of the representative face images using a clustering algorithm.
First, the facial feature information of each track's representative face image is acquired; a clustering algorithm (such as DBSCAN) is then applied to this feature information, dividing the faces into different face clusters according to their similarity, so that faces with high similarity are aggregated together.
As shown in fig. 8, the three representative face images are clustered into two clusters: the first cluster includes the representative face images of track 1 and track 2, and the second cluster includes only the representative face image of track 3.
And step 83, merging the tracks corresponding to the representative face images of the same cluster.
The tracks corresponding to the representative face images within each cluster are merged, yielding a more complete track of the same target person and thereby achieving a further aggregation of the tracks.
As shown in fig. 8, the track 1 and the track 2 corresponding to the representative face image of the first cluster are combined to obtain a combined track.
According to the technical scheme provided by this embodiment of the application, the dispersed tracks are combined, so that a more complete track of the same target person can be obtained; at the same time, extracting representative face images lets a subset of the face images participate in the recognition, which improves the sample diversity of the target person to a certain extent.
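The merging of step 360 can then be sketched as grouping sequences by cluster label; the data layout follows the earlier sketches and the names are hypothetical.

```python
from collections import defaultdict

def merge_tracks(tracks, labels):
    """Merge face image sequences whose representative faces share a cluster
    (step 360); label -1 marks tracks that did not participate in clustering."""
    merged = defaultdict(list)
    singles = []
    for track, label in zip(tracks, labels):
        if label == -1:
            singles.append([track])      # unclustered tracks form their own sets
        else:
            merged[label].append(track)  # same cluster -> same face image sequence set
    return list(merged.values()) + singles
```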
Step 370, determining the target person corresponding to the face image sequence set according to the target person corresponding to each face image sequence in the face image sequence set.
In some embodiments, step 370 includes at least one of the following steps (370-2 to 370-6, not shown).
Step 370-2: and determining voting weights respectively corresponding to the face image sequences in the face image sequence set.
In some embodiments, equal-weight votes are made for individual face image sequences in the set of face image sequences, i.e., the voting weights for each face image sequence are uniform.
In some embodiments, the voting weight of a face image sequence is determined according to the quality score of the representative face image of that sequence. Optionally, the quality score of the representative face image is taken as the voting weight of the sequence, or the voting weight of the sequence is made proportional to that quality score.
Step 370-4: and determining the confidence coefficient corresponding to the target person of at least one candidate according to the voting weight corresponding to each face image sequence and the confidence coefficient of the target person corresponding to each face image sequence.
The voting weights corresponding to the face image sequences are multiplied by the confidence coefficients of the target characters corresponding to the face image sequences to obtain weighted confidence coefficients corresponding to the target characters; and adding the weighted confidence degrees corresponding to the same target person to obtain the confidence degrees corresponding to the target persons of at least one candidate respectively.
Step 370-6: and determining the target person with the confidence degree meeting the second condition as the target person corresponding to the face image sequence set.
In this embodiment of the application, the second condition is similar to the first condition; since the first condition is discussed in detail in the embodiments above, the description is not repeated here. The number of target persons corresponding to the merged track or the face image sequence set is not limited in this application and may be one or more.
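A minimal sketch of steps 370-2 to 370-6, assuming each track contributes one (target person, confidence) pair; equal weights are used when no per-track weights are supplied, matching the equal-weight embodiment above. The names are assumptions for illustration.

```python
from collections import defaultdict

def inter_track_vote(track_results, weights=None):
    """Vote among the tracks of one face image sequence set.

    track_results: list of (target_person, confidence), one per track.
    weights: optional per-track voting weights; equal weights by default.
    """
    if not track_results:
        return None
    if weights is None:
        weights = [1.0 / len(track_results)] * len(track_results)
    totals = defaultdict(float)
    for (person, confidence), weight in zip(track_results, weights):
        totals[person] += confidence * weight  # weighted confidence, summed per person
    # Second condition (here: the highest combined confidence wins).
    return max(totals.items(), key=lambda kv: kv[1])
```

With equal weights this reproduces the computation shown for fig. 9 below: 0.90 × 1/2 + 0.92 × 1/2 = 0.91.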
Referring to fig. 9, a schematic diagram of inter-track voting provided by one embodiment of the present application is shown. The method shown in fig. 9 includes at least one of the following steps (91 to 94).
In this embodiment of the application, cascade voting refers to intra-track voting plus inter-track voting. In the related art, only intra-track voting is performed; there is no inter-track voting, so the final recognition result of a track is never readjusted based on an inter-track voting result.
Step 91, obtaining the voting result in the track.
Step 92, obtaining the result of track clustering.
And step 93, voting on the recognition results among the tracks within each cluster of the track clustering, so as to complete the inter-track aggregation.
As shown in fig. 9, equal-weight voting is performed among the tracks; for example, the voting result of combined track 1 is calculated as 0.90 × 1/2 + 0.92 × 1/2 = 0.91. The specific calculation method is similar to the intra-track voting and is not repeated here.
Step 94, voting among tracks is completed, and a final recognition result of the combined tracks is obtained.
As shown in fig. 9, the star with the highest voting confidence is finally taken as the star recognition result of the combined track, and that confidence is taken as the recognition confidence of the combined track. The result of the inter-track voting is the final output of the cascade voting.
In this embodiment of the application, because a face may be occluded in the video, the face's track can be interrupted. In the related art, interrupted tracks are not recombined, so when a track is broken into segments, several tracks corresponding to the same person are ultimately output. In this embodiment of the application, tracks of the same person are associated together by clustering, so that a more complete track can be formed. Voting in the related art consists only of intra-track voting with no inter-track voting, which means that if the intra-track recognition result is wrong, the output is wrong and cannot be corrected. By combining intra-track voting with inter-track voting, this application can correct errors of the intra-track voting to a certain extent and synthesizes the votes using more face track information, improving the accuracy of the voting mechanism in track recognition. In addition, faces that were not recalled in some tracks can be recalled through the inter-track voting, improving the recall rate.
Referring to fig. 10, a block diagram of a video person recognition method according to an embodiment of the present application is shown. The execution subject of each step of the method may be the terminal device 10 in the implementation environment shown in fig. 1, for example a client of the target application program, or the server 20 in that implementation environment. In the following method embodiments, for convenience of description, the execution subject of each step is simply referred to as a "computer device". The method may comprise at least one of the following steps (S1-S8).
Step S1: and decoding the video to obtain video frames with time sequence relations. In particular, to balance recognition accuracy and processing rate, video decoding may be decimated at fixed intervals to reduce the number of processed video frames, where fixed intervals of 1, 2, and 3 frames are typically employed.
Step S2: and (3) face detection, namely inputting all decoded video frame images into a face detection model, detecting faces in pictures by the model, and outputting face boundary frame coordinates and face key point coordinates of each frame of picture. The face detection model may adopt, but is not limited to, retinaFace, MTCNN and the like.
Step S3: face feature extraction and face indexing. Firstly, according to the boundary frame coordinates and the key point coordinates of the detected face in the second step, the face image is intercepted and the face is deformed to realize face alignment. And then, inputting the aligned face image into a face recognition model to obtain face features with fixed length, and taking the face features as appearance characterization of the face image. Finally, face indexing is carried out in the constructed face feature base of the star, the extracted face features are utilized to be compared with features in the base, star information of features with highest similarity in the base is used as a face index result of the face image, and the similarity is used as confidence of the index. And setting a fixed threshold, and indexing the faces with confidence coefficient larger than the threshold as the star faces and indexing the faces with confidence coefficient smaller than the threshold as the non-star faces. The face recognition model can be, but not limited to, cosFace, arcFace model, and the face index can be, but not limited to, faiss tool.
Step S4: and (5) face quality assessment. And (3) inputting the face image with the face aligned in the third step into a face quality model, evaluating the image quality by the model, and outputting the quality score of each face, wherein the face quality and the quality score are positively correlated. Models such as, but not limited to, EQFace, etc. may be used herein.
Step S5: a human face track. And carrying out data association on the detected faces with the time sequence relationship to obtain a series of face tracks, recording the state of a person in the video by each track, and simultaneously keeping the boundary box, the face characteristics and the face index result of each face in the track. Methods employed herein include, but are not limited to, deep sort et al.
Step S6: and (5) clustering the tracks. And screening a certain number of faces from each track according to a certain strategy, carrying out clustering operation on all the screened faces, and merging tracks corresponding to the faces in the same cluster to obtain a more complete motion track. Clustering methods employed herein include, but are not limited to, DBSCAN and the like.
Step S7: cascading votes, the links comprising intra-track votes and inter-track votes. And voting in the track is to integrate the identification results of each track in the fifth step to obtain the star identification result and the identification confidence of the track. And in the inter-track voting step, the identification results of tracks after the track clustering is completed in the sixth step are integrated, and the integrated results of the intra-track voting are further aggregated to obtain a final identification result.
Step S8: and outputting a final recognition result and storing the final recognition result in the video structural information.
The following are device embodiments of the present application, which may be used to perform method embodiments of the present application. For details not disclosed in the device embodiments of the present application, please refer to the method embodiments of the present application.
Referring to fig. 11, a block diagram of a video character recognition apparatus according to an embodiment of the present application is shown. The apparatus 900 may include: an image extraction module 910, a result acquisition module 920, and a person determination module 930.
The image extraction module 910 is configured to extract a face image sequence from a video, where the face image sequence includes a plurality of face images.
The result obtaining module 920 is configured to obtain preliminary recognition results and quality scores, which respectively correspond to a plurality of face images included in the face image sequence, where the preliminary recognition results are used to indicate candidate characters corresponding to the face images, and the quality scores are used to indicate image quality of the face images.
The person determining module 930 is configured to vote on candidate persons corresponding to the face images according to the quality scores corresponding to the face images, and determine a target person corresponding to the face image sequence.
In some embodiments, the preliminary identification result includes: at least one candidate character corresponding to the face image, and confidence degrees respectively corresponding to the at least one candidate character.
In some embodiments, as shown in FIG. 12, the person determination module 930 includes a weight determination unit 932, a confidence determination unit 934 and a person determination unit 936.
The weight determining unit 932 is configured to determine voting weights corresponding to the face images respectively according to quality scores corresponding to the face images respectively.
The confidence determining unit 934 is configured to determine, according to the voting weights respectively corresponding to the face images, at least one candidate person respectively corresponding to the face images, and the confidence degrees respectively corresponding to the candidate persons, a target confidence degree respectively corresponding to the candidate persons.
The person determining unit 936 is configured to determine a candidate person whose target confidence degree satisfies a first condition as a target person corresponding to the face image sequence.
In some embodiments, the weight determining unit 932 is configured to set a voting weight corresponding to the face image with the quality score smaller than a threshold value to a first value; setting voting weights corresponding to the face images with the quality scores larger than a threshold value as second numerical values; wherein the second value is greater than the first value.
In some embodiments, the confidence determining unit 934 is configured to multiply, for each face image of the plurality of face images, a confidence level corresponding to at least one candidate person corresponding to the face image with a voting weight corresponding to the face image, to obtain an intermediate confidence level corresponding to the at least one candidate person.
The confidence coefficient determining unit 934 is configured to add the intermediate confidence coefficients corresponding to the candidate persons corresponding to the face images, so as to obtain the target confidence coefficients corresponding to the candidate persons.
In some embodiments, as shown in fig. 12, the result acquisition module 920 includes a keypoint alignment unit 922, a result determination unit 924 and a score determination unit 926.
The key point alignment unit 922 is configured to perform key point alignment according to the key point recognition result of the face image, so as to obtain an aligned face image.
The result determining unit 924 is configured to obtain feature information corresponding to the face image from the aligned face image through a face recognition model, and determine a preliminary recognition result corresponding to the face image according to the feature information corresponding to the face image and feature information of each object included in a feature library.
The score determining unit 926 is configured to process the aligned face image through a quality evaluation model, so as to obtain a quality score corresponding to the face image.
In some embodiments, a plurality of face image sequences are extracted from the video.
In some embodiments, as shown in fig. 12, the apparatus further comprises: a representative image determination module 940, an image clustering module 950, and a merging module 960.
The representative image determining module 940 is configured to determine, according to quality scores corresponding to a plurality of face images included in each of the face image sequences, representative face images corresponding to each of the face image sequences.
The image clustering module 950 is configured to cluster the representative face images corresponding to the face image sequences respectively to obtain at least one cluster, where each cluster includes at least one representative face image.
The merging module 960 is configured to merge face image sequences to which at least one representative face image belonging to the same cluster belongs respectively, so as to obtain a face image sequence set.
The person determining module 930 is further configured to determine a target person corresponding to the face image sequence set according to the target person corresponding to each face image sequence in the face image sequence set.
In some embodiments, the representative image determining module 940 is configured to determine, for each face image sequence, a face image with the largest quality score in the face image sequence as a representative face image corresponding to the face image sequence.
In some embodiments, the image clustering module 950 is configured to cluster according to feature information corresponding to each of the representative face images, based on similarity between the feature information, to obtain at least one cluster; wherein, the representative face images with the quality scores larger than the threshold value participate in the clustering, and the representative face images with the quality scores smaller than the threshold value do not participate in the clustering.
In some embodiments, the weight determining unit 932 is configured to determine a voting weight corresponding to each of the face image sequences in the face image sequence set.
The confidence determining unit 934 is configured to determine a confidence level corresponding to each of the target persons of the at least one candidate according to the voting weights corresponding to each of the face image sequences and the confidence levels of the target persons corresponding to each of the face image sequences.
The person determining unit 936 is configured to determine a target person whose confidence degree satisfies a second condition as a target person corresponding to the face image sequence set.
In some embodiments, the confidence determining unit 934 is configured to multiply the voting weights corresponding to the face image sequences respectively with the confidence degrees of the target persons corresponding to the face image sequences respectively, to obtain weighted confidence degrees corresponding to the target persons respectively; and adding the weighted confidence degrees corresponding to the same target person to obtain the confidence degrees corresponding to the target persons of at least one candidate respectively.
It should be noted that, when the apparatus provided in the foregoing embodiments implements its functions, the division into the above functional modules is only an example; in practical applications, the functions may be allocated to different functional modules as required, that is, the internal structure of the device may be divided into different functional modules to implement all or part of the functions described above. In addition, the apparatus embodiments and the method embodiments provided above belong to the same concept; the specific implementation of the apparatus is detailed in the method embodiments and is not repeated here.
Referring to FIG. 13, a block diagram of a computer device 2100 is provided in accordance with one embodiment of the present application.
In general, the computer device 2100 includes: a processor 2101 and a memory 2102.
The processor 2101 may include one or more processing cores, for example a 4-core or 8-core processor. The processor 2101 may be implemented in hardware in at least one of a DSP (Digital Signal Processor), an FPGA (Field Programmable Gate Array) and a PLA (Programmable Logic Array). The processor 2101 may also include a main processor and a coprocessor: the main processor, also called a CPU (Central Processing Unit), processes data in the awake state; the coprocessor is a low-power processor for processing data in the standby state. In some embodiments, the processor 2101 may integrate a GPU (Graphics Processing Unit) responsible for rendering and drawing the content to be displayed on the display screen. In some embodiments, the processor 2101 may also include an AI processor for processing computing operations related to machine learning.
Memory 2102 may include one or more computer-readable storage media, which may be non-transitory. Memory 2102 may also include high-speed random access memory, as well as non-volatile memory, such as one or more magnetic disk storage devices, flash memory storage devices. In some embodiments, a non-transitory computer readable storage medium in memory 2102 is used to store a computer program configured to be executed by one or more processors to implement the video person identification method described above.
Those skilled in the art will appreciate that the architecture shown in fig. 13 is not limiting as to the computer device 2100, and may include more or fewer components than shown, or may combine certain components, or employ a different arrangement of components.
In an exemplary embodiment, a computer readable storage medium is also provided, in which a computer program is stored which, when executed by a processor, implements the above video person identification method.
Alternatively, the computer-readable storage medium may include: ROM (Read-Only Memory), RAM (Random Access Memory), an SSD (Solid State Drive), an optical disc, or the like. The random access memory may include ReRAM (Resistive Random Access Memory) and DRAM (Dynamic Random Access Memory).
In an exemplary embodiment, a computer program product is also provided, the computer program product comprising a computer program stored in a computer readable storage medium. A processor of a computer device reads the computer program from the computer-readable storage medium, and the processor executes the computer program so that the computer device executes the video person recognition method described above.
It should be understood that references herein to "a plurality" are to two or more. "And/or" describes an association relationship between associated objects and indicates that three relationships may exist; for example, "A and/or B" may indicate: A exists alone, A and B exist together, or B exists alone. The character "/" generally indicates that the associated objects before and after it are in an "or" relationship. In addition, the step numbers described herein merely exemplify one possible execution sequence among the steps; in some other embodiments, the steps may be executed out of numerical order, for example two differently numbered steps may be executed simultaneously, or in an order opposite to that shown, which is not limited by the embodiments of the present application.
The foregoing description of the exemplary embodiments of the present application is not intended to limit the application to the particular embodiments disclosed; on the contrary, the intention is to cover all modifications, equivalents and alternatives falling within the spirit and scope of the application.

Claims (14)

1. A method of video character recognition, the method comprising:
extracting a face image sequence from a video, wherein the face image sequence comprises a plurality of face images;
Acquiring preliminary identification results and quality scores, which are respectively corresponding to a plurality of face images included in the face image sequence, wherein the preliminary identification results are used for indicating candidate characters corresponding to the face images, and the quality scores are used for indicating the image quality of the face images;
and voting candidate characters corresponding to the face images respectively according to the quality scores corresponding to the face images respectively, and determining target characters corresponding to the face image sequence.
2. The method of claim 1, wherein the preliminary identification result comprises: at least one candidate character corresponding to the face image and a confidence level corresponding to the at least one candidate character respectively;
voting candidate characters corresponding to the face images respectively according to the quality scores corresponding to the face images respectively, and determining target characters corresponding to the face image sequence, wherein the voting process comprises the following steps:
determining voting weights respectively corresponding to the face images according to the quality scores respectively corresponding to the face images;
according to the voting weights respectively corresponding to the face images, at least one candidate person respectively corresponding to the face images and the confidence degrees respectively corresponding to the candidate persons, determining the target confidence degrees respectively corresponding to the candidate persons;
And determining the candidate characters with the target confidence degree meeting the first condition as target characters corresponding to the face image sequence.
3. The method of claim 2, wherein determining the voting weights respectively corresponding to the face images according to the quality scores respectively corresponding to the face images comprises:
setting voting weights corresponding to the face images with the quality scores smaller than a threshold value as first numerical values;
setting voting weights corresponding to the face images with the quality scores larger than a threshold value as second numerical values;
wherein the second value is greater than the first value.
4. The method of claim 2, wherein determining the target confidence level for each of the candidate persons according to the voting weights for each of the plurality of face images, the at least one candidate person for each of the plurality of face images, and the confidence level for each of the candidate persons, comprises:
for each face image in the face images, multiplying the confidence coefficient corresponding to at least one candidate person corresponding to the face image by the voting weight corresponding to the face image to obtain the intermediate confidence coefficient corresponding to the at least one candidate person;
And adding the intermediate confidence degrees corresponding to the candidate characters respectively corresponding to the face images to obtain target confidence degrees corresponding to the candidate characters respectively.
5. The method according to claim 1, wherein the obtaining preliminary recognition results and quality scores respectively corresponding to a plurality of face images included in the face image sequence includes:
according to the key point identification result of the face image, carrying out key point alignment to obtain an aligned face image;
obtaining feature information corresponding to the face image from the aligned face image through a face recognition model, and determining a primary recognition result corresponding to the face image according to the feature information corresponding to the face image and the feature information of each object contained in a feature library;
and processing the aligned face images through a quality evaluation model to obtain quality scores corresponding to the face images.
6. The method of claim 1, wherein a plurality of face image sequences are extracted from the video, the method further comprising:
according to quality scores corresponding to a plurality of face images respectively included in each face image sequence, determining representative face images corresponding to each face image sequence respectively;
Clustering the representative face images corresponding to the face image sequences respectively to obtain at least one cluster, wherein each cluster comprises at least one representative face image;
combining face image sequences to which at least one representative face image belonging to the same cluster belongs respectively to obtain a face image sequence set;
and determining the target characters corresponding to the face image sequence set according to the target characters respectively corresponding to the face image sequences in the face image sequence set.
7. The method according to claim 6, wherein the determining the representative face image respectively corresponding to each of the face image sequences according to the quality scores respectively corresponding to the plurality of face images included in each of the face image sequences includes:
and for each face image sequence, determining the face image with the largest quality score in the face image sequence as a representative face image corresponding to the face image sequence.
8. The method according to claim 6, wherein clustering the representative face images corresponding to the face image sequences to obtain at least one cluster includes:
Clustering according to the characteristic information corresponding to each representative face image and based on the similarity between the characteristic information to obtain at least one cluster;
wherein, the representative face images with the quality scores larger than the threshold value participate in the clustering, and the representative face images with the quality scores smaller than the threshold value do not participate in the clustering.
9. The method of claim 6, wherein the determining the target person corresponding to the face image sequence set according to the target person corresponding to each face image sequence in the face image sequence set includes:
determining voting weights corresponding to the face image sequences in the face image sequence set respectively;
determining the confidence coefficient corresponding to the target characters of at least one candidate according to the voting weight corresponding to each face image sequence and the confidence coefficient of the target character corresponding to each face image sequence;
and determining the target person with the confidence degree meeting the second condition as the target person corresponding to the face image sequence set.
10. The method of claim 9, wherein determining the confidence level of each of the at least one candidate target person according to the voting weight of each of the face image sequences and the confidence level of each of the target persons corresponding to each of the face image sequences, comprises:
Multiplying the voting weight corresponding to each face image sequence with the confidence coefficient of the target person corresponding to each face image sequence to obtain the weighted confidence coefficient corresponding to each target person;
and adding the weighted confidence degrees corresponding to the same target person to obtain the confidence degrees corresponding to the target persons of at least one candidate respectively.
11. A video character recognition apparatus, the apparatus comprising:
the image extraction module is used for extracting a face image sequence from the video, wherein the face image sequence comprises a plurality of face images;
the device comprises a result acquisition module, a quality score and a recognition module, wherein the result acquisition module is used for acquiring preliminary recognition results and quality scores respectively corresponding to a plurality of face images included in the face image sequence, the preliminary recognition results are used for indicating candidate characters corresponding to the face images, and the quality scores are used for indicating the image quality of the face images;
and the character determining module is used for voting candidate characters respectively corresponding to the face images according to the quality scores respectively corresponding to the face images, and determining target characters corresponding to the face image sequence.
12. A computer device comprising a processor and a memory, the memory having stored therein a computer program that is loaded and executed by the processor to implement the method of any of claims 1 to 10.
13. A computer readable storage medium, characterized in that the computer readable storage medium has stored therein a computer program, which is loaded and executed by a processor to implement the method of any of the preceding claims 1 to 10.
14. A computer program product, characterized in that it comprises a computer program stored in a computer readable storage medium, from which a processor reads and executes the computer program to implement the method according to any one of claims 1 to 10.
CN202210908478.9A 2022-07-29 2022-07-29 Video character recognition method, device, equipment and storage medium Pending CN117523625A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210908478.9A CN117523625A (en) 2022-07-29 2022-07-29 Video character recognition method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210908478.9A CN117523625A (en) 2022-07-29 2022-07-29 Video character recognition method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN117523625A true CN117523625A (en) 2024-02-06

Family

ID=89757231

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210908478.9A Pending CN117523625A (en) 2022-07-29 2022-07-29 Video character recognition method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN117523625A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination