
Object information identification method, related device, equipment and storage medium

Info

Publication number
CN116563905A
Authority
CN
China
Prior art keywords: target, object information, information, candidate, candidate object
Prior art date
Legal status: Pending (the status listed is an assumption and is not a legal conclusion)
Application number
CN202210101999.3A
Other languages
Chinese (zh)
Inventor
俄万有
田明
琚蓓蓓
Current Assignee
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202210101999.3A
Publication of CN116563905A

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 17/00 - Speaker identification or verification techniques
    • G10L 17/02 - Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application discloses an object information identification method which relates to the fields of network media, video playing and man-machine interaction, and can be applied to vehicle-mounted scenarios. The method comprises: obtaining a target video clip corresponding to a target frame image; obtaining associated data for the target video clip, where the associated data comprises at least one of target participant information, target interaction information, target subtitle information and target audio data; determining a candidate object information set according to the associated data; and, if N objects to be identified exist in the target frame image, determining an object identification result for the target frame image according to the candidate object information set. The application also provides a related device, equipment and a storage medium. By combining the associated data, the method screens out a subset of candidate object information from a large pool of candidate object information, which narrows the range of subsequent object matching, increases the recognition speed, and improves the recognition accuracy to a certain extent.

Description

Object information identification method, related device, equipment and storage medium
Technical Field
The present disclosure relates to the field of multimedia processing technologies, and in particular, to a method, a related device, an apparatus, and a storage medium for identifying object information.
Background
In recent years, face recognition in video has become an important research hotspot in the field of face recognition. Face recognition in video has broad application prospects in the security field, intelligent identity authentication, home entertainment and other areas. Object recognition results based on video analysis are also very important for video understanding, video recommendation and the like.
Currently, video applications on the market can automatically identify objects in a video frame. The typical approach extracts the features of the faces in the video picture through image processing techniques such as deep learning, matches the specific face features one by one against a massive database, and finally identifies the corresponding object according to the similarity.
However, the inventors have found that this solution has at least the following problems. Because matching must be performed against a huge database, on the one hand the number of objects to be matched is too large, which makes the final recognition too slow; on the other hand, the overly wide matching range produces too many objects with high matching similarity, which increases the probability of misrecognition.
Disclosure of Invention
The embodiment of the application provides an object information identification method, a related device, equipment and a storage medium. According to the method and the device, a part of candidate object information is screened out from a large amount of candidate object information by combining the associated data, so that the range of subsequent object matching is narrowed, the recognition speed is improved, and the recognition accuracy can be improved to a certain extent.
In view of this, the present application provides, in one aspect, a method for identifying object information, including:
acquiring a target video segment corresponding to a target frame image, wherein the target video segment comprises the target frame image;
acquiring associated data for a target video clip, wherein the associated data comprises at least one of target participant information, target interaction information, target subtitle information and target audio data;
determining a candidate object information set according to the associated data, wherein the candidate object information set comprises M candidate object information, and M is an integer greater than or equal to 1;
if N objects to be identified exist in the target frame image, determining an object identification result aiming at the target frame image according to the candidate object information set, wherein N is an integer greater than or equal to 1.
Another aspect of the present application provides a method for identifying object information, including:
playing a target video on a video playing page, wherein the video playing page displays an object identification control;
responding to touch operation aiming at an object identification control, and sending an object identification request to a server, wherein the object identification request carries a target video identifier of a target video and a target frame moment of a target frame image;
receiving an object recognition result sent by a server and aiming at a target frame image, wherein the object recognition result is determined after the server responds to an object recognition request, and the object recognition result is obtained by adopting the methods provided by the aspects;
and displaying the object identification result on the video playing page.
Another aspect of the present application provides an object information identifying apparatus, including:
the acquisition module is used for acquiring a target video clip corresponding to the target frame image, wherein the target video clip comprises the target frame image;
the acquisition module is further used for acquiring associated data aiming at the target video clip, wherein the associated data comprises at least one of target participant information, target interaction information, target subtitle information and target audio data;
The determining module is used for determining a candidate object information set according to the associated data, wherein the candidate object information set comprises M candidate object information, and M is an integer greater than or equal to 1;
the determining module is further configured to determine an object recognition result for the target frame image according to the candidate object information set if N objects to be recognized exist in the target frame image, where N is an integer greater than or equal to 1.
In one possible design, in another implementation of another aspect of the embodiments of the present application,
the acquisition module is further used for acquiring a target video before acquiring a target video fragment corresponding to the target frame image, wherein the target video comprises a T frame image, and T is an integer greater than 1;
the determining module is also used for determining a target frame image from the target video;
the acquisition module is specifically configured to extract, from the target video, a multi-frame continuous image including the target frame image as a target video segment if T is greater than or equal to the frame number threshold;
and if the T is smaller than the frame number threshold, taking the target video as a target video fragment corresponding to the target frame image.
In one possible design, in another implementation manner of another aspect of the embodiments of the present application, the object information identifying apparatus further includes a sending module;
The acquisition module is further used for acquiring an object recognition result from a database according to the object recognition request when the object recognition request sent by the terminal is received after the object recognition result for the target frame image is determined according to the candidate object information set, wherein the object recognition request carries a target video identifier of a target video and a target frame time of the target frame image, and the database records the video identifier, the frame time corresponding to the video identifier and the object recognition result corresponding to each frame time;
and the sending module is used for sending the object identification result to the terminal so as to enable the terminal to display the object identification result.
In one possible design, in another implementation of another aspect of the embodiments of the present application,
the determining module is specifically configured to receive an object identification request sent by a terminal, where the object identification request carries a target video identifier of the target video and a target frame time of the target frame image;
responding to the object identification request, and acquiring a target frame image through a first mapping relation based on the target video identification and the target frame time, wherein the first mapping relation is used for representing the mapping relation among the video identification, the frame time and the image data;
And the sending module is used for sending the object identification result to the terminal so as to enable the terminal to display the object identification result.
In one possible design, in another implementation of another aspect of the embodiments of the present application, the associated data includes target participant information, target interaction information, target subtitle information, and target audio data;
the determining module is specifically configured to determine a first candidate object information set according to the target participant information;
determining a second candidate object information set according to the target interaction information;
determining a third candidate object information set according to the target subtitle information;
determining a fourth candidate object information set according to the target audio data;
and performing de-duplication processing on the first candidate object information set, the second candidate object information set, the third candidate object information set and the fourth candidate object information set to obtain candidate object information sets.
In one possible design, in another implementation of another aspect of the embodiments of the present application,
the acquisition module is specifically used for acquiring a target video identifier corresponding to the target video clip;
acquiring target participant information through a second mapping relation based on the target video identification, wherein the second mapping relation is used for representing the mapping relation between the video identification and the participant information;
The determining module is specifically configured to generate a first candidate object information set according to the target participant information, where the first candidate object information set is included in the candidate object information set.
In one possible design, in another implementation of another aspect of the embodiments of the present application,
the acquisition module is specifically used for acquiring a target video identifier corresponding to the target video clip;
acquiring a frame time corresponding to each frame of image in a target video segment;
acquiring target interaction information through a third mapping relation based on the target video identification and the frame time corresponding to each frame image in the target video fragment, wherein the third mapping relation is used for representing the mapping relation among the video identification, the frame time and the interaction information;
the determining module is specifically configured to generate a second candidate object information set according to the target interaction information, where the second candidate object information set is included in the candidate object information set.
In one possible design, in another implementation of another aspect of the embodiments of the present application,
the determining module is specifically used for carrying out word segmentation processing on the target interaction information to obtain a word set;
obtaining P entity words from a word set, wherein P is an integer greater than or equal to 1;
Aiming at each entity word in the P entity words, if the entity word belongs to the object entity word, the entity word is used as candidate object information in the second candidate object information set;
and for each entity word in the P entity words, if the entity word belongs to the role entity word, converting the role entity word into the object entity word, and taking the object entity word as candidate object information in the second candidate object information set.
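For illustration only, the following minimal Python sketch mirrors the processing described in this design: object entity words found in the interaction text are kept directly, and role (character) entity words are converted into object entity words. It is not the claimed implementation; the word segmenter is replaced by a simple dictionary scan, and the entity lists and the role-to-object mapping are hypothetical examples.

```python
# Minimal sketch (assumption, not the patented implementation) of turning target
# interaction information, e.g. a bullet-screen comment, into candidate object information.

OBJECT_ENTITY_WORDS = {"Li Si", "Sun Ba"}            # known actor names (hypothetical)
ROLE_TO_OBJECT = {"Iron Fan Princess": "Li Si"}      # character name -> actor name (hypothetical)

def extract_candidates_from_interaction(text: str) -> set[str]:
    """Return candidate object information found in one piece of interaction text."""
    candidates = set()
    # Naive "word segmentation": scan the text for known entity words.
    for word in OBJECT_ENTITY_WORDS:
        if word in text:                      # object entity word -> keep directly
            candidates.add(word)
    for role, actor in ROLE_TO_OBJECT.items():
        if role in text:                      # role entity word -> convert to object entity word
            candidates.add(actor)
    return candidates

if __name__ == "__main__":
    comment = "Iron Fan Princess looks great in this scene, Sun Ba too!"
    print(extract_candidates_from_interaction(comment))   # {'Li Si', 'Sun Ba'} (order may vary)
```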
In one possible design, in another implementation of another aspect of the embodiments of the present application,
the acquisition module is specifically used for acquiring a target video identifier corresponding to the target video clip;
acquiring a frame time corresponding to each frame of image in a target video segment;
acquiring target subtitle information through a fourth mapping relation based on the target video identifier and the frame time corresponding to each frame image in the target video fragment, wherein the fourth mapping relation is used for representing the mapping relation among the video identifier, the frame time and the subtitle information;
the determining module is specifically configured to generate a third candidate object information set according to the target subtitle information, where the third candidate object information set is included in the candidate object information set.
In one possible design, in another implementation of another aspect of the embodiments of the present application,
the determining module is specifically used for carrying out word segmentation processing on the target subtitle information to obtain a word set;
acquiring Q role entity words from a word set, wherein Q is an integer greater than or equal to 1;
for each of the Q persona entity terms, converting the persona entity term into an object entity term, and taking the object entity term as candidate information in the third candidate information set.
In one possible design, in another implementation of another aspect of the embodiments of the present application,
the acquisition module is specifically used for acquiring the starting frame time and the ending frame time of the target video clip;
acquiring target audio data according to the starting frame time and the ending frame time;
the determining module is specifically configured to generate a fourth candidate object information set according to the target audio data, where the fourth candidate object information set is included in the candidate object information set.
In one possible design, in another implementation of another aspect of the embodiments of the present application,
the determining module is specifically used for carrying out voiceprint recognition processing on the target audio data to obtain R dubbing entity words, wherein R is an integer greater than or equal to 1;
For each of the R soundtrack entity words, converting the soundtrack entity word into a character entity word;
and aiming at the role entity word corresponding to each dubbing entity word in the R dubbing entity words, converting the role entity word into an object entity word, and taking the object entity word as candidate object information in the fourth candidate object information set.
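As an illustration of the mapping chain described in this design (dubbing entity word to role entity word to object entity word), the following minimal Python sketch may be considered. The voiceprint recognition step is a placeholder function and both mapping tables are hypothetical examples; this is not the claimed implementation.

```python
# Minimal sketch (assumption) of mapping voiceprint recognition output to candidate object information.

DUBBING_TO_ROLE = {"Ma Liu (voice)": "General Zhao"}    # voice actor -> character (hypothetical)
ROLE_TO_OBJECT = {"General Zhao": "Ma Liu"}             # character -> on-screen actor (hypothetical)

def recognize_voice_actors(audio: bytes) -> list[str]:
    """Placeholder for voiceprint recognition; would return R dubbing entity words."""
    return ["Ma Liu (voice)"]

def candidates_from_audio(audio: bytes) -> set[str]:
    candidates = set()
    for dubbing_word in recognize_voice_actors(audio):
        role_word = DUBBING_TO_ROLE.get(dubbing_word)          # dubbing -> role
        if role_word and role_word in ROLE_TO_OBJECT:
            candidates.add(ROLE_TO_OBJECT[role_word])          # role -> object
    return candidates

if __name__ == "__main__":
    print(candidates_from_audio(b"\x00\x01"))   # {'Ma Liu'}
```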
In one possible design, in another implementation of another aspect of the embodiments of the present application,
the acquisition module is also used for acquiring an image detection result through the object detection model based on the target frame image;
the determining module is further configured to determine that N objects to be identified exist in the target frame image if the image detection result includes N detection areas, where each detection area corresponds to one object to be identified, and each detection area corresponds to a set of position parameters.
In one possible design, in another implementation of another aspect of the embodiments of the present application,
the determining module is specifically configured to obtain, based on the N detection areas, a target feature vector corresponding to each detection area in the N detection areas through the feature extraction model, so as to obtain N target feature vectors;
Based on the M candidate object information, obtaining candidate feature vectors corresponding to each candidate object information in the M candidate object information through a fifth mapping relation to obtain M candidate feature vectors, wherein the fifth mapping relation is used for representing the mapping relation between the candidate object information and the feature vectors;
and carrying out similarity calculation on the N target feature vectors and the M candidate feature vectors, and generating an object identification result aiming at the target frame image according to the similarity.
In one possible design, in another implementation manner of another aspect of the embodiments of the present application, the object information identifying apparatus further includes a receiving module and a recording module;
the receiving module is used for receiving an object correction request for the object recognition result sent by the terminal after the object recognition result for the target frame image is determined, wherein the object correction request carries a target video identifier, a target frame time and target object information, and the object recognition result comprises the object information to be corrected;
the recording module is used for responding to the object correction request, adding the target video identification, the target frame time and the target object information into a sixth mapping relation, wherein the sixth mapping relation is used for representing the mapping relation among the video identification, the frame time and the object information;
And the recording module is also used for responding to the correction instruction aiming at the target object information and updating the object information to be corrected stored in the database into the target object information.
In one possible design, in another implementation of another aspect of the embodiments of the present application,
the receiving module is used for receiving an object correction request sent by the terminal, wherein the object correction request carries a target video identifier, a target frame time, target object information and object information to be corrected;
the recording module is further used for responding to the object correction request, and if the accumulated times of the object correction request are greater than or equal to the times threshold, the object information to be corrected stored in the database is updated into target object information.
Another aspect of the present application provides an object information identifying apparatus, including:
the playing module is used for playing the target video on a video playing page, wherein the video playing page is displayed with an object identification control;
the sending module is used for responding to the touch operation aiming at the object identification control and sending an object identification request to the server, wherein the object identification request carries a target video identifier of a target video and a target frame time of a target frame image;
The receiving module is used for receiving an object identification result aiming at the target frame image, which is sent by the server, wherein the object identification result is determined after the server responds to the object identification request, and the object identification result is obtained by adopting the methods provided by the aspects;
and the display module is used for displaying the object identification result on the video playing page.
In one possible design, in another implementation of another aspect of the embodiments of the present application,
the display module is also used for displaying an error correction control after displaying the object identification result on the video playing page;
the display module is also used for responding to the touch operation aiming at the error correction control and displaying an object correction area;
the display module is also used for responding to the text input operation aiming at the object correction area and displaying target object information and object information to be corrected;
and the sending module is also used for responding to the object correction instruction and sending an object correction request to the server, wherein the object correction request carries the target video identification, the target frame time and the target object information.
In one possible design, in another implementation of another aspect of the embodiments of the present application,
The display module is also used for displaying an error correction control after displaying the object identification result on the video playing page;
the display module is also used for responding to the touch operation aiming at the error correction control and displaying an object correction area aiming at the object information to be corrected;
the display module is also used for responding to the text input operation aiming at the input area and displaying target object information and object information to be corrected;
the sending module is further configured to respond to an object correction instruction, and send an object correction request to the server, where the object correction request carries a target video identifier, a target frame time, target object information and object information to be corrected.
Another aspect of the present application provides a server, including: a memory, a processor, and a bus system;
wherein the memory is used for storing programs;
the processor is used for executing the program in the memory, and the processor is used for executing the method provided by the aspects according to the instructions in the program code;
the bus system is used to connect the memory and the processor to communicate the memory and the processor.
Another aspect of the present application provides a terminal, including: a memory, a processor, and a bus system;
Wherein the memory is used for storing programs;
the processor is used for executing the program in the memory, and the processor is used for executing the method provided by the aspects according to the instructions in the program code;
the bus system is used to connect the memory and the processor to communicate the memory and the processor.
Another aspect of the present application provides a computer-readable storage medium having instructions stored therein which, when run on a computer, cause the computer to perform the methods of the above aspects.
In another aspect of the present application, a computer program product or computer program is provided, the computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions, so that the computer device performs the methods provided in the above aspects.
From the above technical solutions, the embodiments of the present application have the following advantages:
in the embodiment of the application, a method for identifying object information is provided, first, a target video segment corresponding to a target frame image is acquired, then associated data for the target video segment is acquired, and the associated data includes at least one of target participant information, target interaction information, target subtitle information and target audio data. Based on this, a candidate object information set is determined from the associated data, and if there is at least one object to be identified in the target frame image, an object identification result for the target frame image is determined from the candidate object information set. By the method, before object detection is carried out on the target frame image, the target frame image is utilized to extract the target video segment, and then the associated data corresponding to the target video segment is queried. Based on this, a part of candidate information can be screened out from a large number of candidate information as a candidate information set in combination with the association data. Therefore, the range of the subsequent object matching is narrowed, the recognition speed is improved, and the recognition accuracy can be improved to a certain extent.
Drawings
FIG. 1 is a schematic diagram of a physical architecture of an object information recognition system according to an embodiment of the present application;
FIG. 2 is a schematic diagram of an operation architecture of an object information recognition system according to an embodiment of the present application;
fig. 3 is an interface schematic diagram of a video playing scene in the embodiment of the present application;
fig. 4 is another interface schematic diagram of a video playing scene in the embodiment of the present application;
FIG. 5 is a schematic flow chart of an object information identification method according to an embodiment of the present application;
FIG. 6 is a schematic diagram of a central control module according to an embodiment of the present application;
FIG. 7 is a flow chart of analyzing interactive information according to an embodiment of the present application;
fig. 8 is a schematic flow chart of analyzing caption information in the embodiment of the present application;
FIG. 9 is a flow chart of analyzing audio data according to an embodiment of the present application;
FIG. 10 is a schematic flow chart of object feature matching in an embodiment of the present application;
FIG. 11 is another flow chart of the object information identifying method according to the embodiment of the present application;
FIG. 12 is a schematic diagram illustrating an interface change of a video playback page according to an embodiment of the present application;
FIG. 13 is another schematic diagram of interface changes of a video playback page according to an embodiment of the present application;
Fig. 14 is another schematic diagram of interface change of a video playing page according to an embodiment of the present application;
FIG. 15 is a schematic diagram of an object information identifying apparatus according to an embodiment of the present application;
FIG. 16 is another schematic diagram of an object information identifying apparatus according to an embodiment of the present application;
FIG. 17 is a schematic diagram of a server according to an embodiment of the present application;
fig. 18 is a schematic structural diagram of a terminal in an embodiment of the present application.
Detailed Description
The embodiment of the application provides an object information identification method, a related device, equipment and a storage medium. According to the method and the device, a part of candidate object information is screened out from a large amount of candidate object information by combining the associated data, so that the range of subsequent object matching is narrowed, the recognition speed is improved, and the recognition accuracy can be improved to a certain extent.
The terms "first," "second," "third," "fourth" and the like in the description and in the claims of this application and in the above-described figures, if any, are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that embodiments of the present application described herein may be capable of operation in sequences other than those illustrated or described herein, for example. Furthermore, the terms "comprises," "comprising," and "includes" and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed or inherent to such process, method, article, or apparatus.
Face recognition is of great importance in many scenarios today. Face recognition is an algorithm that recognizes the identity corresponding to an input face image: an input face feature is compared one by one with the features corresponding to the identities in a library, and the feature with the highest similarity to the input face feature is found. The highest similarity is then compared with a preset threshold; if the similarity is greater than the threshold, the identity corresponding to that feature is returned, otherwise a "not in the library" result is returned.
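For illustration only, the following minimal Python sketch shows this baseline 1:N matching scheme. It is an assumption of this description rather than the claimed implementation; the feature vectors, library contents and threshold value are made-up examples, and cosine similarity stands in for whatever similarity measure a real system would use.

```python
# Minimal sketch: compare one query face feature against a library and apply a threshold.
import numpy as np

def identify(query: np.ndarray, library: dict[str, np.ndarray], threshold: float = 0.6) -> str:
    best_id, best_sim = "not in the library", -1.0
    q = query / np.linalg.norm(query)
    for identity, feat in library.items():
        f = feat / np.linalg.norm(feat)
        sim = float(np.dot(q, f))             # cosine similarity between query and library feature
        if sim > best_sim:
            best_id, best_sim = identity, sim
    return best_id if best_sim > threshold else "not in the library"

if __name__ == "__main__":
    lib = {"Zhang San": np.array([0.9, 0.1, 0.2]), "Li Si": np.array([0.1, 0.8, 0.3])}
    print(identify(np.array([0.88, 0.12, 0.18]), lib))   # Zhang San
```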
It will be appreciated that face recognition involves Computer Vision (CV) techniques in the field of Artificial Intelligence (AI). CV is the science of how to make machines "see": it uses cameras and computers, instead of human eyes, to recognize and measure targets, and further processes the resulting images so that the output is better suited for human observation or for transmission to downstream instruments. As a scientific discipline, CV studies related theory and technology in an attempt to build AI systems that can acquire information from images or multidimensional data. CV techniques typically include image processing, image recognition, image semantic understanding, image retrieval, optical character recognition (OCR), video processing, video semantic understanding, video content/behavior recognition, three-dimensional object reconstruction, 3D techniques, virtual reality, augmented reality, and simultaneous localization and mapping, as well as common biometric recognition techniques such as face recognition and fingerprint recognition.
AI is a theory, method, technique, and application system that utilizes a digital computer or a digital computer-controlled machine to simulate, extend, and extend human intelligence, sense the environment, acquire knowledge, and use knowledge to obtain optimal results. In other words, AI is an integrated technology of computer science that attempts to understand the essence of intelligence and to produce a new intelligent machine that can react in a similar way to human intelligence. AI is the design principle and the realization method of researching various intelligent machines, and the machines have the functions of perception, reasoning and decision.
AI technology is a comprehensive discipline, and relates to a wide range of technologies, both hardware and software. AI-based technologies generally include technologies such as sensors, dedicated AI chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, mechatronics, and the like. The AI software technology mainly comprises a computer vision technology, a voice processing technology, a natural language processing technology, machine learning/deep learning and other directions.
In order to improve the efficiency and accuracy of face recognition, the application provides an object information recognition method, which is applied to the object information recognition system shown in fig. 1. As shown in the figure, the object information recognition system comprises a server and a terminal, and a client is deployed on the terminal. The client may run on the terminal in the form of a browser, or as an independent application (APP), and the like; the specific form of the client is not limited herein. The server involved in the application may be an independent physical server, a server cluster or distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, content delivery networks (CDN), big data and artificial intelligence platforms. Terminals include, but are not limited to, mobile phones, computers, intelligent voice interaction devices, intelligent appliances, vehicle-mounted terminals, aircraft, and the like.
The terminal and the server may be directly or indirectly connected through wired or wireless communication, which is not limited herein. The number of servers and terminals is not limited. The scheme provided by the application can be independently completed by the terminal, can be independently completed by the server, and can be completed by the cooperation of the terminal and the server.
In connection with the object information recognition system shown in fig. 1, the user illustratively triggers a specific face recognition function (i.e., recognizes which object a face in a video picture is) when an APP installed on the terminal plays a video. At this time, the APP will send the target video identifier and the target frame time of the currently played video to the server through the network for processing, and obtain the processing result returned by the server. After the processing of the server is finished, the object identification result is returned to the APP through the network, and the APP displays the result after receiving the result. The APP is used for interactive operation with the user and display of the object recognition result, such as playing video, triggering the object recognition function, outputting and displaying the object recognition result, reporting active feedback of the user and the like. The APP establishes communication connection with the server through the network.
On the result display interface of the APP, a function of active feedback of the user is further set, namely an error correction control is set on the page, and when the user considers that the object recognition result is incorrect, the user can click on the error correction control and then input the recognition result. Thus, the APP will send the recognition result fed back by the user to the server via the network.
For easy understanding, referring to fig. 2, fig. 2 is a schematic diagram of an operation architecture of an object information identifying system according to an embodiment of the present application, where the object information identifying system includes two parts, namely a terminal and a server, as shown in the drawing, and an APP is run on the terminal. The server comprises 9 modules, namely a central control module, a video information storage module, a face characteristic information storage module, an identification correction information storage module, an interaction information analysis module, a subtitle analysis module, a dubbing analysis module, a face characteristic matching module and an identification correction module.
The APP forwards the object identification request to the central control module, and the central control module retrieves the current video frame image, the participant information, and the interaction information, subtitles and target dubbing audio within a period before and after the current time from the video information storage module, according to the target video identifier and the target frame time of the currently playing video. The interaction information analysis module analyzes the interaction information using natural language processing (NLP) technology to obtain object entity words. The subtitle analysis module analyzes the subtitle text using NLP technology to obtain object entity words. The dubbing analysis module analyzes the audio data using speech technology. The participant information, the interaction information, the subtitle analysis result and the dubbing audio analysis result are then combined to obtain a smaller database matching range. The face feature matching module then extracts the face features of the current video frame image through image processing methods such as deep learning, and extracts the corresponding face features from the face feature information storage module based on the obtained database matching range. The face features extracted from the current video frame image are matched for similarity against the face features extracted from the face feature information storage module, yielding the object with the highest similarity. If the highest similarity value is greater than a preset threshold, that object is taken as the recognized face result. Finally, the server returns the object identification result to the APP, and the APP displays the result on the terminal for the user to view.
It should be noted that the APP also supports active feedback from the user, and automatic correction of the object recognition result is realized through the recognition correction module. If the user considers the object recognition result to be wrong, the user can feed back the correct result, and the APP sends the feedback information to the server. The server automatically evaluates the feedback, and if the feedback is judged to be accurate it is written into the identification correction information storage module and used to correct the object recognition result in subsequent recognitions.
Among them, NLP is an important direction in the fields of computer science and AI. It studies theories and methods that enable effective communication between humans and computers in natural language. Natural language processing is a science that integrates linguistics, computer science and mathematics; research in this field involves natural language, i.e., the language people use every day, so it is closely related to the study of linguistics. Natural language processing techniques typically include text processing, semantic understanding, machine translation, question answering, knowledge graphs, and the like.
Key technologies of speech technology (Speech Technology) include automatic speech recognition (ASR), text-to-speech synthesis (TTS) and voiceprint recognition. Enabling computers to listen, see, speak and feel is a development direction of future human-computer interaction, and voice is expected to become one of the best human-computer interaction modes in the future.
Based on the above, the object information identification method provided by the application can be applied to different scenarios. For example, when a user takes an online examination, a video can be recorded for identity verification; in this case the user's voice and/or the type of examination can be combined to narrow the range of user identity matching. For another example, a video APP may provide an object recognition function, and by triggering this function a user can view information about the objects appearing in a video picture during playback. The following description is made in connection with specific scenarios.
(1) Live video scenes;
for easy understanding, referring to fig. 3, fig. 3 is an interface schematic diagram of a video playing scene in an embodiment of the present application. As shown in the figure, taking a live basketball broadcast as an example, during the game the server takes the players on court in the current game as the target participant information. The player information corresponding to the target participant information is then extracted from the player information database, which narrows the matching range of the player information. Finally, the object recognition results indicated by A1, A2 and A3 are obtained.
(2) A non-live video scene;
for easy understanding, referring to fig. 4, fig. 4 is another interface schematic diagram of a video playing scene in the embodiment of the present application, and as shown in the drawing, taking a television play as an example, in the process of playing the television play, a server extracts part of object information from a database based on cast, interactive information, subtitles, dubbing, and the like, so as to reduce the matching range of the object information. Finally, the object recognition results as indicated by B1 and B2 are obtained.
With reference to the foregoing description, the method for identifying object information in the present application will be described below from the perspective of a server, and referring to fig. 5, one embodiment of the method for identifying object information in the embodiment of the present application includes:
110. acquiring a target video segment corresponding to a target frame image, wherein the target video segment comprises the target frame image;
in one or more embodiments, the object information identifying apparatus acquires the video picture at a certain frame time in the target video (for example, the 86th frame) as the target frame image. Illustratively, when a user triggers an object recognition request (for example, by clicking an object recognition control or entering a voice trigger word), the trigger time of the object recognition request is taken as the target frame time used to determine the target frame image. Based on this, a video clip covering a period before and after the target frame image can be acquired as the target video clip according to the position of the target frame image in the target video; it can be seen that the target video clip includes the target frame image.
It may be understood that the object information identifying apparatus provided in the present application may be disposed on a server, or disposed on a terminal, or disposed on a system formed by the server and the terminal, which is not limited in this application.
120. Acquiring associated data corresponding to a target video clip, wherein the associated data comprises at least one of target participant information, target interaction information, target subtitle information and target audio data;
in one or more embodiments, the object information identifying apparatus may extract association data corresponding to the target video clip from a database, wherein the association data includes one or more of target participant information, target interaction information, target subtitle information, and target audio data.
Illustratively, taking the target video as a TV drama as an example, the target participant information may be actor information (e.g., Zhang San, Li Si). Taking the target video as a basketball game as an example, the target participant information may be player information (e.g., Jones, Thomson). The target interaction information may be subtitles, bullet-screen comments (danmaku), comments and the like corresponding to the video.
130. Determining a candidate object information set according to the associated data, wherein the candidate object information set comprises M candidate object information, and M is an integer greater than or equal to 1;
in one or more embodiments, the object information identifying apparatus acquires a candidate object information set including M candidate object information based on the associated data. The candidate object information may be actor names, player names, or the like, and is not limited herein.
140. If N objects to be identified exist in the target frame image, determining an object identification result aiming at the target frame image according to the candidate object information set, wherein N is an integer greater than or equal to 1.
In one or more embodiments, the target frame image is detected, and if no object to be identified is detected, a message of failure of object identification is directly output, or an object identification result is not output. If the N objects to be identified are detected to exist, the candidate object information set is needed to be further utilized to identify each object to be identified, and corresponding identification results are obtained respectively. And synthesizing the identification result of each object to be identified, and generating an object identification result aiming at the target frame image. Wherein the object recognition result includes at least one object information.
It is understood that the object to be identified in the present application may be a human face or an entire person identified from the target frame image, or the like.
In an embodiment of the present application, a method for identifying object information is provided. By the method, before object detection is carried out on the target frame image, the target frame image is utilized to extract the target video segment, and then the associated data corresponding to the target video segment is queried. Based on this, a part of candidate object information can be screened out from a large amount of object information as a candidate object information set in combination with the associated data. Therefore, the range of the subsequent object matching is narrowed, the recognition speed is improved, and the recognition accuracy can be improved to a certain extent.
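For illustration only, the following minimal Python sketch shows how the N objects to be identified might be matched against only the M candidate object information screened out from the associated data; the feature-extraction step is stubbed out and all vectors, names and the threshold are hypothetical examples, not taken from the application.

```python
# Minimal sketch: match N detected face features against M candidate feature vectors.
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def recognize_objects(detected_feats: list[np.ndarray],
                      candidate_feats: dict[str, np.ndarray],
                      threshold: float = 0.6) -> list[str]:
    """Return one recognition result per object to be identified."""
    results = []
    for feat in detected_feats:                                   # N objects to be identified
        scored = [(cosine(feat, cf), name) for name, cf in candidate_feats.items()]  # M candidates
        best_sim, best_name = max(scored)
        results.append(best_name if best_sim > threshold else "unrecognized")
    return results

if __name__ == "__main__":
    candidates = {"Li Si": np.array([0.1, 0.9]), "Sun Ba": np.array([0.8, 0.2])}
    detections = [np.array([0.82, 0.21]), np.array([0.05, 0.95])]
    print(recognize_objects(detections, candidates))   # ['Sun Ba', 'Li Si']
```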
Optionally, on the basis of the foregoing respective embodiments corresponding to fig. 5, before obtaining the target video segment corresponding to the target frame image, another optional embodiment provided in this embodiment of the present application may further include:
acquiring a target video, wherein the target video comprises T frame images, and T is an integer greater than 1;
determining a target frame image from a target video;
the obtaining a target video segment corresponding to the target frame image may specifically include:
if T is greater than or equal to the frame number threshold, extracting multi-frame continuous images containing target frame images from the target video to serve as target video fragments;
and if the T is smaller than the frame number threshold, taking the target video as a target video fragment corresponding to the target frame image.
In one or more embodiments, a manner of extracting the target video clip is presented. As can be seen from the foregoing embodiments, the frames within a certain period before and after the target frame image can be used as the target video clip.
Specifically, the target video includes T frame images. Illustratively, assume that T is 57600 and the frame number threshold is 240. In this case T is greater than or equal to the frame number threshold, so a plurality of consecutive images containing the target frame image are extracted from the target video as the target video clip. For example, the 48 frames before and the 48 frames after the target frame image are taken and, together with the target frame image, form a 97-frame target video clip. As another example, assume that T is 120 and the frame number threshold is 240; T is then smaller than the frame number threshold, so the target video itself may be taken as the target video clip corresponding to the target frame image.
Further, in the embodiment of the present application, a way of extracting the target video clip is provided: consecutive frames before and after the target frame image are extracted to form the target video clip. Since the target video clip contains the current target frame image, the characteristics of the target frame image can be expressed more comprehensively.
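For illustration only, the following minimal Python sketch implements the clip-extraction rule just described, assuming the target video is available as a list of decoded frames; the window of 48 frames on each side and the frame number threshold of 240 follow the illustrative numbers above.

```python
# Minimal sketch: extract the target video clip around the target frame image.

def extract_target_clip(frames: list, target_idx: int,
                        frame_threshold: int = 240, half_window: int = 48) -> list:
    """Return the target video clip containing the target frame image."""
    if len(frames) < frame_threshold:
        return frames                                  # short video: use the whole video as the clip
    start = max(0, target_idx - half_window)
    end = min(len(frames), target_idx + half_window + 1)
    return frames[start:end]                           # up to 97 consecutive frames around the target

if __name__ == "__main__":
    video = list(range(57600))                         # stand-in for 57,600 decoded frames
    clip = extract_target_clip(video, target_idx=86)
    print(len(clip), clip[0], clip[-1])                # 97 consecutive frames around frame 86
```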
Optionally, on the basis of the foregoing respective embodiments corresponding to fig. 5, in another optional embodiment provided in this embodiment of the present application, after determining, according to the candidate object information set, an object recognition result for the target frame image, the method may further include:
when an object identification request sent by a terminal is received, an object identification result is obtained from a database according to the object identification request, wherein the object identification request carries a target video identifier of a target video and a target frame time of a target frame image, and the database records the video identifier, the frame time corresponding to the video identifier and the object identification result corresponding to each frame time;
and sending the object identification result to the terminal so that the terminal displays the object identification result.
In one or more embodiments, a manner of object recognition in an offline scenario is presented. As can be seen from the foregoing embodiments, the offline recognition can recognize each frame of image in the target video, and store the object recognition result in the database, so as to directly extract the object recognition result of a certain frame of image from the database.
Specifically, offline recognition typically processes the target video frame by frame, or on selected key frames. Take the identification of one frame image (i.e., the target frame image) as an example. First, the target video identifier of the target video in which the target frame image is located is obtained, together with the target frame time of the target frame image. The image data of that frame image (i.e., the target frame image) is then acquired by querying the video information storage module (i.e., the database). For ease of understanding, refer to Table 1, which illustrates the first mapping relationship in the video information storage module (i.e., the database).
TABLE 1

Video identification | Frame time | Image data
mcz24dfac6d          | 1527       | 110010110110…
mcz24dfac6d          | 1528       | 100100101010…
v14dace528a          | 2665       | 101111010101…
Here, image data refers to the binary data of an image; the specific form depends on the image format. It can be seen that, based on the first mapping relationship shown in Table 1, assuming the target video identifier is "mcz24dfac6d" and the target frame time is "1528", the image data corresponding to the target frame image can be obtained.
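For illustration only, the following minimal Python sketch models the first mapping relationship of Table 1 as a lookup keyed by (video identifier, frame time); the keys mirror the illustrative values in the table and the byte strings are placeholders, not real image data.

```python
# Minimal sketch: fetch the target frame image via the first mapping relationship.

FIRST_MAPPING: dict[tuple[str, int], bytes] = {
    ("mcz24dfac6d", 1527): b"\xb6\xd8...",     # binary image data, format-dependent
    ("mcz24dfac6d", 1528): b"\x92\xa4...",
    ("v14dace528a", 2665): b"\xbd\x57...",
}

def get_target_frame_image(video_id: str, frame_time: int) -> bytes | None:
    """Look up image data by (video identifier, frame time)."""
    return FIRST_MAPPING.get((video_id, frame_time))

if __name__ == "__main__":
    data = get_target_frame_image("mcz24dfac6d", 1528)
    print(data is not None)    # True
```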
Thus, after the object recognition result for the target frame image is obtained, the object recognition result may also be stored in the database. For ease of understanding, refer to Table 2, which illustrates the storage relationship in the database.

TABLE 2

Video identification | Frame time | Object recognition result
mcz24dfac6d          | 1527       | None
mcz24dfac6d          | 1528       | Zhang San, Li Si
v14dace528a          | 2665       | Zhang San, Wang Wu
When the server receives the object identification request sent by the terminal, the target video identifier of the target video and the target frame time of the target frame image can be obtained by parsing the object identification request. Based on the target video identifier and the target frame time, the server queries the database for the corresponding object identification result. After obtaining the object identification result, the server feeds it back to the terminal, which displays it.
It should be noted that, in order to obtain a specified frame image of a certain video, a method of extracting image data from a database may be adopted, or a method of directly extracting a corresponding frame image from a video file may be adopted, which is only illustrative and should not be construed as limiting the present application.
In the embodiment of the application, a method for identifying an object in an offline scenario is provided. In this way, offline recognition identifies each frame image in advance and stores the recognition results; when the user triggers the recognition function, the stored object recognition result is queried directly. Offline recognition therefore avoids repeated recognition and saves processing resources.
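For illustration only, the following minimal Python sketch shows the offline flow just described: pre-computed results are stored per frame (as in Table 2) and served directly when an object identification request arrives. The storage layout and request format are assumptions made for this example.

```python
# Minimal sketch: serve pre-computed object recognition results for an offline scenario.

RESULT_STORE: dict[tuple[str, int], list[str]] = {
    ("mcz24dfac6d", 1527): [],
    ("mcz24dfac6d", 1528): ["Zhang San", "Li Si"],
    ("v14dace528a", 2665): ["Zhang San", "Wang Wu"],
}

def handle_object_identification_request(request: dict) -> list[str]:
    """Parse the request, look up the pre-computed result and return it to the terminal."""
    video_id = request["target_video_id"]         # target video identifier
    frame_time = request["target_frame_time"]     # target frame time
    return RESULT_STORE.get((video_id, frame_time), [])

if __name__ == "__main__":
    req = {"target_video_id": "mcz24dfac6d", "target_frame_time": 1528}
    print(handle_object_identification_request(req))   # ['Zhang San', 'Li Si']
```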
Optionally, on the basis of the foregoing respective embodiments corresponding to fig. 5, in another optional embodiment provided in the present application, determining the target frame image from the target video may specifically include:
receiving an object identification request sent by a terminal, wherein the object identification request carries a target video identifier of the target video and a target frame time of the target frame image;
responding to the object identification request, and acquiring a target frame image through a first mapping relation based on the target video identification and the target frame time, wherein the first mapping relation is used for representing the mapping relation among the video identification, the frame time and the image data;
may further include:
and sending the object identification result to the terminal so that the terminal displays the object identification result.
In one or more embodiments, a manner of object recognition in an online scenario is presented. As can be seen from the foregoing embodiments, the online recognition is performed based on the object recognition request triggered by the user, and after the object recognition result is obtained, the object recognition result is fed back to the terminal and displayed by the terminal.
Specifically, in a normal case, the object identification request sent by the terminal carries the target video identifier of the target video and the target frame time of the target frame image. Based on this, for ease of understanding, please refer to table 1 again, image data of the target frame image may be obtained according to the first mapping relationship shown in table 1. After the object recognition result for the target frame image is obtained, the object recognition result can be fed back to the terminal, and the terminal displays the object recognition result.
In the embodiment of the application, a method for identifying an object in an online scenario is provided. In this way, a frame image is identified only when the user triggers the function, so recognition is performed on demand; this is more cost-effective for videos with low popularity.
Optionally, in another optional embodiment provided in the embodiment of the present application, on the basis of the respective embodiments corresponding to fig. 5, the associated data may include target participant information, target interaction information, target subtitle information, and target audio data;
determining the candidate object information set according to the association data may specifically include:
determining a first candidate object information set according to the target participant information;
determining a second candidate object information set according to the target interaction information;
determining a third candidate object information set according to the target subtitle information;
determining a fourth candidate object information set according to the target audio data;
and performing de-duplication processing on the first candidate object information set, the second candidate object information set, the third candidate object information set and the fourth candidate object information set to obtain candidate object information sets.
In one or more embodiments, a manner of constructing a set of candidate object information based on association data is presented. As can be seen from the foregoing embodiments, the server includes a central control module, and the central control module can obtain the target participant information, the target interaction information, the target subtitle information and the target audio data.
Specifically, for ease of understanding, referring to fig. 6, fig. 6 is a schematic diagram of a working module of the central control module in the embodiment of the present application. As shown in the drawing, the central control module retrieves the target participant information, the target interaction information, the target subtitle information and the target audio data from the video information storage module (i.e., the database), respectively. Thereby, at least one piece of corresponding candidate object information is obtained for each source. For ease of understanding, referring to table 3, table 3 is an illustration of the respective candidate information sets.
TABLE 3
Candidate object information set | Candidate object information
First candidate object information set | Yang Yi, Sun Ba, Li Si
Second candidate object information set | Li Si
Third candidate object information set | Sun Ba, Li Si
Fourth candidate object information set | Ma Liu
It can be seen that "Yang Yi, grandchild eight, and Lifour" belong to the first candidate information set. "Lifour" belongs to the second candidate information set. "Suneight, lifour" belongs to the third candidate information set. "Ma six" belongs to the fourth candidate information set. The candidate information is subjected to deduplication processing, so that a candidate information set is obtained, and the candidate information set comprises 'Yang Yi, grandchild eight, plum four and horse six'.
Illustratively, in one specific implementation, a union of the first candidate information set, the second candidate information set, the third candidate information set, and the fourth candidate information set is taken to obtain the candidate information set. Therefore, M pieces of candidate object information contained in the candidate object information set are used as the basis of object matching, so that the aim of reducing the matching range is fulfilled, and the recognition efficiency is improved.
In another specific implementation, the first candidate object information set is first used as the candidate object information set, and object identification is performed on the target frame image based on this candidate object information set. If no object information can be identified, object identification may be further performed on the target frame image based on the other candidate object information sets (i.e., at least one of the second candidate object information set, the third candidate object information set, and the fourth candidate object information set). On the one hand, since the first candidate object information set can be acquired more quickly, preferentially using it as the candidate object information set for object identification can improve the object identification efficiency. On the other hand, when identification fails, at least one of the second candidate object information set, the third candidate object information set and the fourth candidate object information set can be further utilized for matching, thereby improving the success rate and accuracy of object information identification.
In another specific implementation, the first candidate object information set is first used as a candidate object information set, and object identification is performed on the target frame image based on the candidate object information set. If object information cannot be identified, object identification may be further performed on the target frame image based on other candidate object information sets (i.e., at least one of the second candidate object information set, the third candidate object information set, and the fourth candidate object information set). If the object information is still not recognized, the object recognition can be performed on the target frame image by using the whole network data. Therefore, the success rate and accuracy of object information identification can be improved to a greater extent.
Secondly, in the embodiment of the application, a way of constructing a candidate object information set based on the association data is provided. Through the mode, the matching range of the object information can be reduced to a large extent by combining the participant information, the interaction information, the caption information and the audio data, so that the recognition speed can be improved, and the recognition accuracy can be improved to a certain extent.
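As an illustration only, the union-based construction and the tiered fallback described above might be sketched as follows; the recognizer callable recognize_with is an assumption for the example.

```python
# Hypothetical sketch: merging the four per-source candidate sets with a union
# (which also deduplicates), plus a tiered variant that falls back to the
# remaining sets only when the participant-based set yields no match.
def build_candidate_set(participant_set, interaction_set, subtitle_set, audio_set):
    return participant_set | interaction_set | subtitle_set | audio_set


def tiered_recognition(frame_image, participant_set, other_sets, recognize_with):
    result = recognize_with(frame_image, participant_set)
    if result:
        return result
    for candidate_set in other_sets:  # second, third and fourth sets in turn
        result = recognize_with(frame_image, candidate_set)
        if result:
            return result
    return None  # a further fallback to whole-network data is also possible
```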
Optionally, based on the foregoing respective embodiments corresponding to fig. 5, in another optional embodiment provided in the present application, acquiring association data for a target video segment may specifically include:
Acquiring a target video identifier corresponding to a target video segment;
acquiring target participant information through a second mapping relation based on the target video identification, wherein the second mapping relation is used for representing the mapping relation between the video identification and the participant information;
determining the candidate object information set according to the association data may specifically include:
and generating a first candidate object information set according to the target participant information, wherein the first candidate object information set is contained in the candidate object information set.
In one or more embodiments, a manner of narrowing down object information using target participant information is presented. As can be seen from the foregoing embodiments, the candidate information set includes a first candidate information set.
Specifically, target participant information of the target video is obtained by querying a video information storage module (i.e., database). For ease of understanding, referring to table 4, table 4 is an illustration of the second mapping relationship in the video information storage module (i.e., database).
TABLE 4
Video identification | Participant information
mcz24dfac6d | Yang Yi, Li Si, Zhao Wu
v14dace528a | Chen Jiu, Zhang Er, Sun Ba
It can be seen that, based on the second mapping relationship shown in table 4, assuming that the target video identifier is "mcz24dfac6d", the target participant information is "Yang Yi, Li Si, Zhao Wu", and thus the resulting first candidate object information set includes "Yang Yi, Li Si, Zhao Wu".
Next, in the embodiment of the present application, a way to narrow the range of the object information by using the target participant information is provided. Through the mode, the identification speed can be improved to a certain extent, so that the feasibility of the scheme is improved.
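A minimal illustrative sketch of the second mapping relation, with the table contents mirroring the Table 4 example (the dictionary is a stand-in for the database and is an assumption for illustration):

```python
# Hypothetical sketch: video identifier -> participant information.
second_mapping = {
    "mcz24dfac6d": {"Yang Yi", "Li Si", "Zhao Wu"},
    "v14dace528a": {"Chen Jiu", "Zhang Er", "Sun Ba"},
}


def first_candidate_set(target_video_id):
    """Return the first candidate object information set for a video."""
    return set(second_mapping.get(target_video_id, set()))
```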
Optionally, based on the foregoing respective embodiments corresponding to fig. 5, in another optional embodiment provided in the present application, acquiring association data for a target video segment may specifically include:
acquiring a target video identifier corresponding to a target video segment;
acquiring a frame time corresponding to each frame of image in a target video segment;
acquiring target interaction information through a third mapping relation based on the target video identification and the frame time corresponding to each frame image in the target video fragment, wherein the third mapping relation is used for representing the mapping relation among the video identification, the frame time and the interaction information;
determining the candidate object information set according to the association data may specifically include:
and generating a second candidate object information set according to the target interaction information, wherein the second candidate object information set is contained in the candidate object information set.
In one or more embodiments, a manner of narrowing down object information using targeted interaction information is presented. As can be seen from the foregoing embodiments, the candidate information set includes at least the second candidate information set.
Specifically, the target interaction information of the target video clip is obtained by querying a video information storage module (i.e., a database). For ease of understanding, referring to table 5, table 5 is an illustration of a third mapping relationship in a video information storage module (i.e., database).
TABLE 5
It can be seen that, based on the third mapping relationship shown in table 5, assuming that the target video identifier is "mcz24dfac6d", the start frame time of the target video clip is the 1663rd frame time and the end frame time is the 1780th frame time, the available target interaction information comprises "Li Si is about to come on stage, haha" and "Zhao Wu comes on stage". Thus, the resulting second candidate information set includes "Li Si, Zhao Wu".
It should be noted that, the interactive information corresponding to each frame time may be one or more pieces, which is not limited herein.
In the embodiment of the application, a way of reducing the range of the object information by using the target interaction information is provided. Through the mode, the identification speed can be improved to a certain extent, so that the feasibility of the scheme is improved.
Optionally, on the basis of the foregoing respective embodiments corresponding to fig. 5, in another optional embodiment provided in the present application, generating the second candidate object information set according to the target interaction information may specifically include:
Word segmentation processing is carried out on the target interaction information to obtain a word set;
obtaining P entity words from a word set, wherein P is an integer greater than or equal to 1;
aiming at each entity word in the P entity words, if the entity word belongs to the object entity word, the entity word is used as candidate object information in the second candidate object information set;
and for each entity word in the P entity words, if the entity word belongs to the role entity word, converting the role entity word into the object entity word, and taking the object entity word as candidate object information in the second candidate object information set.
In one or more embodiments, a manner of analyzing the target interaction information is presented. As can be seen from the foregoing embodiments, the target interaction information may include at least one item of interaction information, and thus the interaction information can be processed item by item. The following description takes the target interaction information { "Li Si is about to come on stage, haha", "Why doesn't the Queen Consort tell the truth" } as an example.

Specifically, for ease of understanding, referring to fig. 7, fig. 7 is a schematic flow chart of analysis of interactive information in the embodiment of the present application. As shown in the drawing, word segmentation is performed on the target interaction information to obtain a word set. For example, the word set obtained after word segmentation of "Li Si is about to come on stage, haha" includes { "Li Si", "is about to", "come on stage", "haha" }. The word set obtained after word segmentation of "Why doesn't the Queen Consort tell the truth" includes { "Queen Consort", "why", "doesn't", "tell", "the truth" }.

The word set obtained after word segmentation is then analyzed by using an NLP technology to extract the entities, and P entity words are obtained after analysis. Based on the above example, the P entity words include { "Li Si", "Queen Consort" }. Among them, "Li Si" belongs to the object entity words, and "Queen Consort" belongs to the character entity words. The object entity word "Li Si" is directly used as candidate object information in the second candidate object information set. The character entity word "Queen Consort" can be converted through the target participant information, that is, the target participant information is queried to find that the player of the "Queen Consort" is "Sun Ba"; at this time, the character entity word "Queen Consort" is converted into the object entity word "Sun Ba", and the converted object entity word "Sun Ba" is used as candidate object information in the second candidate object information set.
Finally, the object entity words are used as candidate object information in the second candidate object information set.
It will be appreciated that if the player of the character entity word "Queen Consort" cannot be determined from the target participant information, a whole-network search can be conducted based on the character entity word "Queen Consort" to obtain one or more object entity words that have played the character "Queen Consort". If a plurality of object entity words exist, the film name information, version information, director information and the like of the target video can be further combined to narrow the range of the object entity words, thereby converting the character entity word into an object entity word. Furthermore, the searched object entity words and the correspondence between the object entity words and the corresponding character entity words can be supplemented into the target participant information.
The NLP technique in the present application refers to a deep learning-based NLP technique, for example, a Bidirectional Encoder Representations from Transformers (BERT) model or a Long Short-Term Memory (LSTM) model, which is not limited herein.
In the embodiment of the application, a method for analyzing the target interaction information is provided. By this method, entity words can be extracted from the interaction information by utilizing semantic analysis technology and finally associated to object entity words. Thus, the object entity words may be directly used as candidate object information. Thereby, the feasibility and operability of the interaction information analysis are ensured.
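By way of illustration, the item-by-item analysis of the interaction information might be sketched as follows; the word segmentation and entity extraction are assumed to be provided by an NLP model (e.g., BERT- or LSTM-based) and are passed in as callables, and role_to_actor stands in for the role-player relation derived from the target participant information.

```python
# Hypothetical sketch of interaction-information analysis.
def second_candidate_set(interaction_texts, segment, extract_entities,
                         actor_names, role_to_actor):
    candidates = set()
    for text in interaction_texts:
        words = segment(text)                   # word segmentation
        for entity in extract_entities(words):  # entity extraction
            if entity in actor_names:           # already an object entity word
                candidates.add(entity)
            elif entity in role_to_actor:       # character entity word -> player
                candidates.add(role_to_actor[entity])
            # otherwise the role could be resolved by a whole-network search
    return candidates
```

For instance, with role_to_actor assumed to be { "Queen Consort": "Sun Ba" }, the two example comments above would yield the set { "Li Si", "Sun Ba" }.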
Optionally, based on the foregoing respective embodiments corresponding to fig. 5, in another optional embodiment provided in the present application, acquiring association data for a target video segment may specifically include:
acquiring a target video identifier corresponding to a target video segment;
acquiring a frame time corresponding to each frame of image in a target video segment;
acquiring target subtitle information through a fourth mapping relation based on the target video identifier and the frame time corresponding to each frame image in the target video fragment, wherein the fourth mapping relation is used for representing the mapping relation among the video identifier, the frame time and the subtitle information;
Determining the candidate object information set according to the association data may specifically include:
and generating a third candidate object information set according to the target subtitle information, wherein the third candidate object information set is contained in the candidate object information set.
In one or more embodiments, a manner of narrowing down the scope of object information using target subtitle information is presented. As can be seen from the foregoing embodiments, the candidate information set includes at least a third candidate information set.
Specifically, the target subtitle information of the target video clip is acquired in a manner of querying a video information storage module (i.e., database). For ease of understanding, referring to table 6, table 6 is an illustration of a fourth mapping relationship in a video information storage module (i.e., database).
TABLE 6
Video identification | Frame time | Subtitle information
mcz24dfac6d | 1663 | Queen Consort, can't you rest for a while
mcz24dfac6d | 1664 | The Crown Prince is coming over
It can be seen that, based on the fourth mapping relationship shown in table 6, assuming that the target video identifier is "mcz24dfac6d", the start frame time of the target video clip is the 1663rd frame time and the end frame time is the 1780th frame time, the available target caption information includes { "Queen Consort, can't you rest for a while", "The Crown Prince is coming over", … }. Thereby, the subtitle information can be converted to obtain the third candidate object information set.
Note that, the caption information corresponding to each frame time may be one or more pieces, which is not limited herein.
Next, in the embodiment of the present application, a manner of narrowing down the range of object information using target subtitle information is provided. Through the mode, the identification speed can be improved to a certain extent, so that the feasibility of the scheme is improved.
Optionally, on the basis of the foregoing respective embodiments corresponding to fig. 5, in another optional embodiment provided in the present application, generating the third candidate object information set according to the target subtitle information may specifically include:
word segmentation processing is carried out on the target subtitle information to obtain a word set;
acquiring Q role entity words from a word set, wherein Q is an integer greater than or equal to 1;
for each of the Q persona entity terms, converting the persona entity term into an object entity term, and taking the object entity term as candidate information in the third candidate information set.
In one or more embodiments, a manner of analyzing target subtitle information is presented. As can be seen from the foregoing embodiments, the target subtitle information may include at least one subtitle, and thus the subtitles may be processed piece by piece. The following description takes the target subtitle information "Queen Consort, can't you rest for a while" as an example.

Specifically, for ease of understanding, referring to fig. 8, fig. 8 is a schematic flow chart of analysis of caption information in the embodiment of the present application. As shown in the drawing, word segmentation is first performed on the target caption information to obtain a word set. For example, the word set obtained after word segmentation of "Queen Consort, can't you rest for a while" includes { "Queen Consort", "you", "can't", "rest", "for a while" }.

The word set obtained after word segmentation is then analyzed by using an NLP technology to extract the entities, and Q character entity words are obtained after analysis. Based on the above example, the Q character entity words include "Queen Consort". Thus, conversion can be performed through the target participant information, that is, the target participant information is queried to find that the player of the "Queen Consort" is "Sun Ba"; at this time, the character entity word "Queen Consort" is converted into the object entity word "Sun Ba".
Finally, the object entity word Sun Ba obtained after conversion is used as candidate object information in the third candidate object information set.
It will be appreciated that if the player of the character entity word "Queen Consort" cannot be determined from the target participant information, a whole-network search can be conducted based on the character entity word "Queen Consort" to obtain one or more object entity words that have played the character "Queen Consort". If a plurality of object entity words exist, the film name information, version information, director information and the like of the target video can be further combined to narrow the range of the object entity words, thereby converting the character entity word into an object entity word. Furthermore, the searched object entity words and the correspondence between the object entity words and the corresponding character entity words can be supplemented into the target participant information.
Note that, the NLP technique in the present application refers to an NLP technique based on deep learning, for example, using BERT or LSTM model, and the present invention is not limited thereto.
Again, in the embodiment of the present application, a manner of analyzing target subtitle information is provided. By this method, character entity words can be extracted from the subtitle information by utilizing semantic analysis technology and finally associated to object entity words. Thus, the object entity words may be used as candidate object information. Thereby ensuring the feasibility and operability of the subtitle information analysis.
Optionally, based on the foregoing respective embodiments corresponding to fig. 5, in another optional embodiment provided in the present application, acquiring association data for a target video segment may specifically include:
acquiring a starting frame time and an ending frame time of a target video segment;
acquiring target audio data according to the starting frame time and the ending frame time;
determining the candidate object information set according to the association data may specifically include:
and generating a fourth candidate object information set according to the target audio data, wherein the fourth candidate object information set is contained in the candidate object information set.
In one or more embodiments, a manner of narrowing down object information using target audio data is presented. As can be seen from the foregoing embodiments, the candidate information set includes at least a fourth candidate information set.
Specifically, overall dubbing audio is extracted from the target video. Based on this, the start frame time and the end frame time of the target video clip are first acquired, and then the dubbing audio between the start frame time and the end frame time is extracted from the overall dubbing audio as the target audio data. Voiceprint analysis processing is then carried out on the target audio data to finally obtain the fourth candidate object information set.
Next, in the embodiment of the present application, a way to narrow down the range of object information using target audio data is provided. Through the mode, the identification speed can be improved to a certain extent, so that the feasibility of the scheme is improved.
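For illustration only, extracting the target audio data from the overall dubbing audio might look like the following sketch; the frame rate and the per-second sample count are assumed values, and the audio is modelled simply as a flat sample sequence.

```python
# Hypothetical sketch: frame times are converted to timestamps with the video
# frame rate, and the corresponding span is cut out of the overall dubbing audio.
def slice_target_audio(dubbing_audio, start_frame, end_frame,
                       fps=25, samples_per_second=16000):
    start_idx = int(start_frame / fps * samples_per_second)
    end_idx = int(end_frame / fps * samples_per_second)
    return dubbing_audio[start_idx:end_idx]
```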
Optionally, on the basis of the foregoing respective embodiments corresponding to fig. 5, in another optional embodiment provided in the embodiment of the present application, generating the fourth candidate object information set according to the target audio data may specifically include:
performing voiceprint recognition processing on the target audio data to obtain R dubbing entity words, wherein R is an integer greater than or equal to 1;
For each of the R soundtrack entity words, converting the soundtrack entity word into a character entity word;
and for the role entity word corresponding to each dubbing entity word in the R dubbing entity words, converting the role entity word into an object entity word, and taking the object entity word as candidate object information in the fourth candidate object information set.
In one or more embodiments, a manner of analyzing target audio data is presented. As can be seen from the foregoing embodiments, the voiceprint recognition processing is performed on the target audio data. Since the vocal organs (tongue, teeth, throat, lungs, nasal cavities, etc.) used by a person when speaking vary greatly in size and morphology, different subjects can be distinguished by voiceprint recognition techniques.
Specifically, for ease of understanding, referring to fig. 9, fig. 9 is a schematic flow chart of analyzing audio data in the embodiment of the present application. As shown in the drawing, voiceprint feature extraction is first performed on the target audio data by using an audio analysis technology, and the corresponding dubbing persons are then identified according to the voiceprint features, that is, R dubbing entity words are obtained. For example, the R dubbing entity words include "Ji San" and "Hou Qi". On this basis, the role dubbed by each dubbing person can be found by querying the role audio information, that is, the character entity word corresponding to each dubbing entity word is determined. For ease of understanding, assume that the target video identifier is "mcz24dfac6d"; on this basis, referring to table 7, table 7 is an illustration of the relationship among the roles, dubbing staff and participant information in the video information storage module (i.e., database).
TABLE 7
Video identification | Role | Dubbing staff | Participant information
mcz24dfac6d | Queen Consort | Ji San | Sun Ba
mcz24dfac6d | Crown Prince | Zhao Wu | Zhao Wu
mcz24dfac6d | Guard | Li Si | Li Si
Based on this, for example, the dubbing entity word "Ji San" corresponds to the character entity word "Queen Consort". Thus, the target participant information is queried to find the player of the role, that is, the object entity word corresponding to the character entity word is determined, for example, "Sun Ba".
And finally, taking the object entity words obtained after conversion as candidate object information in the fourth candidate object information set.
It will be appreciated that if the player of the character entity word "Queen Consort" cannot be determined from the target participant information, a whole-network search can be conducted based on the character entity word "Queen Consort" to obtain one or more object entity words that have played the character "Queen Consort". If a plurality of object entity words exist, the film name information, version information, director information and the like of the target video can be further combined to narrow the range of the object entity words, thereby converting the character entity word into an object entity word.
Again, in the embodiments of the present application, a way of analyzing target audio data is provided. By this method, character entity words can be extracted from the audio data by utilizing voiceprint analysis technology and finally associated to object entity words. Thus, the object entity words may be used as candidate object information. Thereby ensuring the feasibility and operability of the audio data analysis.
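An illustrative sketch of the dubbing-entity chain shown in Table 7; recognize_voiceprints is an assumed callable performing the voiceprint recognition, and the two dictionaries stand in for the role audio information and the target participant information respectively.

```python
# Hypothetical sketch: dubbing entity word -> character entity word -> object entity word.
def fourth_candidate_set(target_audio, recognize_voiceprints,
                         dubber_to_role, role_to_actor):
    candidates = set()
    for dubber in recognize_voiceprints(target_audio):  # e.g. {"Ji San", "Hou Qi"}
        role = dubber_to_role.get(dubber)                # e.g. "Queen Consort"
        actor = role_to_actor.get(role)                  # e.g. "Sun Ba"
        if actor:
            candidates.add(actor)
    return candidates
```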
Optionally, on the basis of the foregoing respective embodiments corresponding to fig. 5, another optional embodiment provided in the embodiment of the present application may further include:
acquiring an image detection result through an object detection model based on the target frame image;
if the image detection result includes N detection areas, determining that N objects to be identified exist in the target frame image, where each detection area corresponds to one object to be identified, and each detection area corresponds to a set of position parameters.
In one or more embodiments, a method of detecting an object to be identified is presented. As can be seen from the foregoing embodiments, before performing object recognition, it is necessary to perform object detection, extract a region where an object is located in an image, and then further recognize the object in the region.
Specifically, taking a target frame image as an example, the target frame image is input to a trained object detection model, and an image detection result is output by the object detection model. If the image detection result indicates no detection result, no object recognition is required. If the image detection result indicates that at least one detection area (i.e., N detection areas) exists, it is determined that at least one object to be recognized (i.e., N objects to be recognized) exists in the target frame image. The detection areas are also understood to be Bounding boxes, and each detection area corresponds to a set of position parameters. Each set of position parameters typically includes 4 parameters, namely, the abscissa, the ordinate, the width, and the height of a particular point (e.g., a center point) in the detection zone.
Based on this, the specific position of the object to be identified in the target frame image can be known from the detection area.
It should be noted that the object detection model may be a Region-based convolutional network (Region-based Convolutional Network, R-CNN) model, or a single-shot multi-box detector (Single Shot MultiBox Detector, SSD), or a you look only once (You Only Look Once, YOLO) model, etc., which is not limited herein.
Secondly, in the embodiment of the application, a manner of detecting the object to be identified is provided. According to the method, before the object recognition is carried out, the detection area for carrying out subsequent recognition is extracted based on the object detection model, so that only the detection area is subjected to object recognition, the calculated amount is reduced, and the calculation resources are saved.
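A minimal illustrative sketch of the detection step; detect_objects stands in for any trained object detection model (R-CNN, SSD, YOLO and the like) and is assumed to return one scored (cx, cy, w, h) box per candidate region.

```python
# Hypothetical sketch: keep the detection areas whose confidence passes a threshold.
def detect_regions(frame_image, detect_objects, min_score=0.5):
    regions = []
    for box, score in detect_objects(frame_image):
        if score >= min_score:
            cx, cy, w, h = box  # centre-point abscissa/ordinate, width and height
            regions.append((cx, cy, w, h))
    return regions              # N detection areas; an empty list means nothing to recognize
```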
Optionally, on the basis of the foregoing respective embodiments corresponding to fig. 5, in another optional embodiment provided in the present application, determining, according to the candidate object information set, an object recognition result for the target frame image may specifically include:
based on the N detection areas, obtaining target feature vectors corresponding to each detection area in the N detection areas through a feature extraction model so as to obtain N target feature vectors;
Based on the M candidate object information, obtaining candidate feature vectors corresponding to each candidate object information in the M candidate object information through a fifth mapping relation to obtain M candidate feature vectors, wherein the fifth mapping relation is used for representing the mapping relation between the candidate object information and the feature vectors;
and carrying out similarity calculation on the N target feature vectors and the M candidate feature vectors, and generating an object identification result aiming at the target frame image according to the similarity.
In one or more embodiments, a manner of generating object recognition results is presented. As can be seen from the foregoing embodiments, for each detection area, feature extraction is required, and then similarity calculation is performed with candidate feature vectors stored in a video information storage module (i.e., database), so as to finally obtain an object recognition result.
Specifically, for ease of understanding, referring to fig. 10, fig. 10 is a schematic flow chart of object feature matching in the embodiment of the present application, and the process as shown in the figure is as follows:
in step C1, target participant information is acquired, that is, corresponding participant information is queried through the target video identification, and the participant information is taken as target participant information.
In step C2, M pieces of candidate object information included in the candidate object information set are acquired, where the M pieces of candidate object information are determined identification ranges.
In step C3, candidate feature vectors corresponding to each candidate object information are obtained through a fifth mapping relationship based on the M candidate object information, so as to obtain face feature data of all objects in the current recognition range. For ease of understanding, referring to table 8, table 8 is an illustration of the fifth mapping relationship in the video information storage module (i.e., database).
TABLE 8
Object identification | Object information | Feature vector
86619 | Sun Ba | (0.61, 0.25, 0.39, 0.55, 0.49, 0.47, 0.95, …)
85132 | Li Si | (0.33, 0.17, 0.22, 0.87, 0.91, 0.10, 0.27, …)
Assuming that the M pieces of candidate object information are "Sun Ba" and "Li Si", the candidate feature vector of each piece of candidate object information can be obtained based on the fifth mapping relationship shown in table 8.
In step C4, based on the N detection regions, a target feature vector corresponding to each of the N detection regions is obtained by the feature extraction model. That is, a deep learning algorithm (for example, a convolutional neural network) is used to extract the face features from the input detection region.
In step C5, similarity is calculated between the N target feature vectors and the M candidate feature vectors; for example, if N is 3 and M is 4, 12 similarity values are obtained. For each target feature vector, the candidate feature vector with the highest similarity is taken, and whether the similarity is greater than a preset threshold is judged; if so, the matching is successful, and the object recognition result is determined to include the object information corresponding to that candidate feature vector.
In step C6, according to the target video identifier and the target frame time, whether there is a correct recognition result actively fed back by the user is queried, and if there is, the recognition result is used as a final object recognition result.
In step C7, the final object recognition result is returned to the terminal.
Again, in the embodiment of the present application, a manner of generating an object recognition result is provided. In the above manner, the feature vectors of the objects identified in the image can be calculated by using the feature vectors stored in the video information storage module (i.e., database), so as to match the object information. Thereby, the feasibility and operability of the scheme is increased.
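Purely for illustration, the matching in steps C3 to C5 might be sketched as below, using cosine similarity and a preset threshold; the candidate feature vectors are assumed to come from the fifth mapping relation, and the target feature vectors from the feature extraction model.

```python
import math


# Hypothetical sketch of feature matching between detection areas and candidates.
def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def match_objects(target_vectors, candidate_vectors, threshold=0.8):
    """candidate_vectors: {candidate object information: feature vector}."""
    results = []
    for vec in target_vectors:
        best_name, best_sim = None, 0.0
        for name, cand in candidate_vectors.items():
            sim = cosine_similarity(vec, cand)
            if sim > best_sim:
                best_name, best_sim = name, sim
        if best_name is not None and best_sim >= threshold:
            results.append(best_name)  # matched object information
    return results
```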
Optionally, on the basis of the foregoing respective embodiments corresponding to fig. 5, another optional embodiment provided in the embodiment of the present application may further include:
receiving an object correction request sent by a terminal and aiming at an object recognition result, wherein the object correction request carries a target video identifier, a target frame moment and target object information, and the object recognition result comprises the object information to be corrected;
responding to the object correction request, adding the target video identification, the target frame time and the target object information into a sixth mapping relation, wherein the sixth mapping relation is used for representing the mapping relation among the video identification, the frame time and the object information;
And in response to a correction instruction for the target object information, updating the object information to be corrected stored in the database into the target object information.
In one or more embodiments, a way of performing a correction based on an object recognition result is presented. As can be seen from the foregoing embodiments, after displaying the object recognition result, the user is also supported to report error correction information. For example, when playing a frame of picture in the target video (for example, a target frame image corresponding to the target frame time), the user triggers an object correction request through the terminal, where the object correction request carries the target video identifier, the target frame time and the target object information.
Specifically, for ease of understanding, assume that the target video identifier in the object correction request triggered by user A is "mcz24dfac6d", the target frame time is "6712", and the target object information is "Yang Yi". Based on this, referring to table 9, table 9 is an illustration of the sixth mapping relationship in the video information storage module (i.e., database).
TABLE 9
Video identification | Frame time | Object information
mcz24dfac6d | 6712 | Yang Yi
mcz24dfac6d | 8356 | Zhao Wu
It can be seen that the target video identification, the target frame time and the target object information are added to the sixth mapping relationship in the video information storage module (i.e., database).
By querying the video information storage module (i.e., the database), background staff can determine whether the object information to be corrected corresponding to the target frame image needs to be updated to the target object information. One way is to directly update the object information to be corrected stored in the database to the target object information. Another way is to keep the mapping relation among the target video identification, the target frame time and the target object information, and when object identification is carried out subsequently, this result can be invoked as the final object identification result.
It will be appreciated that the target object information reported by the user may contain extra characters, punctuation marks, spaces, or wrongly written characters; therefore, the target object information may need to be normalized before being added to the database.
Next, in the embodiment of the present application, a way of performing correction based on the object recognition result is provided. In this manner, background staff can audit the object information reported by the user rather than updating it directly, which improves the accuracy of object information error correction.
Optionally, on the basis of the foregoing respective embodiments corresponding to fig. 5, another optional embodiment provided in the embodiment of the present application may further include:
Receiving an object correction request sent by a terminal, wherein the object correction request carries a target video identifier, a target frame time, target object information and object information to be corrected;
and responding to the object correction request, and if the accumulated times of the object correction request are greater than or equal to a time threshold, updating the object information to be corrected stored in the database into target object information.
In one or more embodiments, another way of performing correction based on object recognition results is presented. As can be seen from the foregoing embodiments, after displaying the object recognition result, the user is also supported to report error correction information.
Specifically, W object correction requests are received, and each object correction request carries a target video identifier, a target frame time, target object information, and object information to be corrected. The target object information is the object information that the users consider to be correct, and the object information to be corrected is the object information belonging to the object recognition result. W represents the cumulative number of object correction requests; if the cumulative number of object correction requests is greater than or equal to the number threshold, it means that enough users believe that the object appearing in the target frame image corresponding to the target frame time should be described by the target object information rather than by the object information to be corrected.

One way is to directly update the object information to be corrected stored in the database to the target object information. Another way is to keep the mapping relation among the target video identification, the target frame time and the target object information, and when object identification is carried out subsequently, this result can be invoked as the final object identification result.
Next, in the embodiment of the present application, another way of performing correction based on the object recognition result is provided. By this method, correction reports from a large number of users can be integrated to correct the object information. Therefore, while ensuring a certain level of accuracy, the object information can be corrected automatically, saving the labor and time costs of manual auditing.
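An illustrative sketch of the threshold-based automatic correction; the in-memory structures stand in for the database and the threshold value of 10 is an assumption for the example.

```python
from collections import defaultdict

# Hypothetical sketch: correction reports are accumulated per
# (video id, frame time, object info to be corrected, proposed info),
# and the stored result is replaced only once the count reaches the threshold.
correction_counts = defaultdict(int)
stored_results = {}  # (video_id, frame_time) -> object information


def handle_correction(video_id, frame_time, wrong_info, proposed_info, threshold=10):
    key = (video_id, frame_time, wrong_info, proposed_info)
    correction_counts[key] += 1
    if correction_counts[key] >= threshold:
        stored_results[(video_id, frame_time)] = proposed_info
```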
With reference to the foregoing description, a method for identifying object information in the present application will be described below from the perspective of a terminal, and referring to fig. 11, an embodiment of the method for identifying object information in the embodiment of the present application includes:
210. the terminal plays the target video on a video playing page, wherein the video playing page displays an object identification control;
in one or more embodiments, a target video is played on a video play page provided by a terminal. The video playing page is also displayed with an object identification control.
220. The terminal responds to touch operation aiming at an object identification control, and sends an object identification request to a server, wherein the object identification request carries a target video identifier of a target video and a target frame moment of a target frame image;
in one or more embodiments, if the user triggers a touch operation for the object recognition control when the target video is played to a certain screen (e.g., a target frame image), i.e., an object recognition request is sent to the server. The object identification request carries a target video identification of a target video and a target frame time of a target frame image.
Specifically, for ease of understanding, referring to fig. 12, fig. 12 is a schematic diagram illustrating an interface change of a video playing page in an embodiment of the present application, and as shown in fig. 12 (a), D1 is used to indicate an object recognition control. After clicking the object recognition control, the user sends an object recognition request to the server.
230. The terminal receives an object recognition result for the target frame image sent by the server, wherein the object recognition result is determined after the server responds to the object recognition request, and the object recognition result is obtained by adopting the method provided by the embodiment;
In one or more embodiments, the server may obtain, in response to the object recognition request, an object recognition result corresponding to the target frame image in an online recognition or offline recognition manner. Thereby, the object recognition result is transmitted to the terminal.
240. And the terminal displays the object identification result on the video playing page.
In one or more embodiments, the terminal displays the object recognition result on the video play page.
Specifically, as shown in fig. 12 (B), D2 is used to indicate the object recognition result. The object recognition result includes recognized object information, such as Zhang San and Li Si; further, the position of each object in the target frame image can be circled, and specific information about the object and the like can also be displayed.
It should be noted that the types and arrangements of the elements on the interface shown in fig. 12 are only illustrative, and should not be construed as limiting the present application.
In an embodiment of the application, an object information identification method is provided. By the mode, the object recognition function has high feasibility and application value under the condition of improving the object recognition speed and accuracy.
Optionally, on the basis of the foregoing respective embodiments corresponding to fig. 11, another optional embodiment provided in this embodiment of the present application may further include, after displaying the object identification result on the video playing page:
Displaying an error correction control;
responding to touch operation for the error correction control, and displaying an object correction area;
displaying target object information and object information to be corrected in response to a text input operation for the object correction area;
and responding to the object correction instruction, and sending an object correction request to the server, wherein the object correction request carries the target video identification, the target frame time and the target object information.
In one or more embodiments, a manner of supporting user result correction is presented. As can be seen from the foregoing embodiments, automatic correction of the object recognition result is achieved by supporting active feedback from the user. If the user considers that the object identification result is wrong, the user can feed back the correct result by triggering a touch operation on the error correction control, and the terminal sends the feedback information to the server. If the feedback information is accurate, it is entered into a correction database (or the target data is updated), so that the recognition result can be corrected (or the correct result directly obtained) during subsequent recognition.
Specifically, for ease of understanding, please refer to fig. 13; fig. 13 is another schematic diagram of interface change of a video playing page in the embodiment of the present application. As shown in fig. 13 (A), E1 is used to indicate an error correction control. When the user clicks the error correction control indicated by E1, the interface shown in fig. 13 (B) is entered. At this time, the object correction area indicated by E2 is displayed, and the user can input text content, that is, input the target object information, for example, "Zhang San". When the user clicks the reporting control indicated by E3, an object correction instruction is triggered, and therefore the terminal sends an object correction request to the server, wherein the object correction request carries the target video identifier, the target frame time and the target object information.
It should be noted that the types and arrangements of the elements on the interface shown in fig. 13 are only illustrative, and should not be construed as limiting the present application.
Secondly, in the embodiment of the application, a way for supporting the user to correct the result is provided. In the above manner, if the user considers that the object recognition result is erroneous, the error correction process can be actively performed thereon. Thereby, a viable implementation is provided for the solution.
Optionally, on the basis of the foregoing respective embodiments corresponding to fig. 11, another optional embodiment provided in this embodiment of the present application may further include, after displaying the object identification result on the video playing page:
displaying an error correction control;
responding to touch operation for an error correction control, and displaying an object correction area for object information to be corrected;
displaying target object information and object information to be corrected in response to a text input operation for an input area;
and responding to the object correction instruction, and sending an object correction request to the server, wherein the object correction request carries the target video identification, the target frame time, the target object information and the object information to be corrected.
In one or more embodiments, another way of supporting user result correction is presented. As can be seen from the foregoing embodiments, automatic correction of the object recognition result is achieved by supporting active feedback from the user. If the user considers that the object identification result is wrong, the user can feed back the correct result by triggering a touch operation on the error correction control, and the terminal sends the feedback information to the server. If the feedback information is accurate, it is entered into a correction database (or the target data is updated), so that the recognition result can be corrected (or the correct result directly obtained) during subsequent recognition.
Specifically, for ease of understanding, referring to fig. 14, fig. 14 is another schematic diagram of interface changes of a video playing page in the embodiment of the present application. As shown in fig. 14 (A), F1 is used to indicate an error correction control. When the user clicks the error correction control indicated by F1, the interface shown in fig. 14 (B) can be entered. At this time, an input area (or a selection area) of the object information to be corrected indicated by F2 is displayed, which supports the user in inputting the object information to be corrected, for example, "Li Si". At the same time, an object correction area indicated by F3 is displayed, which supports the user in inputting the target object information, for example, "Zhang San".
When the user clicks the reporting control indicated by F4, an object correction instruction is triggered, and therefore, the terminal sends an object correction request to the server, wherein the object correction request carries a target video identifier, a target frame time, target object information and object information to be corrected.
It should be noted that the types and arrangements of the elements on the interface shown in fig. 14 are only illustrative, and should not be construed as limiting the present application.
Secondly, in the embodiment of the application, another way of supporting the user to correct the result is provided. In this way, the user can select the object information to be corrected and then perform error correction processing on it. Thereby, a viable implementation is provided for the solution.
Referring to fig. 15, fig. 15 is a schematic diagram showing an embodiment of the object information identifying apparatus according to the embodiment of the present application, the object information identifying apparatus 30 includes:
an obtaining module 310, configured to obtain a target video segment corresponding to a target frame image, where the target video segment includes the target frame image;
the obtaining module 310 is further configured to obtain association data for the target video segment, where the association data includes at least one of target participant information, target interaction information, target subtitle information, and target audio data;
a determining module 320, configured to determine a candidate object information set according to the association data, where the candidate object information set includes M candidate object information, and M is an integer greater than or equal to 1;
the determining module 320 is further configured to determine, according to the candidate object information set, an object recognition result for the target frame image if N objects to be recognized exist in the target frame image, where N is an integer greater than or equal to 1.
In an embodiment of the present application, an object information identification apparatus is provided. By adopting the device, before the object detection is carried out on the target frame image, the target frame image is utilized to extract the target video segment, and then the associated data corresponding to the target video segment is inquired. Based on this, a part of candidate information can be screened out from a large number of candidate information as a candidate information set in combination with the association data. Therefore, the range of the subsequent object matching is narrowed, the recognition speed is improved, and the recognition accuracy can be improved to a certain extent.
Alternatively, on the basis of the embodiment corresponding to fig. 15 described above, in another embodiment of the object information identifying apparatus 30 provided in the embodiment of the present application,
the obtaining module 310 is further configured to obtain a target video before obtaining a target video segment corresponding to the target frame image, where the target video includes a T frame image, and T is an integer greater than 1;
a determining module 320, configured to determine a target frame image from the target video;
the obtaining module 310 is specifically configured to extract, from the target video, a plurality of continuous images including the target frame image as a target video segment if T is greater than or equal to the frame number threshold;
and if the T is smaller than the frame number threshold, taking the target video as a target video fragment corresponding to the target frame image.
In an embodiment of the present application, an object information identification apparatus is provided. By adopting the device, the continuous multi-frame images before and after the extraction based on the target frame images jointly form the target video segment. That is, the target video clip contains the current target frame image, so that the characteristics of the target frame image can be more comprehensively expressed.
Optionally, in another embodiment of the object information identifying apparatus 30 provided in the embodiment of the present application, based on the embodiment corresponding to fig. 15, the object information identifying apparatus 30 further includes a sending module 330;
The obtaining module 310 is further configured to, after determining an object recognition result for the target frame image according to the candidate object information set, obtain the object recognition result from a database according to the object recognition request when receiving the object recognition request sent by the terminal, where the object recognition request carries a target video identifier of the target video and a target frame time of the target frame image, and the database records the video identifier, a frame time corresponding to the video identifier, and an object recognition result corresponding to each frame time;
and a sending module 330, configured to send the object identification result to the terminal, so that the terminal displays the object identification result.
In an embodiment of the present application, an object information identification apparatus is provided. With this apparatus, offline identification recognizes each frame of image in advance and then stores the identification results. When the user triggers the recognition function, the stored object recognition result is directly queried. Therefore, offline recognition avoids repeated recognition and saves processing resources.
Optionally, in another embodiment of the object information identifying apparatus 30 provided in the embodiment of the present application, based on the embodiment corresponding to fig. 15, the object information identifying apparatus 30 further includes a sending module 330;
The determining module 320 is specifically configured to receive an object identification request sent by a terminal, where the object identification request carries a target video identifier of the target video and a target frame time of the target frame image;
responding to the object identification request, and acquiring a target frame image through a first mapping relation based on the target video identification and the target frame time, wherein the first mapping relation is used for representing the mapping relation among the video identification, the frame time and the image data;
and a sending module 330, configured to send the object identification result to the terminal, so that the terminal displays the object identification result.
In an embodiment of the present application, an object information identification apparatus is provided. By adopting the device, the frame image is identified once when the user triggers the function, so that the device can be used as required, and the cost performance is higher for videos with low popularity.
Optionally, in another embodiment of the object information identifying apparatus 30 provided in the embodiment of the present application, on the basis of the embodiment corresponding to fig. 15, the associated data includes target participant information, target interaction information, target subtitle information, and target audio data;
a determining module 320, specifically configured to determine a first candidate object information set according to the target participant information;
Determining a second candidate object information set according to the target interaction information;
determining a third candidate object information set according to the target subtitle information;
determining a fourth candidate object information set according to the target audio data;
and performing de-duplication processing on the first candidate object information set, the second candidate object information set, the third candidate object information set and the fourth candidate object information set to obtain candidate object information sets.
In an embodiment of the present application, an object information identification apparatus is provided. With this apparatus, the matching range of the object information can be greatly narrowed by combining the participant information, the interaction information, the subtitle information and the audio data, so the recognition speed can be improved, and the recognition accuracy can also be improved to a certain extent.
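As a minimal illustrative sketch (the set contents and variable names below are invented for this example), the merging and de-duplication of the four candidate sets can be expressed in Python as a simple set union:

participant_set = {"Actor A", "Actor B"}     # from target participant information
interaction_set = {"Actor B", "Actor C"}     # from target interaction information
subtitle_set = {"Actor C"}                   # from target subtitle information
audio_set = {"Actor A", "Actor D"}           # from target audio data

# The union removes duplicate candidate object information and yields the
# M candidate entries used for the subsequent matching step.
candidate_object_info_set = participant_set | interaction_set | subtitle_set | audio_set
print(sorted(candidate_object_info_set))     # ['Actor A', 'Actor B', 'Actor C', 'Actor D']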
Alternatively, on the basis of the embodiment corresponding to fig. 15 described above, in another embodiment of the object information identifying apparatus 30 provided in the embodiment of the present application,
the obtaining module 310 is specifically configured to obtain a target video identifier corresponding to the target video segment;
acquiring target participant information through a second mapping relation based on the target video identification, wherein the second mapping relation is used for representing the mapping relation between the video identification and the participant information;
The determining module 320 is specifically configured to generate a first candidate object information set according to the target participant information, where the first candidate object information set is included in the candidate object information set.
In an embodiment of the present application, an object information identification apparatus is provided. By adopting the device, the identification speed can be improved to a certain extent, so that the feasibility of the scheme is improved.
Alternatively, on the basis of the embodiment corresponding to fig. 15 described above, in another embodiment of the object information identifying apparatus 30 provided in the embodiment of the present application,
the obtaining module 310 is specifically configured to obtain a target video identifier corresponding to the target video segment;
acquiring a frame time corresponding to each frame of image in a target video segment;
acquiring target interaction information through a third mapping relation based on the target video identification and the frame time corresponding to each frame image in the target video fragment, wherein the third mapping relation is used for representing the mapping relation among the video identification, the frame time and the interaction information;
the determining module 320 is specifically configured to generate a second candidate object information set according to the target interaction information, where the second candidate object information set is included in the candidate object information set.
In an embodiment of the present application, an object information identification apparatus is provided. By adopting the device, the identification speed can be improved to a certain extent, so that the feasibility of the scheme is improved.
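A minimal sketch of the third mapping relation lookup follows; the dictionary contents and function name are assumptions made purely for illustration.

third_mapping = {
    ("video_001", 12.00): ["Actor B looks great in this scene"],
    ("video_001", 12.04): ["Is that Actor C?"],
}

def get_target_interaction_info(video_id: str, frame_times: list) -> list:
    # Collect interaction information for every frame time of the target video segment.
    info = []
    for t in frame_times:
        info.extend(third_mapping.get((video_id, t), []))
    return info

print(get_target_interaction_info("video_001", [12.00, 12.04]))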
Alternatively, on the basis of the embodiment corresponding to fig. 15 described above, in another embodiment of the object information identifying apparatus 30 provided in the embodiment of the present application,
the determining module 320 is specifically configured to perform word segmentation on the target interaction information to obtain a word set;
obtaining P entity words from a word set, wherein P is an integer greater than or equal to 1;
aiming at each entity word in the P entity words, if the entity word belongs to the object entity word, the entity word is used as candidate object information in the second candidate object information set;
and for each entity word in the P entity words, if the entity word belongs to the role entity word, converting the role entity word into the object entity word, and taking the object entity word as candidate object information in the second candidate object information set.
In an embodiment of the present application, an object information identification apparatus is provided. With this apparatus, entity words can be extracted from the interaction information by semantic analysis and finally associated with object entity words, and the object entity words can then be used directly as candidate object information. This ensures the feasibility and operability of the interaction information analysis.
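The following Python sketch illustrates one possible realization of this step, assuming the interaction text has already been retrieved; the two lookup tables stand in for the semantic analysis technology, and their contents are invented for this example.

object_entity_words = {"Actor B", "Actor C"}          # known object entity words
role_to_object = {"Detective Li": "Actor B"}          # role entity word -> object entity word

def build_second_candidate_set(interaction_texts: list) -> set:
    candidates = set()
    for text in interaction_texts:
        for word in object_entity_words:
            if word in text:                          # entity word already names an object
                candidates.add(word)
        for role_word, object_word in role_to_object.items():
            if role_word in text:                     # convert role entity word to object entity word
                candidates.add(object_word)
    return candidates

print(build_second_candidate_set(["Detective Li is played so well", "Is that Actor C?"]))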
Alternatively, on the basis of the embodiment corresponding to fig. 15 described above, in another embodiment of the object information identifying apparatus 30 provided in the embodiment of the present application,
the obtaining module 310 is specifically configured to obtain a target video identifier corresponding to the target video segment;
acquiring a frame time corresponding to each frame of image in a target video segment;
acquiring target subtitle information through a fourth mapping relation based on the target video identifier and the frame time corresponding to each frame image in the target video fragment, wherein the fourth mapping relation is used for representing the mapping relation among the video identifier, the frame time and the subtitle information;
the determining module 320 is specifically configured to generate a third candidate object information set according to the target subtitle information, where the third candidate object information set is included in the candidate object information set.
In an embodiment of the present application, an object information identification apparatus is provided. By adopting the device, the identification speed can be improved to a certain extent, so that the feasibility of the scheme is improved.
Alternatively, on the basis of the embodiment corresponding to fig. 15 described above, in another embodiment of the object information identifying apparatus 30 provided in the embodiment of the present application,
the determining module 320 is specifically configured to perform word segmentation on the target interaction information to obtain a word set;
Acquiring Q role entity words from a word set, wherein Q is an integer greater than or equal to 1;
for each of the Q role entity words, converting the role entity word into an object entity word, and taking the object entity word as candidate object information in the third candidate object information set.
In an embodiment of the present application, an object information identification apparatus is provided. With this apparatus, role entity words can be extracted from the subtitle information by semantic analysis and finally associated with object entity words, and the object entity words can then be used as candidate object information. This ensures the feasibility and operability of the subtitle information analysis.
Alternatively, on the basis of the embodiment corresponding to fig. 15 described above, in another embodiment of the object information identifying apparatus 30 provided in the embodiment of the present application,
the acquiring module 310 is specifically configured to acquire a start frame time and an end frame time of the target video segment;
acquiring target audio data according to the starting frame time and the ending frame time;
the determining module 320 is specifically configured to generate a fourth candidate object information set according to the target audio data, where the fourth candidate object information set is included in the candidate object information set.
In an embodiment of the present application, an object information identification apparatus is provided. By adopting the device, the identification speed can be improved to a certain extent, so that the feasibility of the scheme is improved.
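As a simple illustrative sketch, the target audio data can be cut from the full audio track according to the start and end frame times of the target video segment; the sample rate and the audio buffer below are assumptions made for this example.

def extract_target_audio(audio_samples: list, sample_rate: int,
                         start_time: float, end_time: float) -> list:
    # Convert the start/end frame times into sample indices and slice the audio data.
    start_idx = int(start_time * sample_rate)
    end_idx = int(end_time * sample_rate)
    return audio_samples[start_idx:end_idx]

audio = [0.0] * (16000 * 10)        # ten seconds of silent audio sampled at 16 kHz
clip = extract_target_audio(audio, 16000, 2.0, 4.5)
print(len(clip))                    # 40000 samples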
Alternatively, on the basis of the embodiment corresponding to fig. 15 described above, in another embodiment of the object information identifying apparatus 30 provided in the embodiment of the present application,
the determining module 320 is specifically configured to perform voiceprint recognition processing on the target audio data to obtain R dubbing entity words, where R is an integer greater than or equal to 1;
for each of the R dubbing entity words, converting the dubbing entity word into a role entity word;
and for the role entity word corresponding to each dubbing entity word in the R dubbing entity words, converting the role entity word into an object entity word, and taking the object entity word as candidate object information in the fourth candidate object information set.
In an embodiment of the present application, an object information identification apparatus is provided. With this apparatus, role entity words can be extracted from the audio data by voiceprint analysis and finally associated with object entity words, and the object entity words can then be used as candidate object information. This ensures the feasibility and operability of the audio data analysis.
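A minimal Python sketch of this chain follows; the voiceprint model is abstracted into a stand-in function, and both lookup tables and all names are assumptions introduced only for illustration.

dubbing_to_role = {"Voice Artist X": "Detective Li"}   # dubbing entity word -> role entity word
role_to_object = {"Detective Li": "Actor B"}           # role entity word -> object entity word

def recognize_voiceprints(audio_clip: list) -> list:
    # Stand-in for voiceprint recognition returning R dubbing entity words.
    return ["Voice Artist X"]

def build_fourth_candidate_set(audio_clip: list) -> set:
    candidates = set()
    for dubbing_word in recognize_voiceprints(audio_clip):
        role_word = dubbing_to_role.get(dubbing_word)
        if role_word in role_to_object:
            candidates.add(role_to_object[role_word])   # resulting object entity word
    return candidates

print(build_fourth_candidate_set([]))                   # {'Actor B'}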
Alternatively, on the basis of the embodiment corresponding to fig. 15 described above, in another embodiment of the object information identifying apparatus 30 provided in the embodiment of the present application,
the obtaining module 310 is further configured to obtain an image detection result through the object detection model based on the target frame image;
the determining module 320 is further configured to determine that N objects to be identified exist in the target frame image if the image detection result includes N detection areas, where each detection area corresponds to one object to be identified, and each detection area corresponds to a set of position parameters.
In an embodiment of the present application, an object information identification apparatus is provided. By adopting the device, before the object recognition is carried out, the detection area for carrying out subsequent recognition is extracted based on the object detection model, so that the object recognition is only carried out on the detection area, the calculated amount is reduced, and the calculation resource is saved.
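The following sketch shows one possible representation of the image detection result, where each detection area corresponds to one object to be identified and carries a set of position parameters; the field layout and the stand-in detector are assumptions made for this example.

from dataclasses import dataclass

@dataclass
class DetectionArea:
    x: int          # left coordinate of the detection area
    y: int          # top coordinate of the detection area
    width: int
    height: int

def detect_objects(frame_image) -> list:
    # Stand-in for the object detection model; it would return N detection areas.
    return [DetectionArea(40, 60, 128, 256), DetectionArea(300, 80, 120, 240)]

areas = detect_objects(None)
n_objects_to_identify = len(areas)   # N objects to be identified in the target frame image
print(n_objects_to_identify, areas[0])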
Alternatively, on the basis of the embodiment corresponding to fig. 15 described above, in another embodiment of the object information identifying apparatus 30 provided in the embodiment of the present application,
the determining module 320 is specifically configured to obtain, based on the N detection areas, a target feature vector corresponding to each detection area in the N detection areas through a feature extraction model, so as to obtain N target feature vectors;
Based on the M candidate object information, obtaining candidate feature vectors corresponding to each candidate object information in the M candidate object information through a fifth mapping relation to obtain M candidate feature vectors, wherein the fifth mapping relation is used for representing the mapping relation between the candidate object information and the feature vectors;
and carrying out similarity calculation on the N target feature vectors and the M candidate feature vectors, and generating an object identification result aiming at the target frame image according to the similarity.
In an embodiment of the present application, an object information identification apparatus is provided. With this apparatus, the feature vector of the object to be identified in the image can be compared against the feature vectors stored in the video information storage module (i.e., the database) so as to match the object information. This increases the feasibility and operability of the scheme.
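A minimal sketch of the matching step is given below, using cosine similarity between the N target feature vectors and the M candidate feature vectors; the vectors, names, and the choice of cosine similarity are assumptions made for illustration.

import math

def cosine_similarity(a: list, b: list) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

candidate_vectors = {                 # fifth mapping relation: candidate object information -> feature vector
    "Actor A": [0.9, 0.1, 0.0],
    "Actor B": [0.1, 0.8, 0.3],
}

def match_objects(target_vectors: list) -> list:
    results = []
    for vec in target_vectors:
        best_name, _ = max(candidate_vectors.items(),
                           key=lambda item: cosine_similarity(vec, item[1]))
        results.append(best_name)     # most similar candidate object information
    return results

print(match_objects([[0.85, 0.20, 0.05]]))   # ['Actor A']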
Optionally, in another embodiment of the object information identifying apparatus 30 provided in the embodiment of the present application, based on the embodiment corresponding to fig. 15, the object information identifying apparatus 30 further includes a receiving module 340 and a recording module 350;
a receiving module 340, configured to receive an object correction request for an object recognition result sent by a terminal after determining the object recognition result for the target frame image, where the object correction request carries a target video identifier, a target frame time and target object information, and the object recognition result includes object information to be corrected;
The recording module 350 is configured to add the target video identifier, the target frame time and the target object information to a sixth mapping relationship in response to the object correction request, where the sixth mapping relationship is used to represent a mapping relationship among the video identifier, the frame time and the object information;
the recording module 350 is further configured to update the object information to be corrected stored in the database to the target object information in response to the correction instruction for the target object information.
In an embodiment of the present application, an object information identification apparatus is provided. With this apparatus, background staff can audit the object information reported by users, and object information that is reported in error is not applied, which improves the accuracy of object information correction.
Alternatively, on the basis of the embodiment corresponding to fig. 15 described above, in another embodiment of the object information identifying apparatus 30 provided in the embodiment of the present application,
the receiving module 340 is configured to receive an object correction request sent by a terminal, where the object correction request carries a target video identifier, a target frame time, target object information, and object information to be corrected;
the recording module 350 is further configured to respond to the object correction request, and update the object information to be corrected stored in the database to the target object information if the cumulative number of the object correction requests is greater than or equal to the number threshold.
In an embodiment of the present application, an object information identification apparatus is provided. With this apparatus, correction results reported by a large number of users can be aggregated to correct the object information. Automatic correction of the object information can thus be achieved while maintaining a certain level of accuracy, saving the time and labor cost of manual auditing.
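For illustration only, the threshold-based automatic correction can be sketched as follows; the counter key layout, the threshold value, and the in-memory database are assumptions introduced for this example.

from collections import Counter

correction_counts = Counter()
database = {("video_001", 12.00): "Actor A"}   # currently stored object information to be corrected
COUNT_THRESHOLD = 50

def handle_correction_request(video_id: str, frame_time: float,
                              target_info: str, wrong_info: str) -> None:
    key = (video_id, frame_time, wrong_info, target_info)
    correction_counts[key] += 1
    # The database is updated only when enough users report the same correction,
    # which maintains accuracy while avoiding manual auditing.
    if correction_counts[key] >= COUNT_THRESHOLD:
        database[(video_id, frame_time)] = target_info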
Referring to fig. 16, fig. 16 is a schematic diagram illustrating an embodiment of an object information identifying apparatus according to an embodiment of the present application, and an object information identifying apparatus 40 includes:
the playing module 410 is configured to play a target video on a video playing page, where the video playing page displays an object identification control;
the sending module 420 is configured to send an object identification request to the server in response to a touch operation for the object identification control, where the object identification request carries a target video identifier of a target video and a target frame time of a target frame image;
a receiving module 430, configured to receive an object recognition result sent by the server and specific to the target frame image, where the object recognition result is determined after the server responds to the object recognition request, and the object recognition result is obtained by using the methods provided in the above aspects;
And the display module 440 is used for displaying the object recognition result on the video playing page.
In an embodiment of the present application, an object information identification apparatus is provided. With this apparatus, the object recognition function has higher feasibility and application value while the object recognition speed and accuracy are improved.
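The terminal-side interaction can be sketched as follows; the payload field names, the JSON encoding, and the display format are assumptions for this example rather than a definitive protocol.

import json

def build_object_recognition_request(video_id: str, frame_time: float) -> str:
    # Built when the user touches the object identification control on the playing page.
    return json.dumps({"video_id": video_id, "frame_time": frame_time})

def display_recognition_result(result: dict) -> None:
    # The terminal overlays the returned object information on the video playing page.
    for obj in result.get("objects", []):
        print(f"{obj['name']} at {obj['box']}")

payload = build_object_recognition_request("video_001", 12.00)
display_recognition_result({"objects": [{"name": "Actor A", "box": [40, 60, 128, 256]}]})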
Alternatively, on the basis of the embodiment corresponding to fig. 16 described above, in another embodiment of the object information identifying apparatus 40 provided in the embodiment of the present application,
the display module 440 is further configured to display an error correction control after displaying the object identification result on the video playing page;
the display module 440 is further configured to display an object correction area in response to a touch operation for the error correction control;
a display module 440 for displaying the target object information and the object information to be corrected in response to a text input operation for the object correction area;
the sending module 420 is further configured to send an object correction request to the server in response to the object correction instruction, where the object correction request carries the target video identifier, the target frame time and the target object information.
In an embodiment of the present application, an object information identification apparatus is provided. With the device, if the user considers the object recognition result to be wrong, the user can actively perform error correction processing on the object recognition result. Thereby, a viable implementation is provided for the solution.
Alternatively, on the basis of the embodiment corresponding to fig. 16 described above, in another embodiment of the object information identifying apparatus 40 provided in the embodiment of the present application,
the display module 440 is further configured to display an error correction control after displaying the object identification result on the video playing page;
the display module 440 is further configured to display an object correction area for the object information to be corrected in response to a touch operation for the error correction control;
a display module 440 for displaying the target object information and the object information to be corrected in response to a text input operation for the input area;
the sending module 420 is further configured to send an object correction request to the server in response to the object correction instruction, where the object correction request carries the target video identifier, the target frame time, the target object information, and the object information to be corrected.
In an embodiment of the present application, an object information identification apparatus is provided. With the above apparatus, the user can select the object information to be corrected and then perform error correction processing on it. Thereby, a viable implementation is provided for the solution.
Fig. 17 is a schematic diagram of a server structure provided in an embodiment of the present application, where the server 500 may vary considerably in configuration or performance, and may include one or more central processing units (central processing units, CPU) 522 (e.g., one or more processors) and memory 532, one or more storage media 530 (e.g., one or more mass storage devices) storing applications 542 or data 544. Wherein memory 532 and storage medium 530 may be transitory or persistent. The program stored in the storage medium 530 may include one or more modules (not shown), each of which may include a series of instruction operations on a server. Still further, the central processor 522 may be configured to communicate with a storage medium 530 and execute a series of instruction operations in the storage medium 530 on the server 500.
The server 500 may also include one or more power supplies 526, one or more wired or wireless network interfaces 550, one or more input/output interfaces 558, and/or one or more operating systems 541, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, etc.
The steps performed by the server in the above embodiments may be based on the server structure shown in fig. 17.
Fig. 18 is a schematic diagram of a terminal structure provided in the embodiment of the present application. As shown in fig. 18, for convenience of explanation, only the portion related to the embodiment of the present application is shown; for specific technical details that are not disclosed, please refer to the method portion of the embodiment of the present application. In the embodiment of the present application, a smart phone is taken as an example of the terminal for description:
fig. 18 is a block diagram illustrating a part of a structure of a smart phone related to a terminal provided in an embodiment of the present application. Referring to fig. 18, the smart phone includes: radio Frequency (RF) circuitry 610, memory 620, input unit 630, display unit 640, sensor 650, audio circuitry 660, wireless fidelity (wireless fidelity, wiFi) module 670, processor 680, and power supply 690. Those skilled in the art will appreciate that the smartphone structure shown in fig. 18 is not limiting of the smartphone and may include more or fewer components than shown, or may combine certain components, or a different arrangement of components.
The following describes each component of the smart phone in detail with reference to fig. 18:
The RF circuit 610 may be configured to receive and transmit signals during information transmission and reception or during a call, and in particular, to receive downlink information from a base station and deliver it to the processor 680 for processing; in addition, uplink data is transmitted to the base station. Typically, the RF circuitry 610 includes, but is not limited to, an antenna, at least one amplifier, a transceiver, a coupler, a low noise amplifier (low noise amplifier, LNA), a duplexer, and the like. In addition, the RF circuitry 610 may also communicate with networks and other devices via wireless communications. The wireless communications may use any communication standard or protocol, including, but not limited to, global system for mobile communications (global system of mobile communication, GSM), general packet radio service (general packet radio service, GPRS), code division multiple access (code division multiple access, CDMA), wideband code division multiple access (wideband code division multiple access, WCDMA), long term evolution (long term evolution, LTE), email, short message service (short messaging service, SMS), and the like.
The memory 620 may be used to store software programs and modules, and the processor 680 may perform various functional applications and data processing of the smartphone by executing the software programs and modules stored in the memory 620. The memory 620 may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, application programs required for at least one function (such as a sound playing function, an image playing function, etc.), and the like; the storage data area may store data (such as audio data, phonebooks, etc.) created according to the use of the smart phone, etc. In addition, memory 620 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other volatile solid-state storage device.
The input unit 630 may be used to receive input numeric or character information and to generate key signal inputs related to user settings and function control of the smart phone. In particular, the input unit 630 may include a touch panel 631 and other input devices 632. The touch panel 631, also referred to as a touch screen, may collect touch operations thereon or nearby by a user (e.g., operations of the user on or near the touch panel 631 using any suitable object or accessory such as a finger or a stylus), and drive the corresponding connection device according to a preset program. Optionally, the touch panel 631 may include two parts: a touch detection device and a touch controller. The touch detection device detects the user's touch position, detects the signal brought by the touch operation, and transmits the signal to the touch controller; the touch controller receives the touch information from the touch detection device, converts it into touch point coordinates, and then sends them to the processor 680, and can receive commands from the processor 680 and execute them. In addition, the touch panel 631 may be implemented in various types such as resistive, capacitive, infrared, and surface acoustic wave. The input unit 630 may include other input devices 632 in addition to the touch panel 631. In particular, other input devices 632 may include, but are not limited to, one or more of a physical keyboard, function keys (e.g., volume control keys, switch keys, etc.), a mouse, a joystick, and the like.
The display unit 640 may be used to display information input by the user or information provided to the user and various menus of the smart phone. The display unit 640 may include a display panel 641, and optionally, the display panel 641 may be configured in the form of a liquid crystal display (liquid crystal display, LCD), an organic light-emitting diode (OLED), or the like. Further, the touch panel 631 may cover the display panel 641, and when the touch panel 631 detects a touch operation thereon or nearby, the touch operation is transmitted to the processor 680 to determine the type of the touch event, and then the processor 680 provides a corresponding visual output on the display panel 641 according to the type of the touch event. Although in fig. 18 the touch panel 631 and the display panel 641 are two independent components used to implement the input and output functions of the smart phone, in some embodiments the touch panel 631 and the display panel 641 may be integrated to implement the input and output functions of the smart phone.
The smartphone may also include at least one sensor 650, such as a light sensor, a motion sensor, and other sensors. Specifically, the light sensor may include an ambient light sensor and a proximity sensor; the ambient light sensor may adjust the brightness of the display panel 641 according to the brightness of ambient light, and the proximity sensor may turn off the display panel 641 and/or the backlight when the smartphone is moved to the ear. As one type of motion sensor, the accelerometer sensor can detect the magnitude of acceleration in all directions (generally three axes), can detect the magnitude and direction of gravity when stationary, and can be used in applications for recognizing the attitude of the smart phone (such as horizontal and vertical screen switching, related games, and magnetometer attitude calibration), vibration-recognition related functions (such as a pedometer and tapping), and the like; other sensors such as a gyroscope, a barometer, a hygrometer, a thermometer, and an infrared sensor that may also be configured on the smart phone are not described in detail herein.
The audio circuit 660, speaker 661, and microphone 662 may provide an audio interface between the user and the smart phone. The audio circuit 660 may transmit the electrical signal converted from received audio data to the speaker 661, which converts it into a sound signal for output; on the other hand, the microphone 662 converts the collected sound signal into an electrical signal, which is received by the audio circuit 660 and converted into audio data; the audio data is then processed by the audio data output processor 680 and sent, for example, to another smart phone via the RF circuit 610, or output to the memory 620 for further processing.
WiFi belongs to a short-distance wireless transmission technology, and the smart phone can help the user to send and receive emails, browse webpages, access streaming media, and the like through the WiFi module 670, which provides wireless broadband Internet access for the user. Although fig. 18 shows the WiFi module 670, it is understood that it is not an essential component of the smart phone and may be omitted as required without changing the essence of the invention.
Processor 680 is a control center of the smartphone, connects various parts of the entire smartphone with various interfaces and lines, performs various functions of the smartphone and processes data by running or executing software programs and/or modules stored in memory 620, and invoking data stored in memory 620. Optionally, processor 680 may include one or more processing units; alternatively, processor 680 may integrate an application processor that primarily handles operating systems, user interfaces, applications, etc., with a modem processor that primarily handles wireless communications. It will be appreciated that the modem processor described above may not be integrated into the processor 680.
The smartphone also includes a power supply 690 (e.g., a battery) for powering the various components, optionally logically connected to the processor 680 through a power management system, so as to perform charge, discharge, and power consumption management functions via the power management system.
Although not shown, the smart phone may further include a camera, a bluetooth module, etc., which will not be described herein.
The steps performed by the terminal in the above embodiments may be based on the terminal structure shown in fig. 18.
Also provided in embodiments of the present application is a computer-readable storage medium having a computer program stored therein, which when run on a computer, causes the computer to perform the methods as described in the foregoing embodiments.
Also provided in embodiments of the present application is a computer program product comprising a program which, when run on a computer, causes the computer to perform the methods described in the foregoing embodiments.
It will be appreciated that in the specific embodiments of the present application, related data such as user information, facial features, etc. are referred to, and when the embodiments of the present application are applied to specific products or technologies, user permissions or consents need to be obtained, and the collection, use, and processing of related data need to comply with relevant laws and regulations and standards of relevant countries and regions.
It will be clear to those skilled in the art that, for convenience and brevity of description, specific working procedures of the above-described systems, apparatuses and units may refer to corresponding procedures in the foregoing method embodiments, which are not repeated herein.
In the several embodiments provided in this application, it should be understood that the disclosed systems, apparatuses, and methods may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative, e.g., the division of the units is merely a logical function division, and there may be additional divisions when actually implemented, e.g., multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or units, which may be in electrical, mechanical or other form.
The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in each embodiment of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
The integrated units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application may be embodied in essence or a part contributing to the prior art or all or part of the technical solution in the form of a software product stored in a storage medium, including several instructions to cause a computer device (which may be a personal computer, a server, or a network device, etc.) to perform all or part of the steps of the methods described in the embodiments of the present application. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a read-only memory (ROM), a random access memory (random access memory, RAM), a magnetic disk, or an optical disk, or other various media capable of storing program codes.
The above embodiments are merely for illustrating the technical solution of the present application, and not for limiting the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the corresponding technical solutions.

Claims (20)

1. A method for identifying object information, comprising:
acquiring a target video segment corresponding to a target frame image, wherein the target video segment comprises the target frame image;
acquiring associated data for the target video segment, wherein the associated data comprises at least one of target participant information, target interaction information, target subtitle information and target audio data;
determining a candidate object information set according to the associated data, wherein the candidate object information set comprises M candidate object information, and M is an integer greater than or equal to 1;
if N objects to be identified exist in the target frame image, determining an object identification result aiming at the target frame image according to the candidate object information set, wherein N is an integer greater than or equal to 1.
2. The identification method according to claim 1, wherein before the acquiring the target video segment corresponding to the target frame image, the method further comprises:
acquiring a target video, wherein the target video comprises T frame images, and T is an integer greater than 1;
determining the target frame image from the target video;
the obtaining the target video segment corresponding to the target frame image includes:
if the T is greater than or equal to a frame number threshold, extracting multi-frame continuous images containing the target frame images from the target video to serve as the target video fragments;
and if the T is smaller than the frame number threshold, taking the target video as the target video segment corresponding to the target frame image.
3. The identification method according to claim 2, wherein said determining the target frame image from the target video comprises:
receiving an object identification request sent by a terminal, wherein the object identification request carries a target video identifier of the target video and a target frame time of the target frame image;
responding to the object identification request, and acquiring the target frame image through a first mapping relation based on the target video identification and the target frame time, wherein the first mapping relation is used for representing the mapping relation among the video identification, the frame time and the image data;
The method further comprises the steps of:
and sending the object identification result to the terminal so that the terminal displays the object identification result.
4. The recognition method according to claim 1, wherein the associated data includes target participant information, target interaction information, target subtitle information, and target audio data;
the determining a candidate object information set according to the association data comprises the following steps:
determining a first candidate object information set according to the target participant information;
determining a second candidate object information set according to the target interaction information;
determining a third candidate object information set according to the target subtitle information;
determining a fourth candidate object information set according to the target audio data;
and performing de-duplication processing on the first candidate object information set, the second candidate object information set, the third candidate object information set and the fourth candidate object information set to obtain the candidate object information set.
5. The method of any one of claims 1 to 4, wherein the acquiring the associated data for the target video segment comprises:
Acquiring a target video identifier corresponding to the target video segment;
acquiring the target participant information through a second mapping relation based on the target video identification, wherein the second mapping relation is used for representing the mapping relation between the video identification and the participant information;
the determining a candidate object information set according to the association data comprises the following steps:
and generating a first candidate object information set according to the target participant information, wherein the first candidate object information set is contained in the candidate object information set.
6. The method of any one of claims 1 to 4, wherein the acquiring the associated data for the target video segment comprises:
acquiring a target video identifier corresponding to the target video segment;
acquiring a frame time corresponding to each frame of image in the target video segment;
acquiring the target interaction information through a third mapping relation based on the target video identification and the frame time corresponding to each frame of image in the target video fragment, wherein the third mapping relation is used for representing the mapping relation among the video identification, the frame time and the interaction information;
The determining a candidate object information set according to the association data comprises the following steps:
and generating a second candidate object information set according to the target interaction information, wherein the second candidate object information set is contained in the candidate object information set.
7. The method of claim 6, wherein generating a second set of candidate object information from the target interaction information comprises:
word segmentation processing is carried out on the target interaction information to obtain a word set;
obtaining P entity words from the word set, wherein P is an integer greater than or equal to 1;
for each entity word in the P entity words, if the entity word belongs to the object entity word, using the entity word as candidate object information in the second candidate object information set;
and for each entity word in the P entity words, if the entity word belongs to a role entity word, converting the role entity word into an object entity word, and taking the object entity word as candidate object information in the second candidate object information set.
8. The method of any one of claims 1 to 4, wherein the acquiring the associated data for the target video segment comprises:
Acquiring a target video identifier corresponding to the target video segment;
acquiring a frame time corresponding to each frame of image in the target video segment;
acquiring the target subtitle information through a fourth mapping relation based on the target video identifier and the frame time corresponding to each frame of image in the target video segment, wherein the fourth mapping relation is used for representing the mapping relation among the video identifier, the frame time and the subtitle information;
the determining a candidate object information set according to the association data comprises the following steps:
and generating a third candidate object information set according to the target subtitle information, wherein the third candidate object information set is contained in the candidate object information set.
9. The identification method according to claim 8, wherein the generating a third candidate object information set from the target subtitle information includes:
word segmentation processing is carried out on the target subtitle information to obtain a word set;
acquiring Q role entity words from the word set, wherein Q is an integer greater than or equal to 1;
and converting the role entity words into object entity words aiming at each role entity word in the Q role entity words, and taking the object entity words as candidate object information in the third candidate object information set.
10. The method of any one of claims 1 to 4, wherein the acquiring the associated data for the target video segment comprises:
acquiring a start frame time and an end frame time of the target video segment;
acquiring target audio data according to the starting frame time and the ending frame time;
the determining a candidate object information set according to the association data comprises the following steps:
and generating a fourth candidate object information set according to the target audio data, wherein the fourth candidate object information set is contained in the candidate object information set.
11. The identification method according to claim 10, wherein the generating a fourth candidate object information set from the target audio data comprises:
performing voiceprint recognition processing on the target audio data to obtain R dubbing entity words, wherein R is an integer greater than or equal to 1;
converting the dubbing entity words into role entity words aiming at each dubbing entity word in the R dubbing entity words;
and aiming at the role entity word corresponding to each dubbing entity word in the R dubbing entity words, converting the role entity word into an object entity word, and taking the object entity word as candidate object information in the fourth candidate object information set.
12. The identification method of claim 1, wherein the method further comprises:
acquiring an image detection result through an object detection model based on the target frame image;
if the image detection result includes N detection areas, determining that the N objects to be identified exist in the target frame image, where each detection area corresponds to one object to be identified, and each detection area corresponds to a set of position parameters.
13. The method according to claim 12, wherein determining an object recognition result for the target frame image from the candidate object information set includes:
based on the N detection areas, obtaining target feature vectors corresponding to each detection area in the N detection areas through a feature extraction model so as to obtain N target feature vectors;
based on the M candidate object information, obtaining candidate feature vectors corresponding to each candidate object information in the M candidate object information through a fifth mapping relation to obtain M candidate feature vectors, wherein the fifth mapping relation is used for representing the mapping relation between the candidate object information and the feature vectors;
And carrying out similarity calculation on the N target feature vectors and the M candidate feature vectors, and generating the object recognition result aiming at the target frame image according to the similarity.
14. The identification method of claim 1, wherein the method further comprises:
receiving an object correction request sent by a terminal and aiming at the object recognition result, wherein the object correction request carries a target video identifier, a target frame moment and target object information, and the object recognition result comprises the object information to be corrected;
responding to the object correction request, adding the target video identifier, the target frame time and the target object information into a sixth mapping relation, wherein the sixth mapping relation is used for representing the mapping relation among the video identifier, the frame time and the object information;
and responding to a correction instruction aiming at the target object information, and updating the object information to be corrected stored in a database into the target object information.
Alternatively, the method further comprises:
receiving an object correction request sent by a terminal, wherein the object correction request carries a target video identifier, a target frame time, target object information and object information to be corrected;
And responding to the object correction request, and if the accumulated times of the object correction request are greater than or equal to a time threshold, updating the object information to be corrected stored in a database into the target object information.
15. A method for identifying object information, comprising:
playing a target video on a video playing page, wherein the video playing page is displayed with an object identification control;
responding to touch operation aiming at the object identification control, and sending an object identification request to a server, wherein the object identification request carries a target video identification of the target video and a target frame time of a target frame image;
receiving an object recognition result sent by the server and aiming at the target frame image, wherein the object recognition result is determined after the server responds to the object recognition request, and the object recognition result is obtained by adopting the method of any one of claims 1 to 14;
and displaying the object identification result on the video playing page.
16. An object information identifying apparatus, comprising:
the acquisition module is used for acquiring a target video clip corresponding to a target frame image, wherein the target video clip comprises the target frame image;
The acquisition module is further configured to acquire association data for the target video segment, where the association data includes at least one of target participant information, target interaction information, target subtitle information, and target audio data;
a determining module, configured to determine a candidate object information set according to the association data, where the candidate object information set includes M candidate object information, where M is an integer greater than or equal to 1;
the determining module is further configured to determine, according to the candidate object information set, an object recognition result for the target frame image if N objects to be recognized exist in the target frame image, where N is an integer greater than or equal to 1.
17. An object information identifying apparatus, comprising:
the playing module is used for playing the target video on a video playing page, wherein the video playing page is displayed with an object identification control;
the sending module is used for responding to the touch operation of the object identification control and sending an object identification request to the server, wherein the object identification request carries a target video identifier of the target video and a target frame time of a target frame image;
The receiving module is used for receiving an object recognition result sent by the server and aiming at the target frame image, wherein the object recognition result is determined after the server responds to the object recognition request, and the object recognition result is obtained by adopting the method of any one of claims 1 to 14;
and the display module is used for displaying the object identification result on the video playing page.
18. A computer device, comprising: a memory, a processor, and a bus system;
wherein the memory is used for storing programs;
the processor being adapted to execute a program in the memory, the processor being adapted to perform the method of any one of claims 1 to 14 or to perform the method of claim 15 according to instructions in the program code;
the bus system is used for connecting the memory and the processor so as to enable the memory and the processor to communicate.
19. A computer readable storage medium comprising instructions which, when run on a computer, cause the computer to perform the method of any one of claims 1 to 14, or to perform the method of claim 15.
20. A computer program product comprising a computer program and instructions which, when executed by a processor, implement the method of any one of claims 1 to 14, or the method of claim 15.
CN202210101999.3A 2022-01-27 2022-01-27 Object information identification method, related device, equipment and storage medium Pending CN116563905A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210101999.3A CN116563905A (en) 2022-01-27 2022-01-27 Object information identification method, related device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210101999.3A CN116563905A (en) 2022-01-27 2022-01-27 Object information identification method, related device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN116563905A true CN116563905A (en) 2023-08-08

Family

ID=87492001

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210101999.3A Pending CN116563905A (en) 2022-01-27 2022-01-27 Object information identification method, related device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN116563905A (en)

Similar Documents

Publication Publication Date Title
CN110381371B (en) Video editing method and electronic equipment
US20230012732A1 (en) Video data processing method and apparatus, device, and medium
CN110381388B (en) Subtitle generating method and device based on artificial intelligence
US11409817B2 (en) Display apparatus and method of controlling the same
US8064641B2 (en) System and method for identifying objects in video
CN110519636B (en) Voice information playing method and device, computer equipment and storage medium
CN105874454B (en) Methods, systems, and media for generating search results based on contextual information
CN112040263A (en) Video processing method, video playing method, video processing device, video playing device, storage medium and equipment
CN111601115B (en) Video detection method, related device, equipment and storage medium
KR101846756B1 (en) Tv program identification method, apparatus, terminal, server and system
CN107748750A (en) Similar video lookup method, device, equipment and storage medium
CN110209810B (en) Similar text recognition method and device
CN112307240B (en) Page display method and device, storage medium and electronic equipment
CN111177180A (en) Data query method and device and electronic equipment
JP2021034003A (en) Human object recognition method, apparatus, electronic device, storage medium, and program
CN112203115B (en) Video identification method and related device
CN113596601A (en) Video picture positioning method, related device, equipment and storage medium
CN107809654A (en) System for TV set and TV set control method
CN112052784B (en) Method, device, equipment and computer readable storage medium for searching articles
CN113822427A (en) Model training method, image matching device and storage medium
CN112995757B (en) Video clipping method and device
CN111666498B (en) Friend recommendation method based on interaction information, related device and storage medium
CN111274449B (en) Video playing method, device, electronic equipment and storage medium
CN115526772B (en) Video processing method, device, equipment and storage medium
CN114827702B (en) Video pushing method, video playing method, device, equipment and medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
REG Reference to a national code
Ref country code: HK
Ref legal event code: DE
Ref document number: 40091090
Country of ref document: HK