CN110532433A - Entity recognition method, device, electronic equipment and medium for video scenes


Info

Publication number
CN110532433A
CN110532433A (application CN201910828965.2A)
Authority
CN
China
Prior art keywords
target
entity
entity recognition
text
algorithm
Prior art date
Legal status
Granted
Application number
CN201910828965.2A
Other languages
Chinese (zh)
Other versions
CN110532433B (en)
Inventor
王述
任可欣
冯知凡
张扬
朱勇
Current Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority claimed from CN201910828965.2A
Publication of CN110532433A
Application granted
Publication of CN110532433B


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/783Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G06F16/7837Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using objects detected or recognised in the video content

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Library & Information Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Image Analysis (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

This application discloses an entity recognition method, apparatus, electronic device, and medium for video scenes, relating to the field of artificial intelligence. A specific implementation is as follows: obtain a target video to be processed and at least one target modality; extract at least one target modal feature of the target video; according to the at least one target modality, determine a target entity recognition algorithm to be used from at least two candidate entity recognition algorithms provided by a server, where the at least two candidate entity recognition algorithms are deployed in the server; and invoke the target entity recognition algorithm to perform entity recognition on the at least one target modal feature to obtain the target entities included in the target video. The application recognizes entities of different modalities in the target video, the recognition results are highly accurate, different business requirements can be met, and the approach is highly versatile.

Description

Entity identification method and device for video scene, electronic equipment and medium
Technical Field
The embodiments of the application relate to computer technology, in particular to artificial intelligence technology, and specifically to an entity identification method, an entity identification apparatus, electronic equipment, and a medium for video scenes.
Background
With the development of information technology and the rapid rise of video apps, video is becoming the most important mode of information transmission and is widely used in interpersonal communication, social life, and industrial production. Massive volumes of video content cannot be processed manually, so there is an urgent need for computers to understand video content intelligently and, on that basis, to classify and label videos automatically.
The traditional approach performs entity recognition on a target video in a single-modal manner, for example using only video text features or only video image features. However, the entities identified this way cannot fully cover the entities in the target video, the accuracy is low, and because the entity recognition technique is fixed, different business requirements cannot be met.
Disclosure of Invention
The embodiment of the application provides an entity identification method, an entity identification device, electronic equipment and a medium for a video scene, and can solve the problems that the entity identification accuracy is low and different service requirements cannot be met in the conventional method.
In a first aspect, an embodiment of the present application provides an entity identification method for a video scene, where the method includes:
acquiring a target video to be processed and at least one target modality;
extracting at least one target modal feature of the target video;
determining a target entity recognition algorithm to be used from at least two candidate entity recognition algorithms provided by a server according to the at least one target modality; wherein the at least two candidate entity identification algorithms are both deployed in the server;
and calling the target entity identification algorithm to perform entity identification on the at least one target modal characteristic so as to obtain a target entity included in the target video.
One embodiment in the above application has the following advantages or benefits: a corresponding target entity recognition algorithm is invoked from pre-configured candidate entity recognition algorithms according to the target modal features of the target video, and entity recognition is performed on those features, so that entities of different modalities of the target video are recognized; the recognition results are highly accurate, different business requirements can be met, and the approach is highly versatile.
Optionally, determining a target entity recognition algorithm to be used from at least two candidate entity recognition algorithms provided by the server according to the at least one target modality, includes:
if the target modality comprises a text modality, determining a target text entity recognition algorithm to be used from a text entity recognition algorithm class of at least two candidate entity recognition algorithms provided by a server;
if the target modality comprises a visual modality, determining a target visual entity recognition algorithm to be used from a visual entity recognition algorithm class of at least two candidate entity recognition algorithms provided by a server;
and if the target modality comprises an audio modality, determining a target audio entity recognition algorithm to be used from the audio entity recognition algorithm class of at least two candidate entity recognition algorithms provided by the server.
One embodiment in the above application has the following advantages or benefits: corresponding entity recognition algorithm classes are respectively provided for a text mode, a visual mode and an audio mode, so that a user can call the entity recognition algorithms according to needs, different service requirements can be met, and the universality of the entity recognition algorithms is improved.
Optionally, if the target modality includes a text modality, determining a target text entity recognition algorithm to be used from a text entity recognition algorithm class of at least two candidate entity recognition algorithms provided by a server, including:
if the target modality comprises a text modality, calling a video field classification algorithm provided by a server to determine a target field to which the target video belongs; wherein the video domain classification algorithm is deployed in the server;
if the target field belongs to a preset field set, determining a knowledge importance entity recognition algorithm as a target text entity recognition algorithm to be used from text entity recognition algorithm classes of at least two candidate entity recognition algorithms provided by a server;
and if the target field does not belong to the preset field set, determining a text entity algorithm based on the correlation or a text entity algorithm based on the sequence as the target text entity recognition algorithm to be used from text entity recognition algorithm classes of at least two candidate entity recognition algorithms provided by the server.
One embodiment in the above application has the following advantages or benefits: by providing a knowledge importance entity recognition algorithm for a target video belonging to a preset field set and providing a correlation-based text entity algorithm or a sequence-based text entity algorithm for other videos, the accuracy of entity recognition in different fields is further improved.
Optionally, if the target text entity recognition algorithm is a knowledge importance entity recognition algorithm, invoking the target entity recognition algorithm to perform entity recognition on the at least one target modal feature, including:
calling a named entity recognition algorithm provided by a server to recognize candidate entities included in the text modal characteristics; wherein the named entity recognition algorithm is deployed in the server;
determining a target entity category associated with a target field according to a mapping relation between the preset field and the preset entity category in the knowledge importance entity recognition algorithm;
and taking the candidate entity belonging to the target entity category as the target entity in the target video.
One embodiment in the above application has the following advantages or benefits: the mapping relation between the preset field and the preset entity category is configured in the knowledge importance entity recognition algorithm, so that fine-grained entity recognition is provided for recognizing different types of entities in target videos belonging to different preset fields.
Optionally, after taking the candidate entity belonging to the target entity category as the target entity in the target video, the method further includes:
and calling an entity chain finger algorithm provided by the server to establish a link between the target entity and the entity in the knowledge graph.
One embodiment in the above application has the following advantages or benefits: by establishing the link relation between the target entity and the knowledge graph, the target entity can be deeply understood and expanded through the knowledge graph.
Optionally, determining a correlation-based text entity algorithm or a sequence-based text entity algorithm from among text entity recognition algorithm classes of at least two candidate entity recognition algorithms provided by the server as a target text entity recognition algorithm to be used comprises:
acquiring required target entity precision;
if the target entity precision is the first precision, determining a text entity algorithm based on the correlation as a target text entity recognition algorithm to be used from text entity recognition algorithm classes of at least two candidate entity recognition algorithms provided by a server;
if the target entity precision is the second precision, determining a sequence-based text entity algorithm as a target text entity recognition algorithm to be used from text entity recognition algorithm classes of at least two candidate entity recognition algorithms provided by a server; wherein the first precision is lower than the second precision.
One embodiment in the above application has the following advantages or benefits: according to users' different precision requirements, correspondingly different entity recognition algorithms are adopted, balancing processing efficiency against entity quality and avoiding resource waste.
Optionally, if the target modality includes a text modality, before invoking a text entity recognition algorithm to perform entity recognition on the at least one target modality feature, the method further includes:
calling a text quality model provided by a server to determine the text quality of the text modal characteristics;
and screening the text modal characteristics according to a determination result.
One embodiment in the above application has the following advantages or benefits: by screening the text modal characteristics in advance, the entity recognition efficiency is improved.
Optionally, if the target modality includes a visual modality, determining a target visual entity recognition algorithm to be used from a visual entity recognition algorithm class of at least two candidate entity recognition algorithms provided by a server, including:
if the target modality comprises a visual modality, acquiring a target visual category concerned by the user;
if the target vision category is a human face, determining a human face entity recognition algorithm as a target vision entity recognition algorithm to be used from a vision entity recognition algorithm class of at least two candidate entity recognition algorithms provided by a server;
if the target vision category is an object, determining a vision object recognition algorithm as a target vision entity recognition algorithm to be used from a vision entity recognition algorithm class of at least two candidate entity recognition algorithms provided by a server;
and if the target vision category is the vision fingerprint, determining the vision fingerprint identification algorithm as the target vision entity identification algorithm to be used from the vision entity identification algorithm class of at least two candidate entity identification algorithms provided by the server.
One embodiment in the above application has the following advantages or benefits: by providing a face entity recognition algorithm, a visual object recognition algorithm, or a visual fingerprint recognition algorithm for the visual modality, entities of different visual categories can be recognized, enriching the categories of visual entities.
Optionally, after invoking the target entity recognition algorithm to perform entity recognition on the at least one target modal feature, the method further includes:
and fusing the identified information of at least two entities.
One embodiment in the above application has the following advantages or benefits: and the polymerization degree of the entity is improved through the entity information fusion operation.
Optionally, after invoking the target entity recognition algorithm to perform entity recognition on the at least one target modal feature, the method further includes:
and adjusting the confidence level of the entity according to the modal source information of the entity and/or the correlation of the entity in the knowledge graph.
One embodiment in the above application has the following advantages or benefits: by combining the multi-modal information to perform multi-modal fusion and confidence verification, the annotated entity result has higher accuracy.
In a second aspect, an embodiment of the present application further discloses an apparatus for entity identification of a video scene, where the apparatus includes:
the target modality acquisition module is used for acquiring a target video to be processed and at least one target modality;
the target modal feature extraction module is used for extracting at least one target modal feature of the target video;
a target entity identification algorithm determining module, configured to determine, according to the at least one target modality, a target entity identification algorithm to be used from at least two candidate entity identification algorithms provided by the server; wherein the at least two candidate entity identification algorithms are both deployed in the server;
and the entity identification module is used for calling the target entity identification algorithm to perform entity identification on the at least one target modal characteristic so as to obtain a target entity included in the target video.
In a third aspect, an embodiment of the present application further discloses an electronic device, including:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to cause the at least one processor to perform the method of any of the embodiments of the present application.
In a fourth aspect, embodiments of the present application further disclose a non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform the method of any of the embodiments of the present application.
Other effects of the above-described alternative will be described below with reference to specific embodiments.
Drawings
The drawings are included to provide a better understanding of the present solution and are not intended to limit the present application. Wherein:
fig. 1 is a flowchart illustrating an entity identification method for a video scene according to a first embodiment of the present application;
fig. 2 is a flowchart illustrating an entity identification method for a video scene according to a second embodiment of the present application;
fig. 3 is a schematic structural diagram of an entity identification apparatus for a video scene according to a third embodiment of the present application;
fig. 4 is a block diagram of an electronic device for implementing an entity identification method for a video scene according to an embodiment of the present application.
Detailed Description
The following description of the exemplary embodiments of the present application, taken in conjunction with the accompanying drawings, includes various details of the embodiments of the application for the understanding of the same, which are to be considered exemplary only. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present application. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
Example one
Fig. 1 is a schematic flowchart of an entity identification method for a video scene according to an embodiment of the present disclosure. The embodiment is suitable for the case of identifying the entity included in the video to be processed, and the method can be executed by the entity identification device of the video scene provided in the third embodiment of the present application, and the device can be implemented in a software and/or hardware manner. As shown in fig. 1, the method may include:
s101, obtaining a target video to be processed and at least one target mode.
The target video format includes, but is not limited to, AVI, FLV, RMVB, and WMV; the target video is uploaded to the server by the user through the client. The target modality indicates which information of the target video the user wants entity recognition performed on. For example, if the user wants to perform entity recognition on the text information of the target video, the target modality is the text modality; if on the visual information, the target modality is the visual modality; and if on the audio information, the target modality is the audio modality. The user selects target modalities through the client according to different business requirements.
In this embodiment, the target modality includes, but is not limited to, at least one of a text modality, a visual modality, and an audio modality.
By acquiring the target video to be processed and at least one target modality, a data base is laid for subsequently calling a corresponding target entity identification algorithm to perform entity identification on target modality characteristics according to the target modality to obtain a target entity included in the target video.
S102, extracting at least one target modal feature of the target video.
Specifically, if the target modality includes a text modality, extracting text modality features of the target video; if the target modality comprises a visual modality, extracting visual modality characteristics of the target video; and if the target modality comprises an audio modality, extracting the audio modality characteristics of the target video.
Optionally, the extraction of text modal features includes:
1) Subtitle information usually appears within the images of the target video. All frame images or key frame images of the target video are extracted, and Optical Character Recognition (OCR) is used to recognize the subtitle information in the extracted frames as a text modal feature of the target video.
2) The title and description text of the target video summarize the video content in text form, so they are used directly as text modal features of the target video.
3) Audio information usually appears at multiple time nodes of the target video. All audio information of the target video is extracted, Automatic Speech Recognition (ASR) is used to recognize the text corresponding to the audio, and the ASR result is used as a text modal feature of the target video.
4) The author tag text of the target video consists of labels added by the video's author to help users understand it; since it is already in text form, it is used directly as a text modal feature of the target video.
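As an illustration of step 1) above, the following minimal sketch samples frames with OpenCV and runs OCR with pytesseract; the sampling interval and the language setting are assumptions for illustration, not values fixed by this application.

```python
import cv2
import pytesseract

def extract_caption_text(video_path, every_n_frames=30):
    """Sample frames from the video and OCR any on-screen subtitle text."""
    cap = cv2.VideoCapture(video_path)
    texts, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % every_n_frames == 0:
            # lang is an assumption; Chinese+English subtitles are common here
            text = pytesseract.image_to_string(frame, lang="chi_sim+eng").strip()
            if text:
                texts.append(text)
        idx += 1
    cap.release()
    return texts
```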
Optionally, the extracting of the visual modal feature includes: extracting each frame image of the target video as a visual modal characteristic by using a video processing tool; or extracting the key frame image of the target video as the visual modal characteristic according to the actual business requirement.
Optionally, the extraction of audio modal features includes: using an audio processing tool to pick up all audio data of the target video from the start time node to the end time node as the audio modal features; or, according to actual business requirements, picking up the audio data within a target time range of the target video as the audio modal features.
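As a sketch of the audio pickup just described, the following calls the ffmpeg command-line tool through subprocess; the 16 kHz mono WAV output is an assumed format chosen to suit typical ASR input, not something specified by this application.

```python
import subprocess

def pick_up_audio(video_path, wav_path, start=None, duration=None):
    """Extract the audio track as 16 kHz mono WAV, optionally within a time range."""
    cmd = ["ffmpeg", "-y", "-i", video_path]
    if start is not None:
        cmd += ["-ss", str(start)]       # target range start (seconds)
    if duration is not None:
        cmd += ["-t", str(duration)]     # target range length (seconds)
    cmd += ["-vn", "-ac", "1", "-ar", "16000", wav_path]
    subprocess.run(cmd, check=True)
```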
By extracting at least one target modal feature of the target video, a data base is laid for subsequently calling a target entity recognition algorithm to perform entity recognition on the target modal feature.
S103, determining a target entity recognition algorithm to be used from at least two candidate entity recognition algorithms provided by the server according to the at least one target modality; wherein the at least two candidate entity identification algorithms are both deployed in the server.
The server provides at least two candidate entity recognition algorithms to perform entity recognition on different target modal features; for example, the server provides a text entity recognition algorithm class and a visual entity recognition algorithm class, or a text entity recognition algorithm class and an audio entity recognition algorithm class. Each algorithm class includes at least one candidate entity recognition algorithm.
Specifically, according to the target modality, the target entity recognition algorithm is determined in different algorithms of at least two candidate entity recognition algorithms provided by the server.
Optionally, if the target modality includes a text modality, determining a target text entity recognition algorithm to be used from a text entity recognition algorithm class of at least two candidate entity recognition algorithms provided by the server; if the target modality comprises a visual modality, determining a target visual entity recognition algorithm to be used from a visual entity recognition algorithm class of at least two candidate entity recognition algorithms provided by a server; and if the target modality comprises an audio modality, determining a target audio entity recognition algorithm to be used from the audio entity recognition algorithm class of at least two candidate entity recognition algorithms provided by the server.
And 1) if the target modality comprises a text modality, determining a corresponding target text entity recognition algorithm according to the field to which the target video belongs.
Optionally, if the target modality includes a text modality, a video domain classification algorithm provided by the server is invoked to determine the target domain to which the target video belongs, where the video domain classification algorithm is deployed in the server. If the target domain belongs to a preset domain set, a knowledge importance entity recognition algorithm is determined as the target text entity recognition algorithm to be used from the text entity recognition algorithm class of the at least two candidate entity recognition algorithms provided by the server. If the target domain does not belong to the preset domain set, a correlation-based text entity algorithm or a sequence-based text entity algorithm is determined as the target text entity recognition algorithm to be used. The correlation-based text entity algorithm includes at least one of the following: an entity algorithm based on XGBoost (eXtreme Gradient Boosting) classification, an entity algorithm based on word2vec, the TextRank graph walk algorithm, a WordRank algorithm based on term importance, and a ranking algorithm based on tf-idf (term frequency-inverse document frequency). The network structure of the sequence-based text entity algorithm is BiLSTM-CRF (a bidirectional long short-term memory network with a conditional random field).
2) And if the target modality comprises the visual modality, determining a corresponding target visual entity recognition algorithm according to the target visual category concerned by the user.
Optionally, if the target modality includes a visual modality, a target visual category focused by the user is further obtained; if the target vision category is a human face, determining a human face entity recognition algorithm as a target vision entity recognition algorithm to be used from a vision entity recognition algorithm class of at least two candidate entity recognition algorithms provided by a server; if the target vision category is an object, determining a vision object recognition algorithm as a target vision entity recognition algorithm to be used from a vision entity recognition algorithm class of at least two candidate entity recognition algorithms provided by a server; and if the target vision category is the vision fingerprint, determining the vision fingerprint identification algorithm as the target vision entity identification algorithm to be used from the vision entity identification algorithm class of at least two candidate entity identification algorithms provided by the server.
3) And if the target modality comprises an audio modality, determining a corresponding target audio entity recognition algorithm according to the target entity precision of the user.
Optionally, if the target entity precision is a third precision, an acoustic-feature-based audio entity algorithm is determined as the target audio entity recognition algorithm to be used from the audio entity recognition algorithm class of the at least two candidate entity recognition algorithms provided by the server; if the target entity precision is a fourth precision, a voiceprint-based audio entity algorithm is determined as the target audio entity recognition algorithm to be used; wherein the third precision is lower than the fourth precision.
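The branching in the three cases above amounts to a dispatch table. A minimal sketch follows; all algorithm names, the precision labels, and the function signature are hypothetical placeholders, since the application does not fix an API.

```python
PRESET_DOMAINS = {"film", "entertainment", "animation", "game", "music",
                  "car", "dance", "food", "sports", "nature"}

def select_algorithm(modality, domain=None, precision=None, visual_category=None):
    """Return the (hypothetical) name of the candidate algorithm to invoke."""
    if modality == "text":
        if domain in PRESET_DOMAINS:
            return "knowledge_importance"
        # first precision -> faster correlation-based; second -> BiLSTM-CRF sequence model
        return "correlation_based" if precision == "first" else "sequence_bilstm_crf"
    if modality == "visual":
        return {"face": "face_recognition",
                "object": "visual_object_recognition",
                "fingerprint": "visual_fingerprint"}[visual_category]
    if modality == "audio":
        # third precision -> acoustic features; fourth -> voiceprint
        return "acoustic_features" if precision == "third" else "voiceprint"
    raise ValueError(f"unknown modality: {modality}")
```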
The target entity recognition algorithm to be used is determined from the at least two candidate entity recognition algorithms provided by the server according to the at least one target modality, the multi-modal entity recognition requirement of the target video is met, and a foundation is laid for subsequently calling the target entity recognition algorithm to perform entity recognition on at least one target modality feature.
S104, calling the target entity recognition algorithm to perform entity recognition on the at least one target modal characteristic so as to obtain a target entity included in the target video.
The entity recognition is carried out on at least one target modal characteristic by calling a target entity recognition algorithm, the multi-modal entity recognition of the target video is realized, the recognition result accuracy is high, and different service requirements can be met.
According to the technical scheme provided by the embodiment of the invention, the corresponding target entity recognition algorithm is called from the pre-configured candidate entity recognition algorithms according to the target modal characteristics of the target video, and the target modal characteristics are subjected to entity recognition, so that the multi-modal entity recognition of the target video is realized, the recognition result accuracy is high, and different service requirements can be met.
On the basis of the above embodiment, after S104, the method further includes: and fusing the identified information of at least two entities. For example, if at least two entity words are associated with the same entity, the at least two entity words are normalized.
Specifically, the recognized entity words are cross-checked against the knowledge graph stored in the knowledge base. If at least two entity words are determined to be associated with the same entity according to the knowledge graph, the at least two entity words are normalized, and the normalized form follows the popular (hot) entity word in the knowledge graph.
For example, assume that the three current entity words "Call A", "Call B", and "Call C" are three different names of the same actor, and the knowledge graph confirms that all three refer to that actor. The three words are then normalized: if "Call A" is the actor's popular name, the normalized entity word is set to "Call A"; similarly, if "Call B" is the popular name, the normalized entity word is set to "Call B".
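A minimal sketch of this normalization, assuming the knowledge graph has been flattened into an alias-to-popular-name lookup (a hypothetical helper structure, not part of the application):

```python
def normalize_entities(entity_words, alias_to_popular):
    """Merge entity words that the knowledge graph says name the same entity."""
    merged = {}
    for word in entity_words:
        popular = alias_to_popular.get(word, word)  # fall back to the word itself
        merged.setdefault(popular, set()).add(word)
    return merged

# normalize_entities(["Call A", "Call B", "Call C"],
#                    {"Call B": "Call A", "Call C": "Call A"})
# -> {"Call A": {"Call A", "Call B", "Call C"}}
```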
By fusing the at least two identified entity information, the entity identification effect is improved, and the entity identification result has higher accuracy.
On the basis of the above embodiment, after S104, the method further includes: and adjusting the confidence level of the entity according to the modal source information of the entity and/or the correlation of the entity in the knowledge graph.
Optionally, adjusting the confidence of the entity according to the modal source information of the entity includes:
if any entity has at least two modal sources, the confidence of the entity is improved; if any entity has only one unique modal source and the unique modal source belongs to a third type of source, the confidence level of the entity is reduced.
In the embodiment of the present application, the third type of modal source includes an audio modal and an author-tagged text in a text modal source, that is, if the only modal source of an entity is the author-tagged text in the audio modal or text modal source, the confidence of the entity is reduced.
Optionally, adjusting the confidence of the entity according to the relevance of the entity in the knowledge graph includes:
if the correlation between at least two entities is established based on the knowledge graph in the knowledge base, the confidences of the two entities are increased; if an entity word is determined, based on the knowledge graph, to have no correlation with any other entity word, the confidence of that entity is reduced; and if at least two entities stand in a contradictory relationship, the confidences of both entities are reduced.
For example, if "actor A" starred in a movie together with "actor B", then "actor A" and "actor B" are correlated, and the confidences of the entities "actor A" and "actor B" are correspondingly increased. Conversely, if "actor C" is recorded as the leading actor of a movie and "actor D" is also recorded as the leading actor of the same movie, there is a contradiction between "actor C" and "actor D", and the confidences of the entities "actor C" and "actor D" are correspondingly reduced.
By adjusting the confidence level of the entity, the user can determine the video which the user wants to browse according to the confidence level of the entity, and the user experience is improved.
Example two
Fig. 2 is a schematic flowchart of an entity identification method for a video scene according to a second embodiment of the present disclosure. The embodiment provides a specific implementation manner for the first embodiment, and as shown in fig. 2, the method may include:
s201, obtaining a target video to be processed and at least one target modality.
S202, extracting at least one target modal feature of the target video.
S203, if the target modality includes a text modality, execute S204; if the target modality includes a visual modality, execute S208; if the target modality includes an audio modality, execute S209.
S204, call the video domain classification algorithm provided by the server to determine the target domain to which the target video belongs.
The target domain represents the category of the target video content, such as film and television, games, or sports. The video domain classification algorithm adopts a two-level classification system for videos, i.e., the target domain is expressed in a two-level form, such as "film and television domain - TV drama" or "sports domain - basketball".
Specifically:
1) A 3D convolutional neural network is established in advance, and target video frames are extracted from the target video; the target video frames may be several key frames or one frame sampled per second.
2) The established 3D convolutional neural network generates multi-channel information from the target video frames, and convolution and down-sampling operations are performed separately on each channel.
3) The information of all channels is combined to obtain the final feature description.
4) A bidirectional LSTM sequence model with a self-attention mechanism is established to model the feature sequence, and the results are finally fused and classified to obtain the target domain of the target video.
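A minimal PyTorch sketch of this pipeline; the layer sizes and head count are illustrative assumptions, since the application does not disclose hyperparameters.

```python
import torch.nn as nn

class VideoDomainClassifier(nn.Module):
    """Sketch: 3D conv features -> BiLSTM with self-attention -> domain logits."""
    def __init__(self, num_domains, hidden=256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv3d(3, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool3d((1, 2, 2)),                   # down-sample spatially only
            nn.Conv3d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d((None, 1, 1)),        # keep the temporal axis
        )
        self.lstm = nn.LSTM(64, hidden, bidirectional=True, batch_first=True)
        self.attn = nn.MultiheadAttention(2 * hidden, num_heads=4, batch_first=True)
        self.fc = nn.Linear(2 * hidden, num_domains)

    def forward(self, clip):                  # clip: (B, 3, T, H, W)
        feats = self.conv(clip)               # (B, 64, T, 1, 1)
        feats = feats.flatten(3).squeeze(-1).transpose(1, 2)  # (B, T, 64)
        seq, _ = self.lstm(feats)             # (B, T, 2*hidden)
        ctx, _ = self.attn(seq, seq, seq)     # self-attention over time
        return self.fc(ctx.mean(dim=1))       # fuse and classify
```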
By determining the target field to which the target video belongs, a foundation is laid for determining a corresponding target entity recognition algorithm according to different fields of the target video subsequently.
S205, determining whether the target field belongs to a preset field set or not; if yes, executing S207; if not, go to S206.
The preset domain set is determined according to user attention to video domains: several domains that draw high user attention are selected as the preset domain set. To improve the accuracy of target video entity extraction, this embodiment invokes different target text entity recognition algorithms for entity recognition on the text modal features according to the target domain to which the target video belongs.
Optionally, the preset domain set includes at least one of the following preset domains: film and television, entertainment, animation, games, music, automobiles, dancing, food, sports, and nature.
Specifically, the target field is compared with a preset field set, and if the preset set comprises the target field, the target field is determined to belong to the preset field set; and if the target field is not included in the preset set, determining that the target field does not belong to the preset field set.
S206, determining the text entity algorithm based on the correlation or the text entity algorithm based on the sequence as the target text entity recognition algorithm to be used in the text entity recognition algorithm of at least two candidate entity recognition algorithms provided by the server.
The correlation-based text entity algorithm includes at least one of the following: an entity algorithm based on XGBoost classification, an entity algorithm based on word2vec, the TextRank graph walk algorithm, a WordRank algorithm based on term importance, and a ranking algorithm based on tf-idf; the network structure of the sequence-based text entity algorithm is BiLSTM-CRF.
Specifically, the recognition accuracy of the correlation-based text entity algorithm is lower than that of the sequence-based text entity algorithm, but the recognition speed of the correlation-based text entity algorithm is higher than that of the sequence-based text entity algorithm. The user can select the target entity precision or the target entity speed according to the actual service requirement in the client.
Optionally, S206 includes:
acquiring required target entity precision; if the target entity precision is the first precision, determining a text entity algorithm based on the correlation as a target text entity recognition algorithm to be used from text entity recognition algorithm classes of at least two candidate entity recognition algorithms provided by a server; if the target entity precision is the second precision, determining a sequence-based text entity algorithm as a target text entity recognition algorithm to be used from text entity recognition algorithm classes of at least two candidate entity recognition algorithms provided by a server; wherein the first precision is lower than the second precision.
And S207, determining the knowledge importance entity recognition algorithm as a target text entity recognition algorithm to be used in the text entity recognition algorithm classes of at least two candidate entity recognition algorithms provided by the server.
The main flow of the knowledge importance entity recognition algorithm is as follows:
1) Invoke the named entity recognition algorithm provided by the server to recognize the candidate entities included in the text modal features.
Specifically, the candidate entities included in the text modal features are determined by a named entity recognition tool, which may be based on a neural network or on feature templates; this embodiment is not limited in this respect. Candidate entities cover three broad categories of named entities (entity, time, and number) and seven subclasses (person name, organization name, place name, time, date, currency, and percentage).
For example, if the text modal feature "Michael Jordan is a basketball player of the Chicago Bulls" is input into the named entity recognition tool, the candidate entities "Michael Jordan", "Chicago Bulls", and "basketball player" are output accordingly.
2) Determine the target entity category associated with the target domain according to the mapping relation between preset domains and preset entity categories in the knowledge importance entity recognition algorithm.
The preset entity categories corresponding to different preset fields are different, and for the same preset field, the preset entity categories concerned by the user are basically fixed, for example, in the movie and television field, the concerned points of the user are concentrated on movie and television series names, main roles and main actors, so that the movie and television series names, the main roles and the main actors are used as the preset entity categories of the movie and television field; for another example, in the automobile field, the user's attention is focused on the brand, model, and name of the automobile, and thus "the brand of the automobile", "the model of the automobile", and "the name of the automobile" are taken as the preset entity categories of "the automobile field".
Specifically, a schema system with a mapping relation to preset entity categories is constructed for each preset domain.
The mapping relation between preset domains and preset entity categories in the knowledge importance entity recognition algorithm optionally includes at least one of the following:
- film and television: film/TV series title, main role, and main actor;
- entertainment: variety show name, guests, host, and related entertainment figures;
- animation: animation title and primary roles;
- games: game name, primary character, and player;
- music: music title and singer;
- automobiles: car brand, model, and name;
- dancing: dance name and dancers;
- food: food name;
- sports: sport, athlete, and team names;
- nature: animals, plants, mountains, and rivers.
For example, assuming that the target domain to which the target video belongs is a "movie and television domain", and the preset entity categories of the preset domain "movie and television domain" in the knowledge importance entity recognition algorithm are "movie and television series names", "main roles" and "main actors", the target entity categories associated with the "movie and television domain" to which the target video belongs are "movie and television series names", "main roles" and "main actors"; assuming that the target field to which the target video belongs is a "game field", and the preset entity categories of the preset field "game field" in the knowledge importance degree entity recognition algorithm are "game name", "primary character", and "player", the target entity categories associated with the target field "game field" to which the target video belongs are "game name", "primary character", and "player".
3) Take the candidate entities belonging to the target entity category as the target entities in the target video.
Specifically, the candidate entity is matched with the target entity category, and the candidate entity belonging to the target entity category is used as the target entity in the target video according to the matching result.
Optionally, the candidate entity is matched with the target entity category based on the knowledge graph information of the target entity category in the knowledge base, and whether the candidate entity belongs to the target entity category is determined.
For example, assuming that the candidate entities are "actor a", "actor B", "character a", "theme song a", "episode a", and "movie a", and the target entity categories are "movie title", "main character", and "main actor", it is determined that "actor a" and "actor B" belong to the target entity category "main actor", "character a" belongs to the target entity category "main character", and "movie a" belongs to the target entity category "movie title" based on the knowledge map information of "movie title", "main character", and "main actor" in the knowledge base, and thus "actor a", "actor B", "character a", and "movie a" in the candidate entities are taken as target entities in the target video.
Illustratively, assume the candidate entities are "singer A", "singer B", "subject song A", "subject song B", "album A", "instrument A", and "instrument B", and the target entity categories are "music title" and "singer". Based on the knowledge graph information of "music title" and "singer" in the knowledge base, it is determined that "singer A" and "singer B" belong to the target entity category "singer", and that "subject song A", "subject song B", and "album A" belong to the target entity category "music title"; therefore "singer A", "singer B", "subject song A", "subject song B", and "album A" among the candidate entities are taken as the target entities in the target video.
Optionally, after "taking the candidate entity belonging to the target entity category as the target entity in the target video", the method further includes:
and calling an entity chain finger algorithm provided by the server to establish a link between the target entity and the entity in the knowledge graph.
By establishing the link between the target entity and the entity in the knowledge graph, the obtained target entity can be corrected and the entity can be expanded, and the richness of the entity identification result is increased.
S208, determining a target visual entity recognition algorithm to be used from the visual entity recognition algorithm class of at least two candidate entity recognition algorithms provided by the server.
Specifically, the server may determine the target visual entity recognition algorithm according to the actual business requirements of the user.
Optionally, S208 includes:
acquiring a target visual category concerned by a user; if the target vision category is a human face, determining a human face entity recognition algorithm as a target vision entity recognition algorithm to be used from a vision entity recognition algorithm class of at least two candidate entity recognition algorithms provided by a server; if the target vision category is an object, determining a vision object recognition algorithm as a target vision entity recognition algorithm to be used from a vision entity recognition algorithm class of at least two candidate entity recognition algorithms provided by a server; and if the target vision category is the vision fingerprint, determining the vision fingerprint identification algorithm as the target vision entity identification algorithm to be used from the vision entity identification algorithm class of at least two candidate entity identification algorithms provided by the server.
The face entity recognition algorithm optionally includes the DeepID algorithm and the like; the visual object recognition algorithm optionally includes the YOLO algorithm and the like.
S209, determining a target audio entity recognition algorithm to be used from the audio entity recognition algorithm class of the at least two candidate entity recognition algorithms provided by the server.
Specifically, a corresponding target audio entity recognition algorithm is determined according to the target entity precision of the user.
If the target entity precision is the third precision, an acoustic-feature-based audio entity algorithm is determined as the target audio entity recognition algorithm to be used from the audio entity recognition algorithm class of the at least two candidate entity recognition algorithms provided by the server; if the target entity precision is the fourth precision, a voiceprint-based audio entity algorithm is determined as the target audio entity recognition algorithm to be used; wherein the third precision is lower than the fourth precision.
S210, calling the target entity recognition algorithm to perform entity recognition on the at least one target modal characteristic so as to obtain a target entity included in the target video.
For example, assuming the target entity recognition algorithm is the entity algorithm based on XGBoost classification, the text modal features are input into a pre-built XGBoost text classifier to recognize and judge the entities in the text modal features.
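A sketch of such an XGBoost-based entity classifier; the per-term features (tf-idf weight, relative position, term length) and the tiny training set are purely illustrative assumptions.

```python
import numpy as np
import xgboost as xgb

# Illustrative features per candidate term: [tf-idf weight, relative position, term length]
X_train = np.array([[0.82, 0.10, 3.0], [0.15, 0.90, 1.0], [0.70, 0.30, 2.0]])
y_train = np.array([1, 0, 1])  # 1 = the term is an entity of the video text

clf = xgb.XGBClassifier(n_estimators=100, max_depth=4)
clf.fit(X_train, y_train)

def correlation_entities(terms, featurize):
    """Keep the terms the classifier judges to be entities."""
    X = np.array([featurize(t) for t in terms])
    return [t for t, keep in zip(terms, clf.predict(X)) if keep]
```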
Illustratively, assuming the target entity recognition algorithm is the sequence-based text entity algorithm whose network structure is BiLSTM-CRF, the text modal features are input into a pre-built entity sequence labeling model for sequence labeling to obtain the target entities; the entity sequence labeling model combines a bidirectional LSTM with a CRF sequence labeling layer and is trained on a large-scale corpus of video text features.
According to the technical scheme provided by the embodiment of the application, if the target modality comprises a text modality, whether a target field to which a target video belongs to a preset field set is determined, if so, a text entity algorithm based on correlation or a text entity algorithm based on a sequence is called to serve as a target text entity recognition algorithm to be used, and if not, a knowledge importance entity recognition algorithm is called to serve as a target text entity recognition algorithm to be used; if the target modality comprises the visual modality, determining a target visual entity recognition algorithm to be used from a visual entity recognition algorithm class of at least two candidate entity recognition algorithms provided by the server; if the target modality comprises an audio modality, determining a target audio entity recognition algorithm to be used from the audio entity recognition algorithms of at least two candidate entity recognition algorithms provided by the server, realizing multi-modal entity recognition of the target video, having high recognition result accuracy and being capable of meeting different service requirements.
On the basis of the above embodiment, if the target modality includes a text modality, before invoking a text entity recognition algorithm to perform entity recognition on the at least one target modality feature, the method further includes:
calling a text quality model provided by a server to determine the text quality of the text modal characteristics; and screening the text modal characteristics according to a determination result.
The text quality model is used to determine the quality of the text features and optionally includes a fastText text classification model.
Optionally, the text features are input into a pre-built fastText classification model to obtain their text quality; text features below a preset text quality threshold are removed, and text features above the threshold are retained.
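A minimal sketch using the fastText Python bindings; the model file name, the label scheme (`__label__high`), and the threshold are assumptions for illustration.

```python
import fasttext

quality_model = fasttext.load_model("text_quality.bin")  # assumed pre-trained model

def filter_text_features(features, threshold=0.5):
    """Drop text features the quality model scores below the threshold."""
    kept = []
    for text in features:
        labels, probs = quality_model.predict(text.replace("\n", " "))
        if labels[0] == "__label__high" and probs[0] >= threshold:
            kept.append(text)
    return kept
```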
Because no knowledge information can be obtained from low-quality text features, screening the text features by text quality guarantees the quality of the remaining text features, filters out textual noise, and avoids the "clickbait title" problem.
On the basis of the above embodiment, after S210, the method further includes:
and according to the obtained target entity, marking a flattened entity label on the target video, or according to the obtained target entity and the knowledge graph, marking a structured entity label on the target video.
The flattened entity label is a label for juxtaposing all entities. The structured entity labels embody the relationship between the labels.
Illustratively, "movie a" is played by "actor a" and "actor B," actor a "plays" character a, "actor B" plays "character B," and character a "is in a relation of sincere and friends with" character B.
By labeling the target video, a user can conveniently inquire interested videos according to the label according to the needs of the user.
On the basis of the above embodiment, after S210, the method further includes:
and vectorizing the semantic representation of the entity to obtain the semantic representation of the target video, wherein the semantic representation is used for downstream application services, such as video semantic search, video recommendation and the like.
EXAMPLE III
Fig. 3 is a schematic structural diagram of an entity identification apparatus 300 for a video scene according to a third embodiment of the present application, which is capable of performing an entity identification method for a video scene according to any embodiment of the present application, and has functional modules and beneficial effects corresponding to the performing method. As shown in fig. 3, the apparatus may include:
a target modality obtaining module 301, configured to obtain a target video to be processed and at least one target modality;
a target modal feature extraction module 302, configured to extract at least one target modal feature of the target video;
a target entity identification algorithm determining module 303, configured to determine, according to the at least one target modality, a target entity identification algorithm to be used from at least two candidate entity identification algorithms provided by the server; wherein the at least two candidate entity identification algorithms are both deployed in the server;
an entity identification module 304, configured to invoke the target entity identification algorithm to perform entity identification on the at least one target modal feature, so as to obtain a target entity included in the target video.
On the basis of the foregoing embodiment, the target entity identification algorithm determining module 303 is specifically configured to:
if the target modality comprises a text modality, determining a target text entity recognition algorithm to be used from a text entity recognition algorithm class of at least two candidate entity recognition algorithms provided by a server;
if the target modality comprises a visual modality, determining a target visual entity recognition algorithm to be used from a visual entity recognition algorithm class of at least two candidate entity recognition algorithms provided by a server;
and if the target modality comprises an audio modality, determining a target audio entity recognition algorithm to be used from the audio entity recognition algorithm class of at least two candidate entity recognition algorithms provided by the server.
On the basis of the foregoing embodiment, the target entity identification algorithm determining module 303 is further specifically configured to:
if the target modality comprises a text modality, calling a video field classification algorithm provided by a server to determine a target field to which the target video belongs; wherein the video domain classification algorithm is deployed in the server;
if the target field belongs to a preset field set, determining a knowledge importance entity recognition algorithm as a target text entity recognition algorithm to be used from text entity recognition algorithm classes of at least two candidate entity recognition algorithms provided by a server;
if the target field does not belong to the preset field set, determining a text entity algorithm based on correlation or a text entity algorithm based on sequence as a target text entity recognition algorithm to be used from text entity recognition algorithm classes of at least two candidate entity recognition algorithms provided by the server;
the relevance-based text entity algorithm includes at least one of the following: an entity algorithm based on xgboost classification, an entity algorithm based on word2vec, the textrank graph-walk algorithm, a wordrank algorithm based on term importance, and a ranking algorithm based on tf-idf; the network structure of the sequence-based text entity algorithm is BiLSTM-CRF.
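As a toy illustration of the tf-idf ranking variant named above (one of several interchangeable relevance-based choices, not the full claimed method), candidate entity words in the video text can be scored against a small background corpus:

    import math
    from collections import Counter

    corpus = [
        "actor A stars in movie A".split(),
        "a cooking tutorial about noodles".split(),
        "movie A trailer with actor B".split(),
    ]

    def tfidf_scores(doc_tokens, corpus):
        """Rank the terms of one document by smoothed tf-idf weight."""
        tf = Counter(doc_tokens)
        n_docs = len(corpus)
        scores = {}
        for term, count in tf.items():
            df = sum(1 for doc in corpus if term in doc)
            idf = math.log((1 + n_docs) / (1 + df)) + 1.0  # smoothed idf
            scores[term] = (count / len(doc_tokens)) * idf
        return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

    print(tfidf_scores(corpus[0], corpus))  # highest-weighted terms first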
On the basis of the foregoing embodiment, the target entity identification algorithm determining module 303 is further specifically configured to:
calling a named entity recognition algorithm provided by a server to recognize candidate entities included in the text modal characteristics; wherein the named entity recognition algorithm is deployed in the server;
determining a target entity category associated with a target field according to a mapping relation between the preset field and the preset entity category in the knowledge importance entity recognition algorithm;
and taking the candidate entity belonging to the target entity category as the target entity in the target video.
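A minimal sketch of this knowledge-importance step follows; the domain-to-category mapping is an illustrative assumption, not the mapping defined in the application:

    # Hypothetical mapping from a video domain to the entity categories that
    # matter in that domain.
    DOMAIN_TO_CATEGORIES = {
        "film":  {"movie", "actor", "character"},
        "music": {"song", "singer", "album"},
    }

    def knowledge_importance_filter(candidates, target_domain):
        """candidates: (entity_word, entity_category) pairs produced by NER."""
        allowed = DOMAIN_TO_CATEGORIES.get(target_domain, set())
        return [word for word, category in candidates if category in allowed]

    candidates = [("movie A", "movie"), ("Beijing", "location"), ("actor A", "actor")]
    print(knowledge_importance_filter(candidates, "film"))  # ['movie A', 'actor A']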
On the basis of the above embodiment, the apparatus further includes an entity linking algorithm calling module, configured to:
and calling an entity linking algorithm provided by the server to establish a link between the target entity and an entity in the knowledge graph.
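One plausible (assumed) realization of entity linking matches a recognized entity word against knowledge-graph nodes by name and disambiguates by context overlap; the graph layout and scoring below are illustrative only:

    # Hypothetical knowledge-graph fragment with two nodes sharing a name.
    KNOWLEDGE_GRAPH = {
        "kg:001": {"name": "movie A", "context": {"actor A", "actor B", "film"}},
        "kg:002": {"name": "movie A", "context": {"documentary", "ocean"}},
    }

    def link_entity(entity_word, co_occurring_entities):
        """Pick the node whose context overlaps the co-occurring entities most."""
        best_id, best_overlap = None, -1
        for node_id, node in KNOWLEDGE_GRAPH.items():
            if node["name"] != entity_word:
                continue
            overlap = len(node["context"] & set(co_occurring_entities))
            if overlap > best_overlap:
                best_id, best_overlap = node_id, overlap
        return best_id

    print(link_entity("movie A", ["actor A", "actor B"]))  # -> kg:001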
On the basis of the foregoing embodiment, the target entity identification algorithm determining module 303 is further specifically configured to:
acquiring required target entity precision;
if the target entity precision is the first precision, determining a text entity algorithm based on the correlation as a target text entity recognition algorithm to be used from text entity recognition algorithm classes of at least two candidate entity recognition algorithms provided by a server;
if the target entity precision is the second precision, determining a sequence-based text entity algorithm as a target text entity recognition algorithm to be used from text entity recognition algorithm classes of at least two candidate entity recognition algorithms provided by a server; wherein the first precision is lower than the second precision.
On the basis of the above embodiment, the apparatus further includes a text modal feature filtering module, configured to:
calling a text quality model provided by a server to determine the text quality of the text modal characteristics;
and screening the text modal characteristics according to a determination result.
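As a toy stand-in for the text quality model (the actual model is not specified here; the heuristics below are assumptions), the screening step could look like:

    def text_quality(text: str) -> float:
        """Score a text feature; lower scores indicate noise or clickbait."""
        score = 1.0
        if len(text) < 5:            # too short to carry knowledge information
            score -= 0.5
        if text.count("!") >= 3:     # exaggerated, clickbait-style title
            score -= 0.4
        return max(score, 0.0)

    def screen_text_features(features, threshold=0.6):
        """Keep only text features whose quality clears the threshold."""
        return [t for t in features if text_quality(t) >= threshold]

    titles = ["Shocking!!! You will not believe this", "movie A official trailer"]
    print(screen_text_features(titles))  # keeps only the informative title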
On the basis of the foregoing embodiment, the target entity identification algorithm determining module 303 is further specifically configured to:
if the target modality comprises a visual modality, acquiring the target visual category of interest to the user;
if the target vision category is a human face, determining a human face entity recognition algorithm as a target vision entity recognition algorithm to be used from a vision entity recognition algorithm class of at least two candidate entity recognition algorithms provided by a server;
if the target vision category is an object, determining a vision object recognition algorithm as a target vision entity recognition algorithm to be used from a vision entity recognition algorithm class of at least two candidate entity recognition algorithms provided by a server;
and if the target vision category is the vision fingerprint, determining the vision fingerprint identification algorithm as the target vision entity identification algorithm to be used from the vision entity identification algorithm class of at least two candidate entity identification algorithms provided by the server.
On the basis of the above embodiment, the apparatus further includes an entity information fusion module, configured to:
and fusing the identified information of at least two entities.
For example, if at least two entity words are associated with the same entity, the information of the at least two entity words is fused under that entity.
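A minimal sketch of this fusion step, assuming each recognized entity word has already been linked to a knowledge-graph id (the record fields are illustrative assumptions):

    from collections import defaultdict

    recognized = [
        {"word": "movie A",      "kg_id": "kg:001", "source": "text",   "score": 0.8},
        {"word": "Movie-A (HD)", "kg_id": "kg:001", "source": "visual", "score": 0.7},
        {"word": "actor A",      "kg_id": "kg:010", "source": "visual", "score": 0.9},
    ]

    def fuse(recognized):
        """Merge entity words sharing a kg_id and aggregate their evidence."""
        merged = defaultdict(lambda: {"words": set(), "sources": set(), "score": 0.0})
        for r in recognized:
            slot = merged[r["kg_id"]]
            slot["words"].add(r["word"])
            slot["sources"].add(r["source"])
            slot["score"] = max(slot["score"], r["score"])
        return dict(merged)

    print(fuse(recognized))  # kg:001 now carries both surface words and sources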
On the basis of the above embodiment, the apparatus further includes a confidence adjustment module, configured to:
and adjusting the confidence level of the entity according to the modal source information of the entity and/or the correlation of the entity in the knowledge graph.
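The adjustment can be sketched as boosting entities that are confirmed by more modal sources and that are related to other recognized entities in the knowledge graph; the boost weights below are illustrative assumptions:

    def adjust_confidence(entity, all_kg_ids, kg_edges):
        """Raise confidence for multi-modal agreement and graph relatedness."""
        conf = entity["score"]
        conf += 0.1 * (len(entity["sources"]) - 1)   # multi-modal agreement
        related = sum(1 for other in all_kg_ids
                      if other != entity["kg_id"]
                      and (entity["kg_id"], other) in kg_edges)
        conf += 0.05 * related                        # knowledge-graph support
        return min(conf, 1.0)

    kg_edges = {("kg:001", "kg:010")}                 # e.g., movie A -- actor A
    entity = {"kg_id": "kg:001", "sources": {"text", "visual"}, "score": 0.8}
    print(adjust_confidence(entity, ["kg:001", "kg:010"], kg_edges))  # 0.95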
The entity identification apparatus 300 for a video scene provided by this embodiment of the present application can execute the entity identification method for a video scene provided by any embodiment of the present application, and has the functional modules and beneficial effects corresponding to that method. For technical details not described in this embodiment, reference may be made to the entity identification method for a video scene provided in any embodiment of the present application.
According to an embodiment of the present application, an electronic device and a readable storage medium are also provided.
Fig. 4 is a block diagram of an electronic device for the entity identification method of a video scene according to an embodiment of the present application. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. Electronic devices may also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions are meant to be examples only, and are not meant to limit implementations of the present application described and/or claimed herein.
As shown in fig. 4, the electronic device includes: one or more processors 401, a memory 402, and interfaces for connecting the components, including high-speed interfaces and low-speed interfaces. The components are interconnected using different buses and may be mounted on a common motherboard or in other manners as desired. The processor may process instructions executed within the electronic device, including instructions stored in or on the memory to display graphical information of a GUI on an external input/output apparatus (such as a display device coupled to the interface). In other embodiments, multiple processors and/or multiple buses may be used together with multiple memories, as desired. Also, multiple electronic devices may be connected, with each device providing some of the necessary operations (e.g., as a server array, a group of blade servers, or a multi-processor system). One processor 401 is taken as an example in fig. 4.
Memory 402 is a non-transitory computer readable storage medium as provided herein. Wherein the memory stores instructions executable by at least one processor to cause the at least one processor to perform the method for entity identification of video scenes provided herein. The non-transitory computer readable storage medium of the present application stores computer instructions for causing a computer to perform the method of entity identification of a video scene provided herein.
The memory 402, as a non-transitory computer readable storage medium, may be used to store non-transitory software programs, non-transitory computer executable programs, and modules, such as the program instructions/modules corresponding to the entity identification method for a video scene in the embodiments of the present application (for example, the target modality obtaining module 301, the target modal feature extraction module 302, the target entity identification algorithm determining module 303, and the entity identification module 304 shown in fig. 3). The processor 401 runs the non-transitory software programs, instructions, and modules stored in the memory 402 to execute various functional applications and data processing of the server, thereby implementing the entity identification method for a video scene in the above method embodiments.
The memory 402 may include a program storage area and a data storage area, where the program storage area may store an operating system and an application program required by at least one function, and the data storage area may store data created according to the use of the electronic device for entity identification of video scenes, and the like. Further, the memory 402 may include a high-speed random access memory, and may also include a non-transitory memory, such as at least one magnetic disk storage device, a flash memory device, or another non-transitory solid-state storage device. In some embodiments, the memory 402 may optionally include memories located remotely from the processor 401, and these remote memories may be connected via a network to the electronic device for entity identification of video scenes. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The electronic device of the method of entity identification of a video scene may further include: an input device 403 and an output device 404. The processor 401, the memory 402, the input device 403 and the output device 404 may be connected by a bus or other means, and fig. 4 illustrates an example of a connection by a bus.
The input device 403 may receive input numeric or character information and generate key signal inputs related to user settings and function control of the electronic device for entity identification of video scenes, and may be, for example, a touch screen, a keypad, a mouse, a track pad, a touch pad, a pointing stick, one or more mouse buttons, a trackball, a joystick, or another input device. The output device 404 may include a display device, an auxiliary lighting device (e.g., an LED), a haptic feedback device (e.g., a vibration motor), and the like. The display device may include, but is not limited to, a liquid crystal display (LCD), a light emitting diode (LED) display, and a plasma display. In some implementations, the display device may be a touch screen.
Various implementations of the systems and techniques described here can be realized in digital electronic circuitry, integrated circuitry, ASICs (application-specific integrated circuits), computer hardware, firmware, software, and/or combinations thereof. These various implementations may include: implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special-purpose or general-purpose, and which receives data and instructions from, and transmits data and instructions to, a storage system, at least one input device, and at least one output device.
These computer programs (also known as programs, software applications, or code) include machine instructions for a programmable processor, and may be implemented using high-level procedural and/or object-oriented programming languages, and/or assembly/machine languages. As used herein, the terms "machine-readable medium" and "computer-readable medium" refer to any computer program product, apparatus, and/or device (e.g., magnetic discs, optical disks, memory, Programmable Logic Devices (PLDs)) used to provide machine instructions and/or data to a programmable processor, including a machine-readable medium that receives machine instructions as a machine-readable signal. The term "machine-readable signal" refers to any signal used to provide machine instructions and/or data to a programmable processor.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.
According to the technical scheme of the embodiment of the application, as various entity recognition algorithms are preset to perform entity recognition on videos in different fields, entities described by video texts can be accurately extracted, and the integrity of recognition results is ensured.
It should be understood that steps may be reordered, added, or deleted using the various forms of flows shown above. For example, the steps described in the present application may be executed in parallel, sequentially, or in a different order, which is not limited herein as long as the desired results of the technical solutions disclosed in the present application can be achieved.
The above-described embodiments should not be construed as limiting the scope of the present application. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present application shall be included in the protection scope of the present application.

Claims (15)

1. A method for entity identification of a video scene, comprising:
acquiring a target video to be processed and at least one target modality;
extracting at least one target modal feature of the target video;
determining a target entity recognition algorithm to be used from at least two candidate entity recognition algorithms provided by a server according to the at least one target modality; wherein the at least two candidate entity identification algorithms are both deployed in the server;
and calling the target entity identification algorithm to perform entity identification on the at least one target modal characteristic so as to obtain a target entity included in the target video.
2. The method according to claim 1, wherein determining a target entity recognition algorithm to be used from at least two candidate entity recognition algorithms provided by a server according to the at least one target modality comprises:
if the target modality comprises a text modality, determining a target text entity recognition algorithm to be used from a text entity recognition algorithm class of at least two candidate entity recognition algorithms provided by a server;
if the target modality comprises a visual modality, determining a target visual entity recognition algorithm to be used from a visual entity recognition algorithm class of at least two candidate entity recognition algorithms provided by a server;
and if the target modality comprises an audio modality, determining a target audio entity recognition algorithm to be used from the audio entity recognition algorithm class of at least two candidate entity recognition algorithms provided by the server.
3. The method according to claim 2, wherein if the target modality includes a text modality, determining a target text entity recognition algorithm to be used from among a text entity recognition algorithm class of at least two candidate entity recognition algorithms provided by a server comprises:
if the target modality comprises a text modality, calling a video field classification algorithm provided by a server to determine a target field to which the target video belongs; wherein the video domain classification algorithm is deployed in the server;
if the target field belongs to a preset field set, determining a knowledge importance entity recognition algorithm as a target text entity recognition algorithm to be used from text entity recognition algorithm classes of at least two candidate entity recognition algorithms provided by a server;
and if the target field does not belong to the preset field set, determining a text entity algorithm based on the correlation or a text entity algorithm based on the sequence as the target text entity recognition algorithm to be used from text entity recognition algorithm classes of at least two candidate entity recognition algorithms provided by the server.
4. The method according to claim 3, wherein if the target text entity recognition algorithm is a knowledge importance entity recognition algorithm, invoking the target entity recognition algorithm to perform entity recognition on the at least one target modal feature comprises:
calling a named entity recognition algorithm provided by a server to recognize candidate entities included in the text modal characteristics; wherein the named entity recognition algorithm is deployed in the server;
determining a target entity category associated with a target field according to a mapping relation between the preset field and the preset entity category in the knowledge importance entity recognition algorithm;
and taking the candidate entity belonging to the target entity category as the target entity in the target video.
5. The method of claim 4, wherein after taking the candidate entities belonging to the target entity category as the target entities in the target video, the method further comprises:
and calling an entity linking algorithm provided by the server to establish a link between the target entity and an entity in the knowledge graph.
6. The method of claim 3, wherein determining a relevance-based text entity algorithm or a sequence-based text entity algorithm from among the text entity recognition algorithm classes of at least two candidate entity recognition algorithms provided by the server as the target text entity recognition algorithm to be used comprises:
acquiring required target entity precision;
if the target entity precision is the first precision, determining a text entity algorithm based on the correlation as a target text entity recognition algorithm to be used from text entity recognition algorithm classes of at least two candidate entity recognition algorithms provided by a server;
if the target entity precision is the second precision, determining a sequence-based text entity algorithm as a target text entity recognition algorithm to be used from text entity recognition algorithm classes of at least two candidate entity recognition algorithms provided by a server; wherein the first precision is lower than the second precision.
7. The method according to claim 1, wherein if the target modality includes a text modality, before invoking a text entity recognition algorithm to perform entity recognition on the at least one target modality feature, further comprising:
calling a text quality model provided by a server to determine the text quality of the text modal characteristics;
and screening the text modal characteristics according to a determination result.
8. The method according to claim 2, wherein if the target modality comprises a visual modality, determining a target visual entity recognition algorithm to be used from a visual entity recognition algorithm class of at least two candidate entity recognition algorithms provided by a server comprises:
if the target modality comprises a visual modality, acquiring the target visual category of interest to the user;
if the target vision category is a human face, determining a human face entity recognition algorithm as a target vision entity recognition algorithm to be used from a vision entity recognition algorithm class of at least two candidate entity recognition algorithms provided by a server;
if the target vision category is an object, determining a vision object recognition algorithm as a target vision entity recognition algorithm to be used from a vision entity recognition algorithm class of at least two candidate entity recognition algorithms provided by a server;
and if the target vision category is the vision fingerprint, determining the vision fingerprint identification algorithm as the target vision entity identification algorithm to be used from the vision entity identification algorithm class of at least two candidate entity identification algorithms provided by the server.
9. The method according to claim 1, wherein after invoking the target entity recognition algorithm to perform entity recognition on the at least one target modal feature, further comprising:
and fusing the identified information of at least two entities.
10. The method according to claim 1, wherein after invoking the target entity recognition algorithm to perform entity recognition on the at least one target modal feature, further comprising:
and adjusting the confidence level of the entity according to the modal source information of the entity and/or the correlation of the entity in the knowledge graph.
11. An apparatus for entity recognition of a video scene, comprising:
the target modality acquisition module is used for acquiring a target video to be processed and at least one target modality;
the target modal feature extraction module is used for extracting at least one target modal feature of the target video;
a target entity identification algorithm determining module, configured to determine, according to the at least one target modality, a target entity identification algorithm to be used from at least two candidate entity identification algorithms provided by the server; wherein the at least two candidate entity identification algorithms are both deployed in the server;
and the entity identification module is used for calling the target entity identification algorithm to perform entity identification on the at least one target modal characteristic so as to obtain a target entity included in the target video.
12. The apparatus of claim 11, wherein the target entity identification algorithm determination module is specifically configured to:
if the target modality comprises a text modality, determining a target text entity recognition algorithm to be used from a text entity recognition algorithm class of at least two candidate entity recognition algorithms provided by a server;
if the target modality comprises a visual modality, determining a target visual entity recognition algorithm to be used from a visual entity recognition algorithm class of at least two candidate entity recognition algorithms provided by a server;
and if the target modality comprises an audio modality, determining a target audio entity recognition algorithm to be used from the audio entity recognition algorithm class of at least two candidate entity recognition algorithms provided by the server.
13. The apparatus according to claim 12, wherein the target entity identification algorithm determining module is further configured to:
if the target modality comprises a text modality, calling a video field classification algorithm provided by a server to determine a target field to which the target video belongs; wherein the video domain classification algorithm is deployed in the server;
if the target field belongs to a preset field set, determining a knowledge importance entity recognition algorithm as a target text entity recognition algorithm to be used from text entity recognition algorithm classes of at least two candidate entity recognition algorithms provided by a server;
and if the target field does not belong to the preset field set, determining a text entity algorithm based on the correlation or a text entity algorithm based on the sequence as the target text entity recognition algorithm to be used from text entity recognition algorithm classes of at least two candidate entity recognition algorithms provided by the server.
14. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-10.
15. A non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform the method of any one of claims 1-10.
CN201910828965.2A 2019-09-03 2019-09-03 Entity identification method and device for video scene, electronic equipment and medium Active CN110532433B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910828965.2A CN110532433B (en) 2019-09-03 2019-09-03 Entity identification method and device for video scene, electronic equipment and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910828965.2A CN110532433B (en) 2019-09-03 2019-09-03 Entity identification method and device for video scene, electronic equipment and medium

Publications (2)

Publication Number Publication Date
CN110532433A true CN110532433A (en) 2019-12-03
CN110532433B CN110532433B (en) 2023-07-25

Family

ID=68666559

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910828965.2A Active CN110532433B (en) 2019-09-03 2019-09-03 Entity identification method and device for video scene, electronic equipment and medium

Country Status (1)

Country Link
CN (1) CN110532433B (en)

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080052262A1 (en) * 2006-08-22 2008-02-28 Serhiy Kosinov Method for personalized named entity recognition
US20160065636A1 (en) * 2014-08-29 2016-03-03 The Boeing Company Peer to peer provisioning of data for flight simulators across networks
WO2016205432A1 (en) * 2015-06-16 2016-12-22 Microsoft Technology Licensing, Llc Automatic recognition of entities in media-captured events
EP3367313A1 (en) * 2017-02-28 2018-08-29 Accenture Global Solutions Limited Content recognition and communication system
CN110019934A (en) * 2017-09-20 2019-07-16 微软技术许可有限责任公司 Identify the correlation of video
CN108377418A (en) * 2018-02-06 2018-08-07 北京奇虎科技有限公司 A kind of video labeling treating method and apparatus
CN109325148A (en) * 2018-08-03 2019-02-12 百度在线网络技术(北京)有限公司 The method and apparatus for generating information

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111639234A (en) * 2020-05-29 2020-09-08 北京百度网讯科技有限公司 Method and device for mining core entity interest points
CN111832727A (en) * 2020-07-17 2020-10-27 海南大学 Cross-data, information, knowledge modality and dimension essence identification method and component
CN111832727B (en) * 2020-07-17 2021-10-08 海南大学 Cross-data, information, knowledge modality and dimension essence identification method and component
CN112100346A (en) * 2020-08-28 2020-12-18 西北工业大学 Visual question-answering method based on fusion of fine-grained image features and external knowledge
CN112100346B (en) * 2020-08-28 2021-07-20 西北工业大学 Visual question-answering method based on fusion of fine-grained image features and external knowledge
CN112183334B (en) * 2020-09-28 2024-03-22 南京大学 Video depth relation analysis method based on multi-mode feature fusion
CN112183334A (en) * 2020-09-28 2021-01-05 南京大学 Video depth relation analysis method based on multi-modal feature fusion
CN112330519A (en) * 2020-11-17 2021-02-05 珠海大横琴科技发展有限公司 Data processing method and device
CN113297419A (en) * 2021-06-23 2021-08-24 南京谦萃智能科技服务有限公司 Video knowledge point determining method and device, electronic equipment and storage medium
CN113297419B (en) * 2021-06-23 2024-04-09 南京谦萃智能科技服务有限公司 Video knowledge point determining method, device, electronic equipment and storage medium
CN113610034A (en) * 2021-08-16 2021-11-05 脸萌有限公司 Method, device, storage medium and electronic equipment for identifying person entity in video
CN113610034B (en) * 2021-08-16 2024-04-30 脸萌有限公司 Method and device for identifying character entities in video, storage medium and electronic equipment
CN114399701A (en) * 2021-12-14 2022-04-26 北京达佳互联信息技术有限公司 Entity identification method, device, equipment and storage medium

Also Published As

Publication number Publication date
CN110532433B (en) 2023-07-25

Similar Documents

Publication Publication Date Title
CN110532433B (en) Entity identification method and device for video scene, electronic equipment and medium
US11810576B2 (en) Personalization of experiences with digital assistants in communal settings through voice and query processing
CN112507715B (en) Method, device, equipment and storage medium for determining association relation between entities
CN111241282B (en) Text theme generation method and device and electronic equipment
US10013487B2 (en) System and method for multi-modal fusion based fault-tolerant video content recognition
CN112507068B (en) Document query method, device, electronic equipment and storage medium
CN112487814B (en) Entity classification model training method, entity classification device and electronic equipment
CN111918094B (en) Video processing method and device, electronic equipment and storage medium
CN111737559B (en) Resource ordering method, method for training ordering model and corresponding device
CN113094550B (en) Video retrieval method, device, equipment and medium
WO2018070995A1 (en) Diversifying media search results on online social networks
CN111368141B (en) Video tag expansion method, device, computer equipment and storage medium
CN111639228B (en) Video retrieval method, device, equipment and storage medium
CN112163122A (en) Method and device for determining label of target video, computing equipment and storage medium
US11126682B1 (en) Hyperlink based multimedia processing
CN111737501B (en) Content recommendation method and device, electronic equipment and storage medium
CN112163428A (en) Semantic tag acquisition method and device, node equipment and storage medium
US20230140369A1 (en) Customizable framework to extract moments of interest
US11636282B2 (en) Machine learned historically accurate temporal classification of objects
CN110516654A (en) Entity recognition method, device, electronic equipment and the medium of video scene
CN113746875A (en) Voice packet recommendation method, device, equipment and storage medium
US20230119313A1 (en) Voice packet recommendation method and apparatus, device and storage medium
CN111158924A (en) Content sharing method and device, electronic equipment and readable storage medium
CN111639234B (en) Method and device for mining core entity attention points
US20100169318A1 (en) Contextual representations from data streams

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant