CN116028668A - Information processing method, apparatus, computer device, and storage medium - Google Patents

Information processing method, apparatus, computer device, and storage medium

Info

Publication number: CN116028668A
Application number: CN202111260298.6A
Authority: CN (China)
Prior art keywords: training, entity, video, information, text
Legal status: Pending
Other languages: Chinese (zh)
Inventor: 李作潮
Current Assignee: Tencent Technology (Shenzhen) Co., Ltd.
Original Assignee: Tencent Technology (Shenzhen) Co., Ltd.
Application filed by Tencent Technology (Shenzhen) Co., Ltd.
Priority to CN202111260298.6A
Publication of CN116028668A

Abstract

The embodiment of the application discloses an information processing method, an information processing apparatus, a computer device, and a storage medium. The method comprises the following steps: acquiring search information, and determining an initial video set based on the search information; acquiring video description information of each initial video, and determining an entity set of each initial video and entity information of each entity in the entity set according to the video description information of each initial video; determining input texts based on the video description information of each initial video and the entity information corresponding to each initial video, so as to obtain an input text set corresponding to the initial video set; performing category recognition on each input text to obtain a category recognition result of each input text; and determining a target input text from the input text set, and associating the initial video corresponding to the video description information included in the target input text with the search information, the target input text being an input text whose category recognition result is the target category recognition result. The method enhances the relevance between the search information and the recalled videos.

Description

Information processing method, apparatus, computer device, and storage medium
Technical Field
The present disclosure relates to the field of computer technologies, and in particular, to an information processing method, an information processing device, a computer device, and a storage medium.
Background
With the rapid development of computer technology and multimedia technology, more and more short videos are shared to various platforms for viewing by users. In general, a user may perform a search operation on a video platform, for example by entering relevant search information in a search input box, to obtain the desired short videos. Currently, when video recommendation is performed based on a user's search information, short videos are mostly recalled based on text strategies, but such strategies may rely only on certain query words being hit in the search information. For example, the short videos recalled for the search information "Dajiang Dahe" ("Great River") may be videos about rivers, and the recalled videos may contain a large number of movie fragments which the user does not want to watch. Therefore, how to recall more accurate short videos based on a user's search information has become a current research hotspot.
Disclosure of Invention
The embodiment of the application provides an information processing method, an information processing apparatus, a computer device, and a storage medium, which can mine the required entities from a user's search information, so that those entities are involved in the video content of the recalled videos obtained based on the search information. This effectively enhances the relevance between the search information and the recalled videos, facilitates more accurate video recommendation, and improves user experience. At the same time, an entity can be uniquely represented based on various attribute characteristics, so as to achieve entity disambiguation.
The first aspect of the embodiment of the application discloses an information processing method, which comprises the following steps:
acquiring search information input by a user, and determining an initial video set based on the search information, wherein the initial video set comprises one or more initial videos;
acquiring video description information of each initial video, and determining an entity set of each initial video and entity information of each entity in the entity set according to the video description information of each initial video;
determining an input text based on the video description information of each initial video and the entity information corresponding to each initial video to obtain an input text set corresponding to the initial video set; the input text set comprises one or more input texts, wherein one input text corresponds to one video description information and entity information of one entity;
performing category recognition on each input text to obtain a category recognition result of each input text;
determining a target input text from the input text set, and associating the initial video corresponding to the video description information included in the target input text with the search information, wherein the target input text is an input text whose category recognition result is the target category recognition result.
A second aspect of an embodiment of the present application discloses an information processing apparatus, including:
the acquisition unit is used for acquiring search information input by a user and determining an initial video set based on the search information, wherein the initial video set comprises one or more initial videos;
the first determining unit is used for acquiring video description information of each initial video and determining an entity set of each initial video and entity information of each entity in the entity set according to the video description information of each initial video;
the second determining unit is used for determining an input text based on the video description information of each initial video and the entity information corresponding to each initial video so as to obtain an input text set corresponding to the initial video set; the input text set comprises one or more input texts, wherein one input text corresponds to one video description information and entity information of one entity;
the recognition unit is used for carrying out category recognition on each input text to obtain a category recognition result of each input text;
the association unit is used for determining a target input text from the input text set and associating the initial video corresponding to the video description information included in the target input text with the search information, wherein the target input text is an input text whose category recognition result is the target category recognition result.
A third aspect of the embodiments of the present application discloses a computer device, including a processor, a memory, and a network interface, where the processor, the memory, and the network interface are connected to each other, and the memory is configured to store a computer program, where the computer program includes program instructions, and the processor is configured to invoke the program instructions to execute the method of the first aspect.
A fourth aspect of the present application discloses a computer readable storage medium storing a computer program comprising program instructions which, when executed by a processor, cause the processor to perform the method of the first aspect described above.
A fifth aspect of the embodiments of the present application discloses a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The computer instructions are read from a computer-readable storage medium by a processor of a computer device, the computer instructions being executed by the processor to cause the computer device to perform the method of the first aspect described above.
In the embodiment of the application, the search information input by the user can be acquired, and the initial video set can be determined based on the search information; then, the video description information of each initial video can be acquired, so as to determine the entity set of each initial video and the entity information of each entity in the entity set according to the video description information of each initial video. Input texts can then be determined based on the video description information of each initial video and the entity information corresponding to each initial video, so as to obtain an input text set corresponding to the initial video set, where one input text corresponds to one piece of video description information and the entity information of one entity. After the input text set is obtained, the category recognition result of each input text can further be determined, so as to determine the initial videos associated with the search information according to the category recognition results. For example, a target input text can be determined from the input text set, and the initial video corresponding to the video description information included in the target input text can be associated with the search information, where the target input text is an input text whose category recognition result is the target category recognition result. By implementing the above method, the required entities can be mined according to the user's search information, so that those entities are involved in the video content of the recalled videos (i.e., the initial videos associated with the search information) obtained based on the search information; this effectively enhances the relevance between the search information and the recalled videos, facilitates more accurate video recommendation, and improves user experience. Meanwhile, an entity can be uniquely represented based on various attribute characteristics, thereby realizing entity disambiguation; further, by recognizing the text formed from the entity part and the video description part, the entity truly related to a video can be screened out from a plurality of similar entities, so that the inaccuracy caused by similar entities when recalling videos is avoided and higher-precision entity linking is achieved.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings in the following description show only some embodiments of the present application, and a person skilled in the art may obtain other drawings from these drawings without inventive effort.
FIG. 1 is a schematic diagram of an information handling system according to an embodiment of the present application;
fig. 2 is a schematic flow chart of an information processing method according to an embodiment of the present application;
fig. 3a is a schematic structural diagram of acquiring search information according to an embodiment of the present application;
FIG. 3b is an interface diagram of a video search interface according to an embodiment of the present application;
fig. 4 is a schematic flow chart of an information processing method according to an embodiment of the present application;
FIG. 5 is a schematic flow chart of training a pre-training model according to an embodiment of the present application;
fig. 6 is a schematic structural view of an information processing apparatus according to an embodiment of the present application;
fig. 7 is a schematic structural diagram of a computer device according to an embodiment of the present application.
Detailed Description
The following description of the embodiments of the present application will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are some, but not all, of the embodiments of the present application. All other embodiments, which can be made by one of ordinary skill in the art without undue burden from the present disclosure, are within the scope of the present disclosure.
Artificial intelligence (Artificial Intelligence, AI) is the theory, method, technique, and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use the knowledge to obtain optimal results. In other words, artificial intelligence is a comprehensive technology of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can react in a way similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning, and decision-making.
Artificial intelligence technology is a comprehensive discipline involving a wide range of fields, covering both hardware-level and software-level technologies. Basic artificial intelligence technologies generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operation/interaction systems, and mechatronics. Artificial intelligence software technologies mainly include computer vision, speech processing, natural language processing, and machine learning/deep learning.
Computer vision (Computer Vision, CV) is a science that studies how to make machines "see"; more specifically, it uses cameras and computers instead of human eyes to perform machine vision tasks such as identifying, tracking, and measuring targets, and further performs graphic processing so that the computer produces images more suitable for human eyes to observe or for transmission to instruments for detection. As a scientific discipline, computer vision studies related theories and technologies in an attempt to build artificial intelligence systems that can acquire information from images or multidimensional data. Computer vision techniques typically include image processing, image recognition, image semantic understanding, image retrieval, OCR, video processing, video semantic understanding, video content/behavior recognition, three-dimensional object reconstruction, 3D techniques, virtual reality, augmented reality, and simultaneous localization and mapping, as well as common biometric recognition techniques such as face recognition and fingerprint recognition.
Machine learning (Machine Learning, ML) is a multi-field interdisciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory, and other disciplines. It specially studies how a computer simulates or implements human learning behavior to acquire new knowledge or skills and reorganizes existing knowledge structures to continuously improve its own performance. Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent; it is applied in all areas of artificial intelligence. Machine learning/deep learning typically includes techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and learning from instruction.
Based on the computer vision technology and machine learning technology mentioned above, the embodiment of the application provides an information processing scheme. Specifically, the scheme is roughly based on the following principle: firstly, search information input by a user can be acquired, and an initial video set can be determined based on the search information; then, the video description information of each initial video can be acquired, so as to determine the entity set of each initial video and the entity information of each entity in the entity set according to the video description information of each initial video. Here, an entity may refer to an IP (Intellectual Property) entity, where IP refers to the legal term "creations of the mind", which generally covers music, literature, and other works of art; in this application an IP entity can be specifically understood as a film or television work. The concrete expression form of an entity is the title of the corresponding work, for example "Mountain Sea Condition" or "Great River". Then, input texts can be determined based on the video description information of each initial video and the entity information corresponding to each initial video, so as to obtain an input text set corresponding to the initial video set, where one input text corresponds to one piece of video description information and the entity information of one entity. After the input text set is obtained, the category recognition result of each input text can further be determined, so as to determine the initial videos associated with the search information according to the category recognition results. For example, a target input text can be determined from the input text set, and the initial video corresponding to the video description information included in the target input text can be associated with the search information, where the target input text is an input text whose category recognition result is the target category recognition result. By implementing the above method, entities such as film and television works and their characters can be analyzed from the user's search information, so that those entities are involved in the video content of the recalled videos (i.e., the initial videos associated with the search information) obtained based on the search information; this effectively enhances the relevance between the search information and the recalled videos, facilitates more accurate video recommendation, and improves user experience. Meanwhile, an entity can be uniquely represented based on various attribute characteristics, thereby realizing entity disambiguation; further, by recognizing the text formed from the entity part and the video description part, the entity truly related to a video can be screened out from a plurality of similar entities, so that the inaccuracy caused by similar entities when recalling videos is avoided and higher-precision entity linking is achieved.
In a specific implementation, the execution subject of the above information processing scheme may be a computer device, which may be a terminal or a server. The terminal mentioned here may be a smart phone, a tablet computer, a notebook computer, a desktop computer, or an external device such as a game controller or a touch screen; the server may be an independent physical server, a server cluster or distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, content delivery networks (Content Delivery Network, CDN), big data, and artificial intelligence platforms. It should be noted that, when the computer device is a server, the embodiment of the present application provides an information processing system, as shown in fig. 1, where the information processing system includes at least one terminal and at least one server; the terminal can acquire search information input by a user and upload the acquired search information to the server (i.e., the computer device), so that the computer device can acquire the search information, determine an initial video set according to the search information, and further determine the videos to be recommended according to the initial video set.
In one implementation, the above information processing scheme may be applied to various application software such as video search, and video recommendation scenes can be implemented using such application software. When applied to video recommendation, mining the entities included in the search information can effectively improve the accuracy of video recommendation, so that the recommended videos meet the user's requirements and user experience is improved. For example, a user may input search information on the relevant interface of a video search application; after the computer device obtains the search information, the recalled videos associated with the search information (i.e., the initial videos described above as being associated with the search information) can be obtained based on the information processing scheme in the present application, and video recommendation can be performed with the recalled videos. The search information and the recalled videos can also be stored in association, so that search information subsequently input by a user can be matched directly against the stored search information, and video recommendation can be performed using the recalled videos corresponding to the matched search information, thereby speeding up video recommendation. In another embodiment, for search information and recalled videos stored in association, the required videos can later be retrieved directly from storage during the relevant training of a model. For example, when obtaining training samples for a model that needs to be trained with videos related to certain entities, the workload of querying videos from a massive video library can be reduced, thereby improving working efficiency. It should be noted that the present application mainly takes a video recommendation scenario as an example.
Alternatively, the above-mentioned information processing scheme may be performed by the terminal and the server together. For example, after acquiring the search information input by the user, the terminal may also determine an initial video set based on the search information, determine an entity set based on the initial video set, and entity information of each entity in the entity set, and further determine an input text set based on the video description information and the entity information. And uploading the determined input text set to a server by the terminal, so that the server can directly perform category identification on each input text in the input text set, and the initial video with relevance to the search information is determined according to the category identification result. As another example, after acquiring the search information input by the user, the terminal may also determine the initial video set based on the search information. The determined initial video set is then uploaded to the server by the terminal so that the server can perform subsequent determination operations (e.g., determining an entity set, determining entity information, determining an input text set, etc.) and recognition operations, etc., based on the uploaded initial video set. It should be noted that, when the information processing scheme is executed by the terminal and the server together, the terminal and the server may be directly or indirectly connected through wired or wireless communication, and the embodiment of the present application is not limited herein.
Based on the information processing scheme provided above, the embodiments of the present application provide an information processing method that can be executed by the computer device (i.e., terminal or server) mentioned above; alternatively, the information processing method may be performed by a terminal and a server together. For convenience of explanation, the embodiments of the present application will be described by taking a computer device to execute the information processing method as an example. Referring to fig. 2, the information processing method may include the following steps S201 to S205:
S201: Search information input by a user is acquired, and an initial video set is determined based on the search information.
In one implementation, when a user needs to search or browse some videos, where the videos may specifically refer to various movie and television videos, the user may perform related operations on the video search interface output by the terminal to obtain the desired recommended videos, which are displayed on the video search interface. Alternatively, the search information may be obtained based on information entered by the user on the video search interface. In one embodiment, the video search interface may include a search information setting area, which may be used by the user to enter search information, and a result display area for displaying the resulting videos to be recommended. See, for example, fig. 3a: the terminal used by the user may display a video search interface on the terminal screen, and the interface may include at least the search information setting area marked 301. If the user wants to search or browse some videos, the user can input related information of the videos in the search information setting area 301; for example, the related information can be the title of a film or television work (such as "Great River" or "Mountain Sea Condition"), or a description of a specific scenario in a film or television work (such as "Great River 2: Liang Saishen returns to the country"), and so on. After detecting the search information in the search information setting area, the terminal can thereby acquire the search information for subsequent operations. The above example is described by taking the computer device as a server (i.e., the server executes the information processing method).
In one implementation, after the search information entered by the user is obtained, a corresponding initial video set, which may include one or more initial videos, may be determined based on the search information. Alternatively, the initial video set may be determined based on strategies such as keywords or historical search records. For example, if the search information input by the user is "Great River 2 Liang Saishen returns to the country", the keywords in the search information may include "Great River 2", "Liang Saishen", "Great River", and the like; based on this, the videos related to the keywords may be queried from the video library, and the queried videos may be used as the initial videos.
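As an illustration of this recall step, a minimal sketch in Python follows; the whitespace keyword split and the data shapes are assumptions, since the application does not prescribe a concrete implementation:

```python
from typing import Dict, List

def recall_initial_videos(search_info: str,
                          video_library: List[Dict]) -> List[Dict]:
    """Recall initial videos whose titles hit a keyword from the search information.

    Illustrative only: real keyword extraction and indexing are not
    specified by this application.
    """
    # Hypothetical keyword extraction: split the search information
    # into whitespace-separated terms.
    keywords = search_info.split()
    return [video for video in video_library
            if any(kw in video["title"] for kw in keywords)]

# recall_initial_videos("Great River 2 Liang Saishen",
#                       [{"title": "Great River 2 Liang Saishen returns"}])
```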
Alternatively, a video (such as an initial video) referred to in the present application may be a short video, and a short video may be a video clip from a film or television work, a video made by the public, or the like; the videos in the present application are mainly described by taking video clips from film and television works as an example.
S202: and acquiring video description information of each initial video, and determining an entity set of each initial video and entity information of each entity in the entity set according to the video description information of each initial video.
In one implementation, the video description information of each initial video may be obtained, where the video description information may refer to a description of the content of the initial video, such as a description of a scenario segment of an initial video or of a specific scenario in the initial video. The video description information may be characterized by the video title or by other means; typically, when a video is displayed on the terminal screen, the video is accompanied by a corresponding video title, which may be used as the video description information. For example, the video title of an initial video may be ""Red Dragon" is an action blockbuster that cannot be fast-forwarded for even one second and is super strong", and this video title may be used as the video description information of that initial video. As another example, the video title of a certain initial video may be "Great River 2 Liang Saishen returns to the country", and this video title may be used as the video description information of that initial video.
After the video description information of each initial video is acquired, for any initial video in the initial video set, the entity set of that initial video and the entity information corresponding to each entity in the entity set can be determined based on the video description information of that initial video. An entity in the entity set may refer to a film and television work entity, and the concrete expression form of an entity is the title of the corresponding work, for example "Mountain Sea Condition" or "Great River". The following description takes any one initial video in the initial video set as an example. Optionally, when determining the entity set of the initial video, the video description information of the initial video may be matched against a plurality of reference entities in a knowledge base, and the matched reference entities are added to the entity set of the initial video, where the entity set may include one or more entities. The above matching operation may refer to text matching, that is, determining whether any reference entity in the knowledge base appears in the video description information; if a certain reference entity appears in the video description information, it is determined that the reference entity matches the video description information.
The knowledge base may include a large number of reference entities; considering that the video search referred to in the present application is a search of various movie and television fragments, the reference entities may be specifically understood as the titles of various film and television works. For example, the reference entities may include "Mountain Sea Condition", "Shelter", "Great River", "Halibut", "Red Dragon", and so on, all of which are names of television shows or movies.
For example, suppose a piece of video description information is ""Red Dragon" is an action blockbuster that cannot be fast-forwarded for even one second and is super strong". The video description information can be matched against each reference entity in the knowledge base to determine whether a reference entity matching the video description information exists in the knowledge base. For example, if the knowledge base includes reference entities such as "Red" and "Red Dragon", it can be determined through the matching operation that the reference entities matching the video description information include "Red" and "Red Dragon", and both "Red" and "Red Dragon" can then be used as entities in the entity set of the initial video.
As another example, suppose a piece of video description information is "male lead team with love of love in the world programming team". The video description information may be matched against each reference entity in the knowledge base to determine whether a reference entity matching the video description information exists in the knowledge base. For example, if the knowledge base includes reference entities such as "loved", it can be determined through the matching operation that the reference entities matching the video description information include "loved", and "loved" can then be used as an entity in the entity set of the initial video.
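A minimal sketch of this text-matching step, assuming the knowledge base exposes its reference entities as an iterable of title strings (names and shapes hypothetical):

```python
from typing import Iterable, List

def match_entities(video_description: str,
                   reference_entities: Iterable[str]) -> List[str]:
    """Text matching as described above: a reference entity joins the
    entity set of a video when its name occurs verbatim in the video
    description information."""
    return [entity for entity in reference_entities
            if entity in video_description]

# Example from the text: both "Red" and "Red Dragon" match.
# match_entities('"Red Dragon" is an action blockbuster ...',
#                ["Red", "Red Dragon", "Mountain Sea Condition"])
# -> ["Red", "Red Dragon"]
```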
In one implementation, after determining the entity set of any initial video, the entity information corresponding to each entity in the entity set may be further determined. The knowledge base may also include the entity attributes of each reference entity, and a reference entity may have one or more entity attributes; the reference entities and their corresponding entity attributes may be stored in the knowledge base in the form of a knowledge graph, so that the corresponding entity attributes can be quickly queried based on a reference entity, speeding up data processing. These entity attributes can be used to uniquely represent the characteristics of an entity and can therefore distinguish between various ambiguous entities; for example, to distinguish between various film and television work entities, the entity attributes may refer to the release year, release region, director, actors, scenario subject, language, and so on. The text recognition model used subsequently requires natural language text as input, and this input includes the entity information referred to here; the entity information of an entity can be constructed from the entity attributes of that entity. In order to represent the entity attributes in natural language, the entity attributes of an entity can be combined according to a preset attribute combination rule to obtain a natural language text, namely the entity information.
Optionally, for any entity in the entity set of an initial video, one or more entity attributes corresponding to that entity may be acquired from the knowledge base, and after the entity attributes are acquired, the entity information of that entity may be determined based on them. For example, the one or more entity attributes and the entity may be combined according to a preset attribute combination rule to obtain the entity information of the entity. The preset attribute combination rule may be preconfigured; for example, it may specify that the entity attributes and the entity are added to a corresponding entity information template, and the filled template is used as the entity information of the entity. In one embodiment, the entity information template may be regarded as being composed of an entity attribute template and an entity template, where the entity attribute template corresponds to the entity attributes and is used for adding the entity attributes, and the entity template corresponds to the entity and is used for adding the entity name. Optionally, the entity attribute templates corresponding to different types of entity attributes may differ, and an entity attribute template may be composed of one or more entity attribute sub-templates.
The entity attribute sub-templates are described first below, taking the release year, release region, director, actor, scenario subject, and language as example entity attributes.
S1, release year: if the entity attributes contain a release year, the corresponding entity attribute sub-template may be "released in { release year }".
S2, release region: if the entity attributes contain a release region, and the release region is not meaningless information such as "other", the corresponding entity attribute sub-template may be "in { release region }".
S3, director: if the entity attributes contain director information (such as a director's name), the corresponding entity attribute sub-template may be "directed by { director }"; if the director information contains multiple directors, they may be connected with "," and a word such as "etc." may be appended.
S4, actor: if the entity attributes contain actor information (such as an actor's name), the corresponding entity attribute sub-template may be "starring { actor }"; if the actor information contains multiple actors, they may be connected with "," and a word such as "etc." may be appended.
S5, scenario subject: if the entity attributes contain a scenario subject (such as suspense, drama, action, etc.), the corresponding entity attribute sub-template may be "a { scenario subject } piece"; if there are multiple scenario subjects, they may be connected with "and".
S6, language: if the entity attributes contain language information (e.g., Mandarin, English, Cantonese, etc.), the corresponding entity attribute sub-template may be "{ language } version".
For an entity attribute template containing one entity attribute, the entity attribute template is that entity attribute sub-template; for an entity attribute template containing multiple entity attributes, the entity attribute template can be constructed from the corresponding entity attribute sub-templates. For example, if the entity attributes include the release year and the release region, the entity attribute template may be composed of both, such as "released in { release year }, in { release region }". As another example, if the entity attributes include the director and the actor, the entity attribute template may be "directed by { director }, starring { actor }".
When an entity attribute template containing multiple entity attributes is constructed, the corresponding entity attribute sub-templates can be appended in sequence in the order of S1 to S6 above. For example, if the entity attributes include the release year, the director, and the scenario subject, the entity attribute template may be "released in { release year }, directed by { director }, a { scenario subject } piece". As another example, if the entity attributes include the release year, release region, director, actor, scenario subject, and language, the entity attribute template may be "released in { release year }, in { release region }, directed by { director }, starring { actor }, a { scenario subject } piece, { language } version".
For the entity template in the entity information template, the entity template may be appended after the entity attribute template; for example, the entity information template may be "released in { release year }, in { release region }, directed by { director }, starring { actor }, a { scenario subject } piece, { language } version, { entity }". When corresponding entity information is generated using the entity information template, the entity name of the entity is added at the { entity } position; for example, if the entity is "Mountain Sea Condition", "Mountain Sea Condition" is added to the entity template.
For example, taking several entities as examples, the following entity information may be generated based on the above-described method for generating entity information:
(1) Entity information for the entity "Loving Friends": released in 2019 in China, directed by Xiang Er, Qu Yi, etc., starring Wang Yi, Wang San, Hu Yi, Li Er, etc., Mandarin version, "Loving Friends".
(2) Entity information for the entity "Student Age": released in 2021 in China, directed by Hu Yi, etc., starring Li Yi, Wang Yi, Chi Yi, Wang Er, Li Er, Yang Wu, etc., Mandarin version, "Student Age".
(3) Entity information for the entity "Rainy Night": released in 2010, directed by Luo San, etc., Spanish version, "Rainy Night".
(4) Entity information for the entity "Fighter": released in 1994 in Japan, directed by Zhi San, etc., starring Gao Er, Si Liu, Shen Wu, Su Si, Pan Wu, Mu Er, etc., an action, romance, animation and fantasy piece, Japanese version, "Fighter".
(5) Entity information for the entity "Evergreen Tree": released in 2014 in the United States, starring An Yi, Zhan Wu, Qiao Er, An San, Ai Qi, etc., a drama piece, English version, "Evergreen Tree".
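A minimal sketch of this attribute combination rule, assuming a knowledge-graph record is available as a plain dictionary and rendering the sub-templates in English (the original templates are Chinese):

```python
from typing import Dict, List

def _join_names(names: List[str]) -> str:
    # Multiple names are connected with "," and "etc." is appended (S3/S4).
    return ", ".join(names) + (", etc." if len(names) > 1 else "")

def build_entity_information(entity: str, attrs: Dict) -> str:
    """Concatenate the entity attribute sub-templates in the S1..S6
    order and append the entity name, yielding natural-language
    entity information."""
    parts = []
    if attrs.get("release_year"):                       # S1
        parts.append(f"released in {attrs['release_year']}")
    region = attrs.get("release_region")
    if region and region != "other":                    # S2: skip meaningless regions
        parts.append(f"in {region}")
    if attrs.get("directors"):                          # S3
        parts.append("directed by " + _join_names(attrs["directors"]))
    if attrs.get("actors"):                             # S4
        parts.append("starring " + _join_names(attrs["actors"]))
    if attrs.get("subjects"):                           # S5
        parts.append("a " + " and ".join(attrs["subjects"]) + " piece")
    if attrs.get("language"):                           # S6
        parts.append(f"{attrs['language']} version")
    return ", ".join(parts) + f', "{entity}"'

# build_entity_information("Evergreen Tree",
#     {"release_year": 2014, "release_region": "the United States",
#      "actors": ["An Yi", "Zhan Wu", "Qiao Er"], "subjects": ["drama"],
#      "language": "English"})
# -> 'released in 2014, in the United States, starring An Yi, Zhan Wu,
#     Qiao Er, etc., a drama piece, English version, "Evergreen Tree"'
```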
S203: and determining an input text based on the video description information of each initial video and the entity information corresponding to each initial video so as to obtain an input text set corresponding to the initial video set.
The input text set may include one or more input texts, where one input text corresponds to one video description information and entity information of one entity.
In one implementation, as can be seen from the foregoing, step S202 completes the conversion of the entity part of the input text (to be fed into the text recognition model) into natural language; the remaining part of the input text is then also converted into natural language, so as to obtain an input text meeting the requirements of the text recognition model. Alternatively, the input text may be determined based on the video description information of each initial video and the entity information corresponding to each initial video; that is, one input text is constructed from the video description information of one initial video and one piece of entity information corresponding to that initial video. For example, a piece of video description information and a piece of entity information may be combined according to a preset information combination rule to generate an input text. The preset information combination rule may be preconfigured; for example, it may specify that a piece of video description information and a piece of entity information are added to an information template, and the filled information template is one input text. Based on the information template, the entity information and the video description information can be converted into a natural language text; for example, the information template may be: "{ entity information } { } a clip of { video description information }", and the video description information and the entity information may be combined as described above to obtain the corresponding input text. The { } in the information template may be used to represent the category recognition result corresponding to the input text, that is, category recognition subsequently needs to be performed based on this position to obtain the category recognition result. In an actual video recommendation scene, the data at this position does not yet exist; it is the category recognition result obtained by category recognition.
Then, through the information template described above, the video description information of any initial video can be combined with the entity information of each entity in the entity set of that initial video to obtain a plurality of input texts; by combining the video description information of each initial video with the entity information corresponding to that initial video, the input text set corresponding to the initial video set is obtained.
It should be noted that "clip" in the information template may be replaced by other data, as long as the finally generated input text remains a fluent natural language text; for example, "clip" may be replaced by "scenario" or other data.
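A minimal sketch of this information combination rule (the English template wording and the [MASK] placeholder are assumptions; the original template is Chinese):

```python
MASK = "[MASK]"  # placeholder at the specified position { }

def build_input_text(entity_info: str, video_description: str,
                     template: str = "{e} {m} a clip of {d}") -> str:
    """Combine one piece of entity information with one piece of video
    description information via the information template; the word later
    predicted at the {m} position yields the category recognition result."""
    return template.format(e=entity_info, m=MASK, d=video_description)
```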
S204: and carrying out category identification on each input text to obtain a category identification result of each input text.
In one implementation, a text recognition model may be invoked to perform category recognition on each input text to obtain the category recognition result of each input text. The category recognition result may be used to characterize whether the entity information is relevant or not relevant to the video description information; that is, the category recognition result may be relevant or not relevant. The category recognition result may be determined based on the data filled into the input text at a specified position, which may refer to the { } position in "{ entity information } { } a clip of { video description information }". When the relevance between the entity information and the video description information is described based on the information template, the data corresponding to the specified position may be words such as "has"/"lacks" or "yes"/"no"; positive words such as "has" and "yes" may indicate that the entity information is relevant to the video description information, and negative words such as "lacks" and "no" may indicate that the entity information is not relevant to the video description information. The data at the specified position in the input text is predicted by invoking the text recognition model.
For example, if the data predicted by the text recognition model is "has" (or "yes"), the complete input text may be: "{ entity information } has a clip of { video description information }"; based on the input text after data filling, it can be seen that the entity corresponding to the entity information contains the clip described by the video description information, that is, the entity information is relevant to the video description information, and the corresponding category recognition result is relevant. As another example, if the predicted data is "lacks" (or "no"), the complete input text may be: "{ entity information } lacks a clip of { video description information }"; based on the input text after data filling, it can be seen that the entity corresponding to the entity information does not contain the clip described by the video description information, that is, the entity information is not relevant to the video description information, and the corresponding category recognition result is not relevant.
The text recognition model may be obtained by further training a pre-trained model, which may be a model capable of text recognition, such as a BERT model or a RoBERTa model; other models may also be used, which are not enumerated here.
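A minimal sketch of invoking such a model for category recognition, using the Hugging Face fill-mask pipeline; the checkpoint name and the English label words "yes"/"no" are stand-ins (the actual model would be the fine-tuned text recognition model described later, with Chinese templates and label words):

```python
from transformers import pipeline

# "bert-base-uncased" stands in for the fine-tuned text recognition model.
fill_mask = pipeline("fill-mask", model="bert-base-uncased")

def recognize_category(input_text: str) -> str:
    """Predict the word at the [MASK] position and map a positive word
    to "relevant" and a negative word to "not relevant"."""
    predictions = fill_mask(input_text, targets=["yes", "no"])
    best = max(predictions, key=lambda p: p["score"])
    return "relevant" if best["token_str"] == "yes" else "not relevant"
```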
S205: and determining target input text from the input text set, and associating initial video corresponding to the video description information included in the target input text with the search information.
The target input text may include: an input text whose category recognition result is the target category recognition result, where the target category recognition result may refer to: the entity information being relevant to the video description information.
In one implementation, if the present application is applied to a video recommendation scene, the initial video corresponding to the video description information included in the target input text may be determined as a video to be recommended, so as to perform video recommendation. It can be seen that the input texts can be screened based on the category recognition result of each input text, so that entity disambiguation can be realized; that is, the entities actually related to the initial video can be screened out from a plurality of entities, thereby achieving higher-precision entity linking.
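Putting steps S204 and S205 together, a brief sketch follows (it reuses the recognize_category sketch above; the data shapes and the association store are hypothetical):

```python
from typing import Dict, List

# Hypothetical store mapping search information to recalled videos,
# as suggested for reuse in later searches and in model training.
association_store: Dict[str, List[str]] = {}

def associate_videos(search_info: str,
                     input_texts: List[Dict]) -> List[str]:
    """Keep the initial videos whose input texts are recognized as
    relevant and associate them with the search information."""
    recalled = [item["video_id"] for item in input_texts
                if recognize_category(item["text"]) == "relevant"]
    association_store.setdefault(search_info, []).extend(recalled)
    return recalled
```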
In one implementation, after the video to be recommended is obtained, the video to be recommended may be displayed on a video search interface. For example, the video to be recommended may be displayed in a result display area labeled 302 in FIG. 3 a.
Based on practical verification comparing existing information processing methods (such as keyword search) with the information processing method of the present application, the method of the present application can obtain more accurate video recommendations; that is, the recommended videos are usually the movie and television fragments required by the user. See, for example, fig. 3b: as can be seen from fig. 3b, compared with the videos marked 305 obtained by the existing information processing method, the present method can filter the videos and obtain more accurate videos.
In the embodiment of the application, search information input by a user can be acquired, an initial video set can be determined based on the search information, the video description information of each initial video can be acquired, and the entity set of each initial video and the entity information of each entity in the entity set can be determined according to the video description information of each initial video. Further, input texts can be determined based on the video description information of each initial video and the entity information corresponding to each initial video, so as to obtain an input text set corresponding to the initial video set. Category recognition is then performed on each input text to obtain the category recognition result of each input text, and a target input text is determined from the input text set, so as to determine the initial videos associated with the search information according to the target input text. By implementing the above method, entities such as film and television works and their characters can be analyzed from the user's search information, so that those entities are involved in the video content of the recalled videos obtained based on the search information; this effectively enhances the relevance between the search information and the recalled videos, facilitates more accurate video recommendation, and improves user experience. Meanwhile, an entity can be uniquely represented based on various attribute characteristics, thereby realizing entity disambiguation; recognition is further performed on the text formed from the entity part and the video description part, so that the entity truly related to a video can be screened out from a plurality of similar entities, the inaccuracy caused by similar entities when recalling videos is avoided, and higher-precision entity linking is achieved. Applied to a video recommendation scene, this also improves the accuracy of video recommendation.
Referring to fig. 4, fig. 4 is a flow chart of an information processing method according to an embodiment of the present application. The information processing method described in the present embodiment, which can be executed by a computer device, which can be a terminal or a server; alternatively, the information processing method may be performed by a terminal and a server together. For convenience of explanation, the embodiment of the application will be described by taking a computer device to execute the information processing method as an example; it may comprise the following steps S401-S406:
S401: Search information input by a user is acquired, and an initial video set is determined based on the search information.
S402: and acquiring video description information of each initial video, and determining an entity set of each initial video and entity information of each entity in the entity set according to the video description information of each initial video.
S403: and determining an input text based on the video description information of each initial video and the entity information corresponding to each initial video so as to obtain an input text set corresponding to the initial video set.
S404: and acquiring a training text set, and training the pre-training model based on the training text set to obtain a text recognition model.
Wherein the training text set may comprise one or more training texts.
In one implementation, the training video description information of each training video in a training video set may first be acquired, so that subsequent operations such as determining the training entity sets can be performed based on the training video description information of each training video, thereby obtaining the training text set. The training videos in the training video set may be obtained from an existing video library, and the training video description information of a training video is analogous to the video description information of an initial video, which is not repeated here. Considering that the training video description information of each training video in the training video set is processed identically, any one training video in the training video set is taken as an example below.
For any training video in the training video set, the training entity set of that training video and the training entity information of each training entity in the training entity set can be determined according to the training video description information of that training video. Training texts are then determined based on the training video description information of each training video and the corresponding training entity information, so as to obtain the training text set corresponding to the training video set. Likewise, one training text corresponds to one piece of training video description information and the training entity information of one training entity. The method for determining the training entity set and the training entity information of each training entity may refer to the above method for determining the entity set and the entity information of each entity in the entity set, which is not repeated here.
In one implementation, when determining a training text, the training video description information of any training video and the training entity information corresponding to that training video may be combined according to the preset information combination rule, so as to obtain a training text, which is added to the training text set. The preset information combination rule may refer to the above description; the training text may likewise be generated based on the information template "{ entity information } { } a clip of { video description information }" described in step S203. Unlike the information template used at inference time, the information template corresponding to the model training phase may specifically be "{ training entity information } { has/lacks } a clip of { training video description information }", where the data at the specified position may be "has" or "lacks", and the data corresponding to the specified position indicates the training category recognition result of the training text. For example, if the data corresponding to the specified position is "has", the training category recognition result of the training text is relevant (i.e., the training entity information is relevant to the training video description information); if the data corresponding to the specified position is "lacks", the training category recognition result of the training text is not relevant (i.e., the training entity information is not relevant to the training video description information).
Optionally, the training category recognition result of each training text may be manually annotated; that is, the training category recognition result corresponding to each training text may be predetermined, and the data corresponding to that result is filled into the specified position of the training text. If the training category recognition result of a training text is relevant, the data at the specified position of the training text is "has"; if the training category recognition result is not relevant, the data at the specified position is "lacks".
As can be seen from the above, the training text set corresponding to the training video set can be obtained based on the information template "{ training entity information } { has/lacks } a clip of { training video description information }". Optionally, to enhance the robustness of the text recognition model, the information templates may also be enriched, so as to obtain more training texts for training the pre-training model based on diverse information templates. For example, the information template may be enriched by synonymously transforming related data in the information template to obtain a new information template, so that the training sample set can be obtained based on both the original and the new information templates; for instance, "clip" in the information template may be replaced with "episode", and the label words at the specified position may be replaced with synonyms. In one embodiment, the training video description information of any training video and the corresponding training entity information may be combined according to the preset information combination rule to obtain an initial training text; the target data in the initial training text is then synonymously transformed to obtain a transformed training text. The target data may refer to the related data used when the information template is converted into the new information template through synonymous transformation, for example "clip", "has", and "lacks". After the initial training text and its transformed training text are obtained, both can be added to the training text set as training texts.
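A minimal sketch of this training text construction, including the label filling and one synonymous transformation (the English wording is a stand-in for the Chinese templates):

```python
from typing import List

def build_training_texts(entity_info: str, video_description: str,
                         relevant: bool) -> List[str]:
    """Fill the manually annotated label word into the specified
    position, then add a synonymously transformed variant to enrich
    the training text set."""
    label = "has" if relevant else "lacks"
    initial = f"{entity_info} {label} a clip of {video_description}"
    # Synonymous transformation of the target data: "clip" -> "episode".
    transformed = initial.replace("a clip of", "an episode of")
    return [initial, transformed]
```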
In one implementation, after the training text set is obtained, the pre-training model may be trained based on it to obtain the text recognition model. The pre-training model is a masked language model (Masked Language Modeling, MLM), for example a BERT model or a RoBERTa model, which masks certain words in a text and then uses the model to predict the masked words. In this application, the category recognition of an input text, which is a combination of video description information and entity information, can be understood as a classification task that outputs a classification result (i.e., the category recognition result: relevant or irrelevant). To make full use of the data processing mode of the pre-training model, the classification task can be converted into a cloze (fill-in-the-blank) task solved by the masked language model: the input data and the category recognition result are converted by a template into a piece of natural language text, the word corresponding to the category recognition result is hidden (i.e., masked) in that text, the hidden word is then predicted by the masked language model, and the corresponding category recognition result is determined from the prediction result.
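As a toy illustration of this conversion (not the patent's own code), the label slot of the template is simply replaced by the mask token so the MLM can predict the hidden word; the `[MASK]` string follows the BERT-family convention, and the texts are hypothetical placeholders.

```python
# Toy illustration: recast the classification as a cloze task by masking
# the label slot of the template.
def to_cloze(entity_info: str, video_desc: str) -> str:
    return f"{entity_info}[MASK]{video_desc} fragment"

cloze = to_cloze("'Senior' is a 2001 comedy...", "funniest scenes of...")
# The MLM is then asked to fill [MASK] with "presence" or "absence".
```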
It will be appreciated that a pre-training model is typically trained on a large-scale corpus covering news, community question answering, encyclopedia data, and the like. To make the pre-training model applicable to the text recognition task in the present application, it may be fine-tuned to obtain the text recognition model of the present application, where fine-tuning refers to training the pre-training model with the training text set. Considering that pre-training models such as BERT take a language model as their training task, converting the video recognition task into a text recognition task fits the original task of the pre-training model better, so the model can be trained with a small training text set, which saves a great deal of the manual labeling cost of constructing one. Practice shows that, at the same entity disambiguation precision as the traditional scheme, model training in the present application may require only about 5% of the labeled samples of the traditional scheme.
In one implementation, when training the pre-training model, the designated position in each training text is masked. After masking, a training probability set corresponding to all reference data in the reference dictionary associated with the pre-training model can be obtained, and the pre-training model is trained based on these training probability sets to obtain the text recognition model. A training probability set includes the training occurrence probability of each reference data in the reference dictionary at the designated position. The reference dictionary may include a large amount of reference data in a fixed order; that is, the position of each reference data in the reference dictionary is fixed, and the order of the training occurrence probabilities in the training probability set produced by the pre-training model is consistent with that fixed order. Therefore, after the training probability set is obtained, the training occurrence probability corresponding to each reference data can be determined from the position of each training occurrence probability in the training probability set and the position of each reference data in the reference dictionary. Based on this, optionally, the text corresponding to each reference data in the reference dictionary may be converted into a data identifier that uniquely indicates that reference data; after the training probability set is obtained, the reference data corresponding to a training occurrence probability can then be determined from the data identifier corresponding to that probability. The data identifier may be determined according to the position of the reference data in the reference dictionary; for example, each reference data may be numbered sequentially by its position, and the number serves as its data identifier.
For example, the reference data in a reference dictionary may be understood as reference words. Let the reference dictionary be A = [A1, A2, …, Ak, …, An] and the training probability set be p = [p1, p2, …, pk, …, pn], where Ak denotes a reference word in the reference dictionary A and the subscript k denotes its position (i.e., Ak is at the k-th position of A), and pk denotes a training occurrence probability in the training probability set p at the k-th position of p. Each reference word in the reference dictionary corresponds one-to-one to a training occurrence probability in the training probability set; that is, the training occurrence probability corresponding to a reference word at a certain position in the reference dictionary is the training occurrence probability at the same position in the training probability set. For example, the reference word A2 is at the second position of the reference dictionary, so to determine its training occurrence probability it is only necessary to read the probability at the second position of the training probability set: the training occurrence probability of A2 is p2.
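A standalone toy example of this position-based lookup, with an invented four-word dictionary and made-up probabilities purely for illustration:

```python
# Toy example: each reference word's data identifier is its fixed position
# in the reference dictionary, so the probability at that index is its
# training occurrence probability.
reference_dictionary = ["presence", "absence", "fragment", "episode"]  # invented
data_id = {word: i for i, word in enumerate(reference_dictionary)}
training_probs = [0.72, 0.11, 0.09, 0.08]  # model output, same order as the dictionary
print(training_probs[data_id["presence"]])  # -> 0.72
```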
For a better understanding of the model training method provided in the embodiments of the present application, further description is given below with reference to the flowchart shown in fig. 5, taking the RoBERTa model as the pre-training model. As shown in fig. 5, before training the pre-training model, the training sample set corresponding to the model must be determined in advance; the following mainly describes the generation of a training text. First, the training video description information of each training video in the training video set is acquired. Then, for any training video in the set, the training entity set of that video is determined from its training video description information: the training video description information may be matched against a plurality of reference entities in the knowledge base, and the matched reference entities are taken as the training entities of that video, yielding the training entity set. After the training entity set is obtained, the entity attributes corresponding to each training entity can be obtained from the knowledge base. For example, referring to fig. 5, the entity attributes of the entity "Senior" include: actors (Guan One, Ku Three, Ying Five), director (Feng Er), release year (2001), genre (comedy), release region (mainland), language (Mandarin).
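A hedged sketch of this knowledge-base matching follows, assuming a naive substring match and an in-memory dictionary; both the matching strategy and the data layout are illustrative assumptions rather than the patent's implementation.

```python
# Hypothetical knowledge base and a naive substring matcher; a real system
# would likely use entity linking rather than substring search.
knowledge_base = {
    "Senior": {"actors": ["Guan One", "Ku Three", "Ying Five"],
               "director": "Feng Er", "year": "2001", "genre": "comedy",
               "region": "mainland", "language": "Mandarin"},
}

def match_entities(video_description: str) -> list[str]:
    # A reference entity matches if its name appears in the description.
    return [name for name in knowledge_base if name in video_description]

print(match_entities("funniest scenes of Senior"))  # -> ['Senior']
```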
Further, the obtained entity attributes and the training entity are combined to obtain the training entity information of the training entity; for example, the training entity information may read: "'Senior' is a comedy released in the mainland in 2001, directed by Feng Er and starring Guan One, Ku Three and Ying Five". In this way the entity input part of the pre-training model is converted into natural language, and the other input parts corresponding to the pre-training model are likewise converted into natural language. In a specific implementation, the training video description information and the training entity information may be combined into a piece of natural language text, and this natural language text is a training text; for example, they may be combined according to the preset information combination rule. The training information template corresponding to the preset information combination rule may be "{training entity information}{presence/absence}{training video description information} fragment", where "presence/absence" in "{presence/absence}" indicates the label corresponding to the training text (i.e., the training category recognition result). To enhance the robustness of the model, the training information templates can be enriched; for example, the information in a template can undergo synonym replacement to generate a new training information template, such as replacing "fragment" with "episode", "presence" with "yes", and "absence" with "no". In this way the training text set can be obtained, and, considering the data processing mode (masking) of the pre-training model, the data representing the training category recognition result in each training text can be replaced by [MASK] to facilitate training.
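Continuing the sketch above, combining attributes into a natural-language entity description might look like this; the sentence pattern is an assumption, not the patent's wording.

```python
# Sketch: combine a matched entity and its knowledge-base attributes into
# natural-language training entity information.
def build_entity_info(name: str, attrs: dict) -> str:
    return (f"'{name}' is a {attrs['genre']} released in {attrs['region']} "
            f"in {attrs['year']}, directed by {attrs['director']} and "
            f"starring {', '.join(attrs['actors'])}")

print(build_entity_info("Senior", knowledge_base["Senior"]))
```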
The processed training text set is then input into the pre-training model (the RoBERTa model) to continue training on the model's original MLM task. The specific steps may be as follows:
a. Initialize the RoBERTa model, i.e., load the RoBERTa model to initialize its model parameters.
b. Convert each piece of reference data in the reference dictionary associated with the pre-training model into a data identifier, so that the training occurrence probability of each piece of reference data can be determined from the probability corresponding to its data identifier.
c. Fine-tune (train) the pre-training model with the processed training text set. Since this is fine-tuning, the learning rate corresponding to the pre-training model can be set relatively small; for example, the initial learning rate can be set to 1e-5.
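A minimal PyTorch/transformers sketch of steps a–c follows, assuming a Hugging Face RoBERTa-style checkpoint; the checkpoint name, the label tokens, and the omitted data loading are illustrative assumptions rather than the patent's implementation.

```python
# Hypothetical fine-tuning sketch; only the masked label position
# contributes to the MLM loss (-100 marks positions ignored by the loss).
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tok = AutoTokenizer.from_pretrained("hfl/chinese-roberta-wwm-ext")       # step a
model = AutoModelForMaskedLM.from_pretrained("hfl/chinese-roberta-wwm-ext")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)               # step c

def train_step(masked_text: str, label_token: str) -> float:
    enc = tok(masked_text, return_tensors="pt")
    labels = torch.full_like(enc.input_ids, -100)
    mask_positions = enc.input_ids == tok.mask_token_id
    labels[mask_positions] = tok.convert_tokens_to_ids(label_token)      # step b
    loss = model(**enc, labels=labels).loss
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return loss.item()
```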
S405: and calling a text recognition model to perform category recognition on each input text, and obtaining a category recognition result of each input text.
In one implementation, taking any input text in the input text set as an example, the text recognition model can be used to mask the designated position in that input text, obtaining a probability set corresponding to all reference data in the reference dictionary associated with the text recognition model. The data corresponding to the designated position in the input text is used to indicate the category recognition result of the input text, and the probability set includes: the occurrence probability of each reference data in the reference dictionary at the designated position. Then, the data corresponding to the designated position in the input text can be determined based on the probability set. Optionally, considering that in the present application the category recognition result may indicate either that the entity information in the input text is relevant to the video description information or that it is not, affirmative words such as "presence" or "yes" may indicate relevance, and negative words such as "absence" or "no" may indicate irrelevance. The reference data usable in the reference dictionary to indicate the category recognition result may therefore be defined in advance; such reference data may be called specified data, and the category recognition result of the input text is then determined based on the occurrence probabilities corresponding to the specified data. For example, the specified data may be "yes" and "no", or "presence" and "absence", or other data; this application is not limited in this regard.
Then, after the probability set corresponding to all reference data in the reference dictionary is obtained, the occurrence probabilities of the specified data can be determined from it. Optionally, as shown above, each reference data corresponds to a fixed position in the reference dictionary, and the position of each occurrence probability in the probability set is consistent with the position of its reference data in the reference dictionary; therefore the reference data corresponding to each occurrence probability, and hence the occurrence probability corresponding to each specified data, can be determined from positions alone. Alternatively, if a data identifier has been predefined for each reference data in the reference dictionary, the specified data identifier corresponding to each specified data may be determined first, and the occurrence probability corresponding to each specified data identifier is then read from the probability set; the probability so obtained is the occurrence probability of the specified data. Further, after the occurrence probabilities of the several specified data are obtained from the probability set, the maximum occurrence probability can be determined among them, and the specified data corresponding to that maximum is taken as the category recognition result of the input text. For example, assume the specified data include "yes" and "no": if the occurrence probability of "yes" is greater than that of "no", the category recognition result of the input text is relevant; if it is smaller, the result is irrelevant.
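Continuing the fine-tuning sketch above, inference might compare the mask-position probabilities of the two specified label tokens and take the larger one; again, the label tokens are illustrative assumptions.

```python
# Inference sketch: the specified data with the larger occurrence
# probability at the mask position yields the category recognition result.
def classify(masked_text: str) -> str:
    enc = tok(masked_text, return_tensors="pt")
    with torch.no_grad():
        logits = model(**enc).logits
    pos = (enc.input_ids == tok.mask_token_id).nonzero()[0, 1]  # mask index
    probs = logits[0, pos].softmax(dim=-1)
    yes_id = tok.convert_tokens_to_ids("yes")
    no_id = tok.convert_tokens_to_ids("no")
    return "relevant" if probs[yes_id] > probs[no_id] else "irrelevant"
```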
S406: and determining target input text from the input text set, and associating initial video corresponding to the video description information included in the target input text with the search information.
The specific implementations of steps S401 to S403 and step S406 may refer to the specific descriptions of steps S201 to S203 and step S205 in the above embodiments, and are not repeated here.
In the embodiment of the application, the search information input by a user can be acquired, and the initial video set is determined based on the search information. Then, the video description information of each initial video is acquired, and the entity set of each initial video and the entity information of each entity in the entity set are determined from that video description information. An input text is then determined based on the video description information of each initial video and the corresponding entity information, yielding the input text set corresponding to the initial video set. A training text set can be obtained, and the pre-training model is trained on it to obtain the text recognition model; the text recognition model can then be called to perform category recognition on each input text, obtaining the category recognition result of each input text. Further, the target input text is determined from the input text set, and the initial video corresponding to the video description information included in the target input text is associated with the search information. By implementing this method, the entity part and the video description part associated with a video are formed into a text, and the text recognition model is called to recognize that text so that videos are recalled according to the recognition result; in this way, the video recall task is converted into a text recognition task handled by a text recognition model derived from the pre-training model. Because the information related to the video is converted into natural language during training, the resulting training text set fits the original task of the pre-training model better, so the final text recognition model achieves a good recognition effect with fewer training texts, while saving a great deal of the manual labeling cost of constructing a training set.
Referring to fig. 6, fig. 6 is a schematic structural diagram of an information processing apparatus according to an embodiment of the present application. The information processing apparatus described in the present embodiment includes:
an obtaining unit 601, configured to obtain search information input by a user, and determine an initial video set based on the search information, where the initial video set includes one or more initial videos;
a first determining unit 602, configured to obtain video description information of each initial video, and determine, according to the video description information of each initial video, an entity set of each initial video and entity information of each entity in the entity set;
a second determining unit 603, configured to determine an input text based on the video description information of each initial video and the entity information corresponding to each initial video, so as to obtain an input text set corresponding to the initial video set; the input text set comprises one or more input texts, wherein one input text corresponds to one video description information and entity information of one entity;
a recognition unit 604, configured to perform category recognition on each input text, so as to obtain a category recognition result of each input text;
an associating unit 605, configured to determine a target input text from the input text set, and associate an initial video corresponding to video description information included in the target input text with the search information; the target input text includes: the category recognition result is the input text of the target category recognition result.
In one implementation, the first determining unit 602 is specifically configured to:
for any initial video in the initial video set, matching the video description information of the any initial video with a plurality of reference entities in a knowledge base, and adding the matched reference entities into an entity set of the any initial video, wherein the entity set comprises one or more entities;
and constructing entity information of each entity by utilizing entity attributes of the entities in the knowledge base.
In one implementation, the first determining unit 602 is specifically configured to:
for any entity in the entity set, acquiring one or more entity attributes corresponding to the any entity from a knowledge base;
and combining the one or more entity attributes and any entity according to a preset attribute combination rule to obtain entity information of any entity.
In one implementation, the apparatus further includes a training unit 606, where the training unit 606 is specifically configured to:
acquiring a training text set, training a pre-training model based on the training text set to obtain a text recognition model, wherein the training text set comprises one or more training texts;
The identifying unit 604 is specifically configured to:
and calling the text recognition model to perform category recognition on each input text, and obtaining a category recognition result of each input text.
In one implementation, the identifying unit 604 is specifically configured to:
for any input text in the input text set, masking a designated position in the any input text by using the text recognition model to obtain a probability set corresponding to all reference data in a reference dictionary associated with the text recognition model; the data corresponding to the designated position of the any input text is used for indicating a category identification result of the any input text, and the probability set comprises: the occurrence probability of each reference data in the reference dictionary at the designated position;
and acquiring the occurrence probabilities of a plurality of pieces of specified data from the probability set, and taking the specified data corresponding to the maximum occurrence probability as a category identification result of any input text.
In one implementation, the training unit 606 is specifically configured to:
acquiring training video description information of each training video in a training video set;
for any training video in the training video set, determining a training entity set of the any training video and training entity information of each training entity in the training entity set according to the training video description information of the any training video;
Determining training texts based on training video description information of any training video and training entity information corresponding to each training video to obtain training text sets corresponding to the training video sets; the data corresponding to the appointed position of each training text is used for indicating the training category recognition result of each training text, and one training text corresponds to one training video description information and one training entity information of one training entity.
In one implementation, the training unit 606 is specifically configured to:
according to a preset information combination rule, combining training video description information of any training video with training entity information corresponding to each training video to obtain an initial training text;
performing synonymous conversion on target data in the initial training text to obtain a conversion training text of the initial training text;
and taking the initial training text and the converted training text of the initial training text as training texts, and adding the training texts into a training text set.
It will be appreciated that the division of the units in the embodiments of the present application is illustrative, and is merely a logic function division, and other division manners may be actually implemented. Each functional unit in the embodiments of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
Referring to fig. 7, fig. 7 is a schematic structural diagram of a computer device according to an embodiment of the present application. The computer device described in the present embodiment includes: a processor 701, a memory 702 and a network interface 703. Data may be interacted between the processor 701, the memory 702, and the network interface 703.
The processor 701 may be a central processing unit (Central Processing Unit, CPU) which may also be other general purpose processors, digital signal processors (Digital Signal Processor, DSP), application specific integrated circuits (Application Specific Integrated Circuit, ASIC), off-the-shelf programmable gate arrays (Field-Programmable Gate Array, FPGA) or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, or the like. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
The memory 702 may include read-only memory and random access memory, and provides program instructions and data to the processor 701. A portion of the memory 702 may also include non-volatile random access memory. The processor 701, when invoking the program instructions, is configured to perform the following:
Acquiring search information input by a user, and determining an initial video set based on the search information, wherein the initial video set comprises one or more initial videos;
acquiring video description information of each initial video, and determining an entity set of each initial video and entity information of each entity in the entity set according to the video description information of each initial video;
determining an input text based on the video description information of each initial video and the entity information corresponding to each initial video to obtain an input text set corresponding to the initial video set; the input text set comprises one or more input texts, wherein one input text corresponds to one video description information and entity information of one entity;
carrying out category identification on each input text to obtain a category identification result of each input text;
determining a target input text from the input text set, and associating an initial video corresponding to video description information included in the target input text with the search information; the target input text includes: the category recognition result is the input text of the target category recognition result.
In one implementation, the processor 701 is specifically configured to:
for any initial video in the initial video set, matching the video description information of the any initial video with a plurality of reference entities in a knowledge base, and adding the matched reference entities into an entity set of the any initial video, wherein the entity set comprises one or more entities;
and constructing entity information of each entity by utilizing entity attributes of the entities in the knowledge base.
In one implementation, the processor 701 is specifically configured to:
for any entity in the entity set, acquiring one or more entity attributes corresponding to the any entity from a knowledge base;
and combining the one or more entity attributes and any entity according to a preset attribute combination rule to obtain entity information of any entity.
In one implementation, the processor 701 is specifically configured to:
acquiring a training text set, training a pre-training model based on the training text set to obtain a text recognition model, wherein the training text set comprises one or more training texts;
and calling the text recognition model to perform category recognition on each input text, and obtaining a category recognition result of each input text.
In one implementation, the processor 701 is specifically configured to:
for any input text in the input text set, masking a designated position in the any input text by using the text recognition model to obtain a probability set corresponding to all reference data in a reference dictionary associated with the text recognition model; the data corresponding to the designated position of the any input text is used for indicating a category identification result of the any input text, and the probability set comprises: the occurrence probability of each reference data in the reference dictionary at the designated position;
and acquiring the occurrence probabilities of a plurality of pieces of specified data from the probability set, and taking the specified data corresponding to the maximum occurrence probability as a category identification result of any input text.
In one implementation, the processor 701 is specifically configured to:
acquiring training video description information of each training video in a training video set;
for any training video in the training video set, determining a training entity set of the any training video and training entity information of each training entity in the training entity set according to the training video description information of the any training video;
Determining training texts based on training video description information of any training video and training entity information corresponding to each training video to obtain training text sets corresponding to the training video sets; the data corresponding to the appointed position of each training text is used for indicating the training category recognition result of each training text, and one training text corresponds to one training video description information and one training entity information of one training entity.
In one implementation, the processor 701 is specifically configured to:
according to a preset information combination rule, combining training video description information of any training video with training entity information corresponding to each training video to obtain an initial training text;
performing synonymous conversion on target data in the initial training text to obtain a conversion training text of the initial training text;
and taking the initial training text and the converted training text of the initial training text as training texts, and adding the training texts into a training text set.
An embodiment of the application further provides a computer storage medium storing program instructions which, when executed, may implement some or all of the steps of the information processing method in the embodiments corresponding to fig. 2 or fig. 4.
It should be noted that, for simplicity of description, the foregoing method embodiments are all described as a series of action combinations, but those skilled in the art should understand that the present application is not limited by the described order of actions, as some steps may be performed in other orders or simultaneously according to the present application. Further, those skilled in the art will also appreciate that the embodiments described in the specification are all preferred embodiments, and that the actions and modules involved are not necessarily required by the present application.
Those of ordinary skill in the art will appreciate that all or part of the steps in the various methods of the above embodiments may be implemented by a program to instruct related hardware, the program may be stored in a computer readable storage medium, and the storage medium may include: flash disk, read-Only Memory (ROM), random-access Memory (Random Access Memory, RAM), magnetic or optical disk, and the like.
Embodiments of the present application also provide a computer program product or computer program comprising computer instructions stored in a computer-readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions, so that the computer device performs the steps performed in the embodiments of the methods described above.
The foregoing has described in detail the information processing method, apparatus, computer device and storage medium provided in the embodiments of the present application. Specific examples have been used herein to illustrate the principles and embodiments of the present application, and the above descriptions of the embodiments are only intended to assist in understanding the method and core ideas of the present application. Meanwhile, since those skilled in the art may make modifications to the specific embodiments and the application scope in accordance with the ideas of the present application, the content of this specification should not be construed as limiting the present application.

Claims (10)

1. An information processing method, characterized by comprising:
acquiring search information input by a user, and determining an initial video set based on the search information, wherein the initial video set comprises one or more initial videos;
acquiring video description information of each initial video, and determining an entity set of each initial video and entity information of each entity in the entity set according to the video description information of each initial video;
determining an input text based on the video description information of each initial video and the entity information corresponding to each initial video to obtain an input text set corresponding to the initial video set; the input text set comprises one or more input texts, wherein one input text corresponds to one video description information and entity information of one entity;
Carrying out category identification on each input text to obtain a category identification result of each input text;
determining a target input text from the input text set, and associating an initial video corresponding to video description information included in the target input text with the search information; the target input text includes: the category recognition result is the input text of the target category recognition result.
2. The method of claim 1, wherein determining the set of entities for each initial video and the entity information for each entity in the set of entities based on the video description information for each initial video comprises:
for any initial video in the initial video set, matching the video description information of the any initial video with a plurality of reference entities in a knowledge base, and adding the matched reference entities into an entity set of the any initial video, wherein the entity set comprises one or more entities;
and constructing entity information of each entity by utilizing entity attributes of the entities in the knowledge base.
3. The method of claim 2, wherein said constructing entity information for each entity using entity attributes of entities in the knowledge base comprises:
For any entity in the entity set, acquiring one or more entity attributes corresponding to the any entity from a knowledge base;
and combining the one or more entity attributes and any entity according to a preset attribute combination rule to obtain entity information of any entity.
4. The method according to claim 1, wherein before the performing category recognition on each input text to obtain a category recognition result of each input text, the method further comprises:
acquiring a training text set, training a pre-training model based on the training text set to obtain a text recognition model, wherein the training text set comprises one or more training texts;
the step of carrying out category identification on each input text to obtain a category identification result of each input text, which comprises the following steps:
and calling the text recognition model to perform category recognition on each input text, and obtaining a category recognition result of each input text.
5. The method of claim 4, wherein said invoking the text recognition model to perform category recognition on each input text to obtain a category recognition result for each input text comprises:
For any input text in the input text set, masking a designated position in the any input text by using the text recognition model to obtain a probability set corresponding to all reference data in a reference dictionary associated with the text recognition model; the data corresponding to the designated position of the any input text is used for indicating a category identification result of the any input text, and the probability set comprises: the occurrence probability of each reference data in the reference dictionary at the designated position;
and acquiring the occurrence probabilities of a plurality of pieces of specified data from the probability set, and taking the specified data corresponding to the maximum occurrence probability as a category identification result of any input text.
6. The method of claim 4, wherein the obtaining a training text set comprises:
acquiring training video description information of each training video in a training video set;
for any training video in the training video set, determining a training entity set of the any training video and training entity information of each training entity in the training entity set according to the training video description information of the any training video;
Determining training texts based on training video description information of any training video and training entity information corresponding to each training video to obtain training text sets corresponding to the training video sets; the data corresponding to the appointed position of each training text is used for indicating the training category recognition result of each training text, and one training text corresponds to one training video description information and one training entity information of one training entity.
7. The method of claim 6, wherein determining training text based on the training video description information of any one of the training videos and the training entity information corresponding to each of the training videos to obtain a training text set comprises:
according to a preset information combination rule, combining training video description information of any training video with training entity information corresponding to each training video to obtain an initial training text;
performing synonymous conversion on target data in the initial training text to obtain a conversion training text of the initial training text;
and taking the initial training text and the converted training text of the initial training text as training texts, and adding the training texts into a training text set.
8. An information processing apparatus, characterized by comprising:
the system comprises an acquisition unit, a search unit and a display unit, wherein the acquisition unit is used for acquiring search information input by a user, and determining an initial video set based on the search information, wherein the initial video set comprises one or more initial videos;
the first determining unit is used for acquiring video description information of each initial video and determining an entity set of each initial video and entity information of each entity in the entity set according to the video description information of each initial video;
the second determining unit is used for determining an input text based on the video description information of each initial video and the entity information corresponding to each initial video so as to obtain an input text set corresponding to the initial video set; the input text set comprises one or more input texts, wherein one input text corresponds to one video description information and entity information of one entity;
the recognition unit is used for carrying out category recognition on each input text to obtain a category recognition result of each input text;
the association unit is used for determining a target input text from the input text set and associating an initial video corresponding to the video description information included in the target input text with the search information; the target input text includes: the category recognition result is the input text of the target category recognition result.
9. A computer device comprising a processor, a memory and a network interface, the processor, the memory and the network interface being interconnected, wherein the memory is adapted to store a computer program comprising program instructions, the processor being configured to invoke the program instructions to perform the method of any of claims 1-7.
10. A computer readable storage medium, characterized in that the computer readable storage medium stores a computer program comprising program instructions which, when executed by a processor, cause the processor to perform the method of any of claims 1-7.
CN202111260298.6A 2021-10-27 2021-10-27 Information processing method, apparatus, computer device, and storage medium Pending CN116028668A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111260298.6A CN116028668A (en) 2021-10-27 2021-10-27 Information processing method, apparatus, computer device, and storage medium


Publications (1)

Publication Number Publication Date
CN116028668A true CN116028668A (en) 2023-04-28

Family

ID=86074798

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111260298.6A Pending CN116028668A (en) 2021-10-27 2021-10-27 Information processing method, apparatus, computer device, and storage medium

Country Status (1)

Country Link
CN (1) CN116028668A (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200074322A1 (en) * 2018-09-04 2020-03-05 Rovi Guides, Inc. Methods and systems for using machine-learning extracts and semantic graphs to create structured data to drive search, recommendation, and discovery
US20210004602A1 (en) * 2019-07-02 2021-01-07 Baidu Usa Llc Method and apparatus for determining (raw) video materials for news
CN113094549A (en) * 2021-06-10 2021-07-09 智者四海(北京)技术有限公司 Video classification method and device, electronic equipment and storage medium
US20210256051A1 (en) * 2020-02-14 2021-08-19 Beijing Baidu Netcom Science And Technology Co., Ltd. Theme classification method based on multimodality, device, and storage medium


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
CHEN Guang; ZHENG Hongwei: "Integration method of UAV geographic video data in three-dimensional scenes", Geography and Geo-Information Science, no. 01, 15 January 2017 (2017-01-15) *

Similar Documents

Publication Publication Date Title
Ramisa et al. Breakingnews: Article annotation by image and text processing
US20220222920A1 (en) Content processing method and apparatus, computer device, and storage medium
CN111753060B (en) Information retrieval method, apparatus, device and computer readable storage medium
US20210224601A1 (en) Video sequence selection method, computer device, and storage medium
CN112819023B (en) Sample set acquisition method, device, computer equipment and storage medium
Li et al. Residual attention-based LSTM for video captioning
WO2019214453A1 (en) Content sharing system, method, labeling method, server and terminal device
CN113761153B (en) Picture-based question-answering processing method and device, readable medium and electronic equipment
US20230306205A1 (en) System and method for personalized conversational agents travelling through space and time
CN111506596B (en) Information retrieval method, apparatus, computer device and storage medium
Hsu et al. Xiao-Shih: a self-enriched question answering bot with machine learning on Chinese-based MOOCs
CN117271818B (en) Visual question-answering method, system, electronic equipment and storage medium
CN114239730A (en) Cross-modal retrieval method based on neighbor sorting relation
KR20210089249A (en) Voice packet recommendation method, device, equipment and storage medium
CN110580294B (en) Entity fusion method, device, equipment and storage medium
WO2020211397A1 (en) Courseware page display method and apparatus, page set construction method and apparatus, device, and medium
CN116955591A (en) Recommendation language generation method, related device and medium for content recommendation
Murugesan et al. Applying machine learning & knowledge discovery to intelligent agent-based recommendation for online learning systems
CN112749553B (en) Text information processing method and device for video file and server
CN116028668A (en) Information processing method, apparatus, computer device, and storage medium
CN114443916A (en) Supply and demand matching method and system for test data
CN114282528A (en) Keyword extraction method, device, equipment and storage medium
CN115526177A (en) Training of object association models
CN116383426B (en) Visual emotion recognition method, device, equipment and storage medium based on attribute
Asif Experimenting encoder-decoder architecture for visual image captioning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination