CN110519636B - Voice information playing method and device, computer equipment and storage medium - Google Patents

Voice information playing method and device, computer equipment and storage medium

Info

Publication number: CN110519636B
Application number: CN201910831934.2A
Authority: CN (China)
Prior art keywords: video, playing, information, time period, voice
Inventor: 陈姿
Current Assignee: Tencent Technology Shenzhen Co Ltd
Original Assignee: Tencent Technology Shenzhen Co Ltd
Other languages: Chinese (zh)
Other versions: CN110519636A
Legal status: Active (application granted)

Classifications

    • G10L13/08: Speech synthesis; text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G10L15/26: Speech recognition; speech to text systems
    • H04N21/4394: Processing of audio elementary streams involving operations for analysing the audio stream, e.g. detecting features or characteristics in audio streams
    • H04N21/44008: Processing of video elementary streams involving operations for analysing video streams, e.g. detecting features or characteristics in the video stream
    • H04N21/4415: Acquiring end-user identification using biometric characteristics of the user, e.g. by voice recognition or fingerprint scanning
    • H04N21/4753: End-user interface for inputting end-user data for user identification, e.g. by entering a PIN or password
    • H04N21/8456: Structuring of content by decomposing the content in the time domain, e.g. in time segments

Abstract

The embodiment of the invention discloses a voice information playing method and device, computer equipment and a storage medium, and belongs to the field of computer technology. The method comprises the following steps: identifying image information of a video to obtain first playing data, the first playing data comprising text information used for describing the image information; obtaining second playing data according to the first playing data, the second playing data comprising target voice information converted from the text information; and playing the target voice information when a voice playing instruction for the video is received. Because the target voice information is converted from text information that describes the image information, the target voice information can describe the image information of the video. When a user cannot watch the image information normally, the user can learn the content of the image information by listening to the target voice information and thereby obtain the information contained in the images, which increases the amount of information the user obtains and improves video playing efficiency.

Description

Voice information playing method and device, computer equipment and storage medium
Technical Field
The embodiment of the invention relates to the technical field of computers, in particular to a voice information playing method and device, computer equipment and a storage medium.
Background
With the rapid development of computer technology and the wide adoption of smart devices, the video industry has grown rapidly, and watching videos has become a common form of entertainment in users' leisure time that is popular with a large number of users.
A video comprises image information and voice information; when the video is played, the image information is displayed and the voice information is played synchronously, so a user can watch the images while listening to the audio. However, a visually impaired user cannot watch the image information normally and can only listen to the voice information, so the amount of information the user obtains is small, which reduces video playing efficiency.
Disclosure of Invention
The embodiment of the invention provides a voice information playing method and device, computer equipment and a storage medium, which can increase the amount of information carried by the voice information and improve video playing efficiency. The technical solution is as follows:
in one aspect, a method for playing voice information is provided, where the method includes:
identifying image information of a video to obtain first playing data, wherein the first playing data comprise text information used for describing the image information;
acquiring second playing data according to the first playing data, wherein the second playing data comprises target voice information obtained by converting the text information;
and when a voice playing instruction for the video is received, playing the target voice information according to the second playing data.
Optionally, when receiving a voice playing instruction for the video, before playing the target voice information according to the second playing data, the method further includes:
displaying a playing interface of the video, wherein the playing interface comprises a voice playing option, and when a triggering operation on the voice playing option is detected, determining that the voice playing instruction is received; or,
receiving input voice information, and determining to receive the voice playing instruction when the voice information contains a voice playing keyword.
Optionally, the method further comprises:
when the voice playing instruction is received under the condition that the playing interface of the video is displayed, displaying prompt information, wherein the prompt information is used for prompting a user to close the playing interface of the video;
and when a confirmation instruction of the prompt message is received, closing the playing interface.
Optionally, the identifying the image information of the video to obtain the first playing data includes:
identifying a target object in the image information to obtain an object identifier belonging to the target object and a corresponding appearance time period, and taking the object identifier and the corresponding appearance time period as the first playing data;
the acquiring of the second playing data according to the first playing data includes:
and converting the object identifier into target voice information, and taking the target voice information and the corresponding occurrence time period as the second playing data.
Optionally, the target object comprises at least two of a person, a background, or an action; the identifying the target object in the image information to obtain an object identifier belonging to the target object and a corresponding appearance time period, and taking the object identifier and the corresponding appearance time period as the first playing data includes:
identifying at least two of the persons, the backgrounds or the actions in the image information respectively to obtain at least two playing data items, wherein each playing data item comprises an object identifier belonging to the same target object and a corresponding appearance time period;
and combining the object identifiers corresponding to the same appearance time period in the at least two playing data items into text information according to a preset sentence pattern structure, and taking the text information and the corresponding appearance time period as the first playing data.
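To make the sentence-pattern combination above concrete, here is a minimal illustrative sketch in Python. It assumes three already-recognized playing data items (persons, backgrounds, actions), a toy "<person> is in <background>, <action>." template, and helper names that are not part of the claimed method.

```python
from collections import defaultdict

# Illustrative playing data items, one list per recognized object type:
# each entry is (appearance_time_period, object_identifier).
person_items = [((0, 5), "the girl in red"), ((5, 10), "the girl in red")]
background_items = [((0, 5), "a room"), ((5, 10), "a garden")]
action_items = [((0, 5), "gnawing a chicken leg"), ((5, 10), "walking")]

def combine_items(person_items, background_items, action_items):
    """Group object identifiers by appearance time period and fill a preset
    sentence pattern (assumed template: '<person> is in <background>, <action>.')."""
    grouped = defaultdict(dict)
    for period, name in person_items:
        grouped[period]["person"] = name
    for period, name in background_items:
        grouped[period]["background"] = name
    for period, name in action_items:
        grouped[period]["action"] = name

    first_playing_data = []
    for period in sorted(grouped):
        slots = grouped[period]
        text = "{person} is in {background}, {action}.".format(
            person=slots.get("person", "someone"),
            background=slots.get("background", "the scene"),
            action=slots.get("action", "doing something"),
        )
        first_playing_data.append((period, text))
    return first_playing_data

print(combine_items(person_items, background_items, action_items))
```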
Optionally, the identifying the target object in the image information to obtain an object identifier belonging to the target object and a corresponding appearance time period, and taking the object identifier and the corresponding appearance time period as the first playing data includes:
identifying a face in the image information to obtain a face feature and a corresponding appearance time period;
acquiring a person identifier corresponding to the face feature based on a person classification model;
and taking the person identifier corresponding to the face feature and the corresponding appearance time period as the first playing data.
Optionally, the person classification model comprises a plurality of person classification submodels, each person classification submodel having a corresponding person identifier;
the obtaining of the person identifier corresponding to the face feature based on the person classification model includes:
respectively acquiring classification identifiers of the face feature based on the plurality of person classification submodels, wherein each classification identifier is a first identifier or a second identifier, the first identifier indicates that the face feature matches the person corresponding to the person classification submodel, and the second identifier indicates that the face feature does not match the person corresponding to the person classification submodel;
and when the classification identifier obtained based on any person classification submodel is the first identifier, taking the person identifier corresponding to that person classification submodel as the person identifier corresponding to the face feature.
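As an illustration of the submodel lookup just described, the sketch below assumes each person classification submodel is a callable that returns a first identifier (match) or a second identifier (no match) for a face feature vector; real submodels would be trained classifiers, and all names here are hypothetical.

```python
FIRST_ID, SECOND_ID = 1, 0  # match / no match (assumed encoding)

# Hypothetical submodels: each maps a face feature vector to FIRST_ID or SECOND_ID.
def make_submodel(reference_feature, threshold=0.8):
    def submodel(face_feature):
        # Toy similarity: fraction of dimensions that are close to the reference.
        same = sum(1 for a, b in zip(reference_feature, face_feature) if abs(a - b) < 0.1)
        return FIRST_ID if same / len(reference_feature) >= threshold else SECOND_ID
    return submodel

person_submodels = {
    "character_A": make_submodel([0.1, 0.9, 0.3, 0.7]),
    "character_B": make_submodel([0.8, 0.2, 0.6, 0.4]),
}

def identify_person(face_feature):
    """Return the person identifier of the first submodel that reports a match."""
    for person_id, submodel in person_submodels.items():
        if submodel(face_feature) == FIRST_ID:
            return person_id
    return None  # no known person matched

print(identify_person([0.12, 0.88, 0.31, 0.69]))  # -> "character_A"
```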
Optionally, before obtaining the person identifier corresponding to the face feature based on the person classification model, the method further includes:
acquiring a plurality of pieces of person feature information of the video, wherein each piece of person feature information comprises a person identifier and a plurality of face images matching the person identifier;
and training a person classification submodel according to the plurality of face images in each piece of person feature information.
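The training step above can be sketched as a one-vs-rest loop: for each person identifier, that person's samples are positives and all other persons' samples are negatives. The sketch assumes face features have already been extracted from the face images, and it uses scikit-learn's LogisticRegression purely as a stand-in binary classifier; the patent does not prescribe a particular model.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression  # assumed stand-in classifier

# person_feature_info: person identifier -> list of face feature vectors for that person
person_feature_info = {
    "character_A": [np.random.rand(128) for _ in range(20)],
    "character_B": [np.random.rand(128) + 1.0 for _ in range(20)],
}

def train_person_submodels(person_feature_info):
    """Train one binary submodel per person identifier (one-vs-rest)."""
    submodels = {}
    for person_id, positives in person_feature_info.items():
        negatives = [f for other, feats in person_feature_info.items()
                     if other != person_id for f in feats]
        X = np.vstack(positives + negatives)
        y = np.array([1] * len(positives) + [0] * len(negatives))
        submodels[person_id] = LogisticRegression(max_iter=1000).fit(X, y)
    return submodels

submodels = train_person_submodels(person_feature_info)
```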
Optionally, the obtaining second playing data according to the first playing data includes:
acquiring a preset label of the video;
determining a voice generation model corresponding to the preset label in a model database, wherein the model database comprises a plurality of voice generation models and corresponding preset labels;
converting the text information into the target voice information based on the determined voice generation model.
Optionally, the second playing data includes the target voice information and a corresponding time period of occurrence; the playing the target voice message according to the second playing data includes:
and sequentially playing the target voice information corresponding to each occurrence time period according to the sequence of each occurrence time period.
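Playing the clips in order of occurrence time period is essentially a sort followed by sequential playback, as in the sketch below; play_audio is a placeholder rather than the API of any particular audio library, and the data layout is an assumption.

```python
# second_playing_data: list of (occurrence_time_period, target_voice_clip), where a
# period is (start_seconds, end_seconds) and the clip is opaque audio data (assumed layout).
second_playing_data = [
    ((5, 10), "voice clip: 'the girl walks into the garden'"),
    ((0, 5), "voice clip: 'a girl in red gnaws a chicken leg'"),
]

def play_audio(clip, duration):
    # Placeholder for a real audio playback call; a real player would block
    # until the clip finishes.
    print(f"playing for {duration}s: {clip}")

def play_in_order(second_playing_data):
    # Sort by period start so clips are played in the order they appear in the video.
    for (start, end), clip in sorted(second_playing_data, key=lambda item: item[0][0]):
        play_audio(clip, duration=end - start)

play_in_order(second_playing_data)
```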
Optionally, the video includes a first video segment and a second video segment, the first video segment includes image information but not original voice information, and the second video segment includes image information and original voice information;
the first playing data comprises text information used for describing the image information of the first video clip and a corresponding occurrence time period;
the second playing data comprises target voice information obtained by converting the text information and a corresponding occurrence time period, and original voice information of the second video clip and a corresponding occurrence time period;
the playing the target voice message according to the second playing data includes: and sequentially playing the voice information corresponding to each occurrence time period according to the sequence of each occurrence time period.
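For the mixed case just described, a minimal sketch of the second playing data could simply hold both kinds of clips tagged with their occurrence time periods and play them in period order; the structures below are illustrative assumptions only.

```python
def build_playlist(generated_clips, original_clips):
    """Merge generated and original voice clips and order them by period start.

    Each entry is (occurrence_time_period, clip); periods are (start, end) in seconds.
    """
    playlist = list(generated_clips) + list(original_clips)
    playlist.sort(key=lambda item: item[0][0])
    return playlist

# First video segment (no original voice): use generated target voice information.
generated_clips = [((0, 8), "generated voice describing the opening scenery")]
# Second video segment (has original voice): keep the original voice information.
original_clips = [((8, 20), "original character dialogue")]

for (start, end), clip in build_playlist(generated_clips, original_clips):
    print(f"{start:>3}s-{end:>3}s: {clip}")
```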
In another aspect, another method for playing voice information is provided, the method including:
identifying image information of a video to obtain first playing data, wherein the first playing data comprise text information used for describing the image information;
acquiring second playing data according to the first playing data, wherein the second playing data comprises target voice information obtained by converting the text information;
receiving a voice playing instruction of the terminal to the video;
and sending the target voice information to the terminal according to the second playing data so that the terminal can play the target voice information.
Optionally, the second playing data includes the target voice information and a corresponding time period of occurrence; the sending the target voice information to the terminal according to the second playing data comprises:
and sequentially sending the target voice information corresponding to each occurrence time period to the terminal according to the sequence of each occurrence time period so that the terminal can sequentially play the target voice information corresponding to each occurrence time period.
In another aspect, a voice message playing apparatus is provided, the apparatus including:
the identification module is used for identifying image information of a video to obtain first playing data, and the first playing data comprise text information used for describing the image information;
the acquisition module is used for acquiring second playing data according to the first playing data, wherein the second playing data comprises target voice information obtained by converting the text information;
and the playing module is used for playing the target voice information according to the second playing data when receiving a voice playing instruction of the video.
Optionally, the apparatus further comprises:
the display module is used for displaying a playing interface of the video, wherein the playing interface comprises a voice playing option, and when a triggering operation on the voice playing option is detected, it is determined that the voice playing instruction is received; or,
and the receiving module is used for receiving the input voice information, and determining to receive the voice playing instruction when the voice information contains a voice playing keyword.
Optionally, the apparatus further comprises:
the display module is used for displaying prompt information when receiving the voice playing instruction under the condition of displaying the video playing interface, wherein the prompt information is used for prompting a user to close the video playing interface;
and the closing module is used for closing the playing interface when receiving a confirmation instruction of the prompt message.
Optionally, the identification module includes:
the identification unit is used for identifying a target object in the image information to obtain an object identifier belonging to the target object and a corresponding appearance time period, and the object identifier and the corresponding appearance time period are used as the first playing data;
the acquisition module includes:
and the conversion unit is used for converting the object identifier into target voice information and taking the target voice information and the corresponding occurrence time period as the second playing data.
Optionally, the target object includes at least two of a person, a background, or an action, and the identifying unit includes:
the object identification subunit is used for respectively identifying at least two of the people, the backgrounds or the actions in the image information to obtain at least two playing data items, and each playing data item comprises an object identifier belonging to the same target object and a corresponding occurrence time period;
and the combination subunit is used for combining the object identifications corresponding to the same appearance time period in the at least two playing data items according to a preset sentence pattern structure to form text information, and taking the text information and the corresponding appearance time period as the first playing data.
Optionally, the target object includes a person, and the identifying unit includes:
the face identification subunit is used for identifying a face in the image information to obtain a face feature and a corresponding appearance time period;
the acquiring subunit is used for acquiring a person identifier corresponding to the face feature based on a person classification model;
and the determining subunit is configured to use the person identifier corresponding to the face feature and the corresponding occurrence time period as the first playing data.
Optionally, the person classification model comprises a plurality of person classification submodels, each person classification submodel having a corresponding person identifier;
the obtaining subunit is configured to:
respectively acquiring classification identifiers of the face feature based on the plurality of person classification submodels, wherein each classification identifier is a first identifier or a second identifier, the first identifier indicates that the face feature matches the person corresponding to the person classification submodel, and the second identifier indicates that the face feature does not match the person corresponding to the person classification submodel;
and when the classification identifier obtained based on any person classification submodel is the first identifier, taking the person identifier corresponding to that person classification submodel as the person identifier corresponding to the face feature.
Optionally, the apparatus further comprises:
the feature acquisition module is used for acquiring a plurality of pieces of person feature information of the video, wherein each piece of person feature information comprises a person identifier and a plurality of face images matching the person identifier;
and the training module is used for training a person classification submodel according to the plurality of face images in each piece of person feature information.
Optionally, the obtaining module includes:
the preset label acquiring unit is used for acquiring a preset label of the video;
the model determining unit is used for determining a voice generation model corresponding to the preset label in a model database, wherein the model database comprises a plurality of voice generation models and corresponding preset labels;
a conversion unit, configured to convert the text information into the target speech information based on the determined speech generation model.
Optionally, the second playing data includes the target voice information and a corresponding time period of occurrence; the playing module comprises:
and the playing unit is used for sequentially playing the target voice information corresponding to each occurrence time period according to the sequence of each occurrence time period.
Optionally, the video includes a first video segment and a second video segment, the first video segment includes image information but not original voice information, and the second video segment includes image information and original voice information;
the first playing data comprises text information used for describing the image information of the first video clip and a corresponding occurrence time period;
the second playing data comprises target voice information obtained by converting the text information and a corresponding occurrence time period, and original voice information of the second video clip and a corresponding occurrence time period;
the playing module comprises:
and the playing unit is used for sequentially playing the voice information corresponding to each occurrence time period according to the sequence of each occurrence time period.
In another aspect, another apparatus for playing back voice information is provided, the apparatus comprising:
the identification module is used for identifying image information of a video to obtain first playing data, and the first playing data comprise text information used for describing the image information;
the acquisition module is used for acquiring second playing data according to the first playing data, wherein the second playing data comprises target voice information obtained by converting the text information;
the receiving module is used for receiving a voice playing instruction of the terminal to the video;
and the sending module is used for sending the target voice information to the terminal according to the second playing data, so that the terminal can play the target voice information.
Optionally, the second playing data includes the target voice information and a corresponding time period of occurrence; the sending module comprises:
and the sending unit is used for sending the target voice information corresponding to each occurrence time period to the terminal in sequence according to the sequence of each occurrence time period so that the terminal can play the target voice information corresponding to each occurrence time period in sequence.
In another aspect, a computer device is provided, which includes a processor and a memory, where at least one instruction, at least one program, a set of codes, or a set of instructions is stored in the memory, and the instruction, the program, the set of codes, or the set of instructions is loaded and executed by the processor to implement the operations as performed in the voice information playing method.
In yet another aspect, a computer-readable storage medium is provided, in which at least one instruction, at least one program, a set of codes, or a set of instructions is stored, and the instruction, the program, the set of codes, or the set of instructions is loaded and executed by a processor to implement the operations performed in the voice information playing method.
The method, the device, the computer equipment and the storage medium provided by the embodiment of the invention identify image information of a video to obtain first playing data, and obtain second playing data according to the first playing data, wherein the first playing data comprises text information used for describing the image information, and the second playing data comprises target voice information converted from the text information; and when a voice playing instruction for the video is received, the target voice information is played according to the second playing data. Because the target voice information is converted from text information that describes the image information, the target voice information can describe the image information of the video. When a user cannot watch the image information normally, the user can learn the content of the image information by listening to the target voice information and thereby obtain the information contained in the images, which increases the amount of information the user obtains and improves video playing efficiency. When the video is a film, a television series or another movie or television work, a visually impaired user, or a user who is currently in a situation where it is inconvenient to watch the video pictures, can learn the information contained in the pictures of the work by listening to the target voice information, enjoy a rich viewing experience, and feel a stronger sense of immersion.
Moreover, because the user triggers the voice playing instruction when it is inconvenient to watch the image information of the video, the user is not paying attention to the image information at that time. Therefore, closing the playing interface when the voice playing instruction is received avoids unnecessary consumption of the terminal's memory and battery power.
And acquiring a preset label of the video, determining a voice generation model corresponding to the preset label in a model database, wherein the model database comprises a plurality of voice generation models and corresponding preset labels, and converting the text information into target voice information based on the determined voice generation model. The corresponding voice generation model is determined through the preset label of the video, and based on the voice generation model, the style of the generated target voice information is the same as that of the video, so that the target voice information can describe the image information of the video more fully, and a user can understand the content in the image information through the target voice information conveniently.
In addition, input voice information is received, and when the voice information contains a voice playing keyword, it is determined that a voice playing instruction is received. The user can instruct the terminal to play the target voice information simply by speaking, without performing a trigger operation on the display screen, which simplifies the user's operation and improves operability.
In addition, when the video comprises a first video segment and a second video segment, the first video segment comprising image information but no original voice information and the second video segment comprising image information and original voice information, the second playing data comprises the target voice information converted from the text information and its corresponding occurrence time periods, as well as the original voice information of the second video segment and its corresponding occurrence time periods. In this way, the generated target voice information describes the image information of the segment that has no original voice, the original voice information is kept for the segment that has it, and both are played in the order of their occurrence time periods, so the user can follow the whole video by listening alone.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings in the following description are only some embodiments of the present invention, and those skilled in the art can obtain other drawings based on these drawings without creative effort.
FIG. 1 is a schematic diagram of an implementation environment provided by an embodiment of the invention.
Fig. 2 is a flowchart of a method for playing voice information according to an embodiment of the present invention.
Fig. 3 is a flowchart of another method for playing voice information according to an embodiment of the present invention.
Fig. 4 is a flowchart of another method for playing voice information according to an embodiment of the present invention.
FIG. 5 is a flow chart illustrating a process for determining a speech generation model according to an embodiment of the present invention.
Fig. 6 is a schematic diagram of a playing interface according to an embodiment of the present invention.
Fig. 7 is a schematic diagram of another playing interface provided in the embodiment of the present invention.
Fig. 8 is a flowchart of playing video by voice according to an embodiment of the present invention.
Fig. 9 is a flowchart of acquiring second playing data according to an embodiment of the present invention.
Fig. 10 is a flowchart of action recognition according to an embodiment of the present invention.
Fig. 11 is a flowchart of person identification according to an embodiment of the present invention.
Fig. 12 is a flowchart of another person identification method according to an embodiment of the present invention.
Fig. 13 is a flowchart of another person identification method according to an embodiment of the present invention.
Fig. 14 is a schematic structural diagram of a voice information playing apparatus according to an embodiment of the present invention.
Fig. 15 is a schematic structural diagram of another voice information playing apparatus according to an embodiment of the present invention.
Fig. 16 is a schematic structural diagram of another voice information playing apparatus according to an embodiment of the present invention.
Fig. 17 is a schematic structural diagram of another voice information playing apparatus according to an embodiment of the present invention.
Fig. 18 is a schematic structural diagram of a terminal according to an embodiment of the present invention.
Fig. 19 is a schematic structural diagram of a server according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the embodiments of the present invention will be described in further detail with reference to the accompanying drawings.
Artificial Intelligence (AI) is a theory, method, technique and application system that uses a digital computer or a machine controlled by a digital computer to simulate, extend and expand human Intelligence, perceive the environment, acquire knowledge and use the knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive technique of computer science that attempts to understand the essence of intelligence and produce a new intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence is the research of the design principle and the realization method of various intelligent machines, so that the machines have the functions of perception, reasoning and decision making.
Artificial intelligence technology is a comprehensive discipline covering a wide range of fields, including both hardware-level and software-level technologies. Basic artificial intelligence technologies generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operation/interaction systems, and mechatronics. Artificial intelligence software technology mainly includes computer vision, speech processing, natural language processing, and machine learning/deep learning.
The embodiment of the invention provides a voice information playing method, and relates to an artificial intelligent natural language processing technology.
Natural Language Processing (NLP) is an important direction in the fields of computer science and artificial intelligence. It studies theories and methods that enable effective communication between humans and computers in natural language. Natural language processing is a science that integrates linguistics, computer science and mathematics. Research in this field involves natural language, that is, the language people use every day, so it is closely related to the study of linguistics. Natural language processing techniques typically include text processing, semantic understanding, machine translation, question answering, knowledge graphs, and the like. The following describes a voice information playing method provided by an embodiment of the present invention based on natural language processing technology.
Fig. 1 is a schematic diagram of an implementation environment provided by an embodiment of the present invention, where the implementation environment includes a terminal 101 and a server 102.
The terminal 101 may be a mobile phone, a computer, a tablet computer, a smart television, and other types of devices, and the server 102 may be a server, a server cluster composed of a plurality of servers, or a cloud computing service center.
The terminal 101 and the server 102 are connected through a network, and functions such as video playing and the like can be realized through interaction between the terminal 101 and the server 102. The embodiment of the invention provides a voice information playing method, which can convert image information in a video into target voice information capable of describing the image information, and play the video according to a voice form by playing the target voice information.
In one possible implementation, the method is applied in the terminal 101. The terminal 101 may identify image information of a video to obtain first playing data, and obtain second playing data according to the first playing data, where the second playing data includes target voice information, and the target voice information can describe the image information in the video in a voice form. When the terminal 101 receives the voice playing instruction for the video, the target voice information in the second playing data is played according to the second playing data.
In another possible implementation manner, the method is applied to the terminal 101 and the server 102. The server 102 stores one or more videos, and may identify image information of any video to obtain first playing data, and obtain second playing data according to the first playing data, where the second playing data includes target voice information, and the target voice information may describe the image information in the video in a voice form. Then, when receiving a voice playing instruction of the terminal to the video, the server 102 sends the target voice information corresponding to the video to the terminal 101 according to the second playing data, so that the terminal 101 can play the target voice information.
Fig. 2 is a flowchart of a method for playing voice information according to an embodiment of the present invention. The execution subject of the embodiment of the present invention is a terminal, and referring to fig. 2, the method includes:
201. and identifying the image information of the video to obtain first playing data.
Wherein the first playback data includes text information for describing the image information.
The video may be downloaded from a server by the terminal, or captured by the terminal, or may be video from other sources.
202. And acquiring second playing data according to the first playing data.
And the second playing data comprises target voice information converted from the text information.
203. And when a voice playing instruction for the video is received, playing the target voice information according to the second playing data.
In the method provided by the embodiment of the invention, image information of a video is identified to obtain first playing data, and second playing data is obtained according to the first playing data, wherein the first playing data comprises text information used for describing the image information, and the second playing data comprises target voice information converted from the text information; and when a voice playing instruction for the video is received, the target voice information is played according to the second playing data. Because the target voice information is converted from text information that describes the image information, the target voice information can describe the image information of the video. When a user cannot watch the image information normally, the user can learn the content of the image information by listening to the target voice information and thereby obtain the information contained in the images, which increases the amount of information the user obtains and improves video playing efficiency. When the video is a film, a television series or another movie or television work, a visually impaired user, or a user who is currently in a situation where it is inconvenient to watch the video pictures, can learn the information contained in the pictures of the work by listening to the target voice information, enjoy a rich viewing experience, and feel a stronger sense of immersion.
Fig. 3 is a flowchart of another method for playing voice information according to an embodiment of the present invention. The execution subject of the embodiment of the present invention is a server, and referring to fig. 3, the method includes:
301. and identifying the image information of the video to obtain first playing data.
Wherein the first playback data includes text information for describing the image information.
The video can be downloaded from other devices by the server, or uploaded to the server by one or more terminals, or stored in the server by maintenance personnel, or shot by the server through a camera device, or can be videos of other sources.
302. And acquiring second playing data according to the first playing data.
And the second playing data comprises target voice information converted from the text information.
303. And receiving a voice playing instruction of the terminal to the video.
And the voice playing instruction is used for indicating the server to send the target voice information of the video to the terminal.
304. And sending the target voice information to the terminal according to the second playing data so that the terminal can play the target voice information.
In the method provided by the embodiment of the invention, image information of a video is identified to obtain first playing data, and second playing data is obtained according to the first playing data, wherein the first playing data comprises text information used for describing the image information, and the second playing data comprises target voice information converted from the text information; and when a voice playing instruction of the terminal for the video is received, the target voice information is sent to the terminal according to the second playing data so that the terminal can play it. Because the target voice information is converted from text information that describes the image information, the target voice information can describe the image information of the video. When a user cannot watch the image information normally, the user can learn the content of the image information by listening to the target voice information and thereby obtain the information contained in the images, which increases the amount of information the user obtains and improves video playing efficiency. When the video is a film, a television series or another movie or television work, a visually impaired user, or a user who is currently in a situation where it is inconvenient to watch the video pictures, can learn the information contained in the pictures of the work by listening to the target voice information, enjoy a rich viewing experience, and feel a stronger sense of immersion.
Fig. 4 is a flowchart of another method for playing voice information according to an embodiment of the present invention. The interaction subject of the embodiment of the invention is a server and a terminal, referring to fig. 4, the method comprises the following steps:
401. the server identifies the image information of the video to obtain first playing data.
The video can be downloaded from other devices by the server, or uploaded to the server by one or more terminals, or stored in the server by maintenance personnel, or shot by the server through a camera device, or can be videos of other sources.
In addition, the video may be any type of video, such as a movie-type video, an entertainment-news type video, a sports type video, and the like. The video includes image information and original voice information, for example, when the video is a movie, the image information is a movie picture in the movie, and the original voice information is a character dialog, a voice over, background music, and the like in the movie.
The first playing data includes text information for describing the image information. For example, the text information may be descriptive text such as "a girl in red is sitting in a room gnawing a chicken leg".
And identifying the image information of the video to obtain first playing data. Since the text information capable of describing the image information can be obtained by identifying the image information, and the text information is used as the first playing data, the first playing data contains the information amount contained in the image information.
Optionally, the server stores a pre-trained recognition model, and the recognition model is used for processing, analyzing and understanding the image information to recognize various different patterns of targets and objects. The recognition model may be a face recognition model, a background recognition model, a motion recognition model, or the like. The server can identify the image information of the video based on the identification model to obtain first playing data.
Optionally, the video includes a plurality of video frames arranged in time order, each video frame corresponds to an appearance time point, and the appearance time point is represented by the time difference between the playing time point of the video frame and the starting time point of the video. The server acquires the video, extracts a plurality of video frames from it, and, based on the pre-trained recognition model, inputs each video frame into the recognition model for recognition to obtain the text information corresponding to each video frame. The text information corresponding to each video frame and the corresponding appearance time point may be used as the first playing data.
Further, since adjacent video frames are related in content, they can be considered together when the text information is acquired: the video is divided into a plurality of appearance time periods, each appearance time period comprising one or more video frames; the video frames in each appearance time period are identified to obtain a plurality of pieces of text information respectively corresponding to the plurality of appearance time periods; and the obtained text information and the corresponding appearance time periods are used as the first playing data.
Alternatively, since the image information of a plurality of temporally consecutive video frames may be the same, the text information obtained for them is also the same. Therefore, a plurality of consecutive, identical pieces of recognized text information are merged into one piece of text information, which corresponds to an appearance time period spanning from the appearance time point of the first video frame to the appearance time point of the last video frame among those frames. The text information and the corresponding appearance time period are taken as the first playing data.
It should be noted that the appearance time period related to the embodiment of the present invention may include only one appearance time point, or may include a plurality of consecutive appearance time points. The occurrence periods may be 1s, 5s, 10s, etc., and the occurrence periods may be the same or different.
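The merging described above (collapsing consecutive frames whose recognized text is identical into a single appearance time period) can be sketched as follows; recognize_frame stands in for the pre-trained recognition model, and the frame layout is an assumption of this sketch.

```python
def recognize_frame(frame):
    """Stand-in for the pre-trained recognition model; returns descriptive text."""
    return frame["caption"]  # assumed: each toy frame already carries a caption

def build_first_playing_data(frames):
    """frames: list of {'time': appearance_time_point_in_seconds, 'caption': ...},
    ordered by time. Returns [(appearance_time_period, text_information), ...]."""
    first_playing_data = []
    for frame in frames:
        text = recognize_frame(frame)
        if first_playing_data and first_playing_data[-1][1] == text:
            # Same text as the previous frame: extend the current appearance time period.
            (start, _), _ = first_playing_data[-1]
            first_playing_data[-1] = ((start, frame["time"]), text)
        else:
            first_playing_data.append(((frame["time"], frame["time"]), text))
    return first_playing_data

frames = [
    {"time": 0, "caption": "a girl in red gnaws a chicken leg"},
    {"time": 1, "caption": "a girl in red gnaws a chicken leg"},
    {"time": 2, "caption": "the girl walks into a garden"},
]
print(build_first_playing_data(frames))
```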
402. And the server acquires second playing data according to the first playing data and stores the second playing data.
The server converts the text information in the first playing data into the target voice information to obtain the second playing data, and stores the second playing data. Since the text information in the first playing data can describe the image information of the video, the target voice information converted from the text information can also describe the image information of the video in voice form, and the second playing data therefore carries the amount of information contained in the image information. For example, the target voice information may be descriptive speech such as "a girl in red is sitting in a room gnawing a chicken leg".
In one possible implementation, the server has pre-trained speech generation models stored therein. The server inputs the text information in the first playing data into a voice generation model, and generates target voice information corresponding to the text information based on the voice generation model. The Speech generation model is used To convert Text information into Speech information, and may be a TTS (Text To Speech) model or other models.
Optionally, the server creates a model database, the model database including a plurality of speech generating models. Different voice generating models have different parameters such as volume, speed, intonation, tone, timbre, language and the like so as to generate voice information with different characteristics. For example, the generated voice information may be a lively cartoon character voice, a deep male voice, a soft female voice, and the like.
Each voice generation model is provided with a corresponding preset label, the preset label of the voice generation model is used for representing the characteristics of the voice information generated by the voice generation model, and the preset label can be a type corresponding to the voice generation model, a subject of the voice generation model, a character of the voice generation model and the like. For example, the types of the speech generating models may be cartoon, swordsman, science fiction, news, fantasy, etc., the subject matter of the speech generating models may be comedy, tragedy, drama, etc., and the characters of the speech generating models may be movie character a, cartoon character B, etc.
The video also has a preset tag, the preset tag of the video is used for describing the type of the video, and the preset tag can be a video type, a video subject, a video character and the like. For example, the video genres may be cartoons, swordsmen, science fiction, news, fantasy, etc., the video titles may be comedy, tragedy, drama, etc., and the video characters may be movie character a, cartoon character B, etc.
Therefore, the server obtains the preset label of the video, determines the voice generation model corresponding to the preset label in the model database, and converts the text information into the target voice information based on the determined voice generation model, so that the characteristics of the target voice information are matched with the characteristics of the video, that is, the target voice information has the same style as the video, and the target voice information can describe the image information of the video more fully, so that a user can understand the content of the video through the target voice information.
Optionally, a video may have one or more preset labels, and a speech generation model in the model database may also have one or more preset labels. Therefore, when the video has a plurality of preset labels, the server acquires these preset labels, traverses the plurality of speech generation models in the model database, determines, for each speech generation model, how many of its preset labels are the same as the preset labels of the video, and selects the speech generation model with the largest such number. The text information is then input into the selected speech generation model to generate the target voice information.
Further, the server may acquire the preset labels, traverse the speech generation models in the model database, determine from the model database a plurality of speech generation models that contain at least one of the video's preset labels, then traverse these determined speech generation models, determine for each of them how many of its preset labels are the same as the preset labels of the video, and select the speech generation model with the largest such number. The text information is then input into the selected speech generation model to generate the target voice information.
For example, fig. 5 is a schematic flowchart of determining a speech generation model. Referring to fig. 5, the label acquisition module is configured to acquire the plurality of preset labels corresponding to the video, providing data support for the subsequent steps. The recall module is configured to determine, from the model database, a plurality of speech generation models containing at least one of these preset labels, which achieves an initial screening of the speech generation models and to a certain extent determines the operating efficiency and the quality of the recommendation result of the subsequent ranking module. The ranking module is configured to sort the speech generation models determined by the recall module in descending order of the number of matching preset labels, achieving a finer-grained ranking. The result presentation module is configured to take the top-ranked speech generation model from the ranking module and play the target voice information generated by that speech generation model to the user.
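A minimal sketch of the recall and ranking steps, treating the model database as a plain dictionary from model name to its set of preset labels (an assumption of this sketch):

```python
# Assumed model database: model name -> set of preset labels of that speech generation model.
model_database = {
    "lively_cartoon_voice": {"cartoon", "comedy", "cartoon character B"},
    "deep_male_voice": {"science fiction", "drama"},
    "soft_female_voice": {"fantasy", "tragedy"},
}

def select_speech_generation_model(video_labels, model_database):
    video_labels = set(video_labels)
    # Recall: keep only models sharing at least one preset label with the video.
    recalled = {name: labels & video_labels
                for name, labels in model_database.items() if labels & video_labels}
    if not recalled:
        return None
    # Ranking: pick the model with the largest number of shared preset labels.
    return max(recalled, key=lambda name: len(recalled[name]))

print(select_speech_generation_model({"cartoon", "comedy"}, model_database))
# -> "lively_cartoon_voice"
```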
Optionally, in step 401, when the text information and the occurrence time period corresponding to the text information are used as the first playing data, and the occurrence time period corresponding to the text information is the occurrence time period corresponding to the target voice information, the target voice information and the occurrence time period corresponding to the target voice information are used as the second playing data.
Optionally, after the server acquires the second playing data, the second playing data and the video identifier of the video are correspondingly stored. The video identifier can uniquely determine a video, and the video identifier can be a video name, a video number and the like.
It should be noted that the server may store one or more videos, and the video involved in the embodiment of the present invention may be any number of videos of any type in the server. For example, each time the server acquires a new video, it may perform step 401 and step 402 on that video to generate and store the corresponding second playing data.
403. And the terminal displays a video playing interface.
Optionally, a video playing application is run on the terminal, at least one video playing option is set in a main interface of the video playing application, and when a triggering operation of a user on any video playing option is detected, the terminal displays the video playing interface. The triggering operation of the playing option may be a click operation, a long-time press operation, a sliding operation, and the like. Or when the terminal opens the video playing application, a certain video recommended by the server is automatically acquired, and the playing interface of the video is displayed.
The playing interface is used for playing videos. The playing interface comprises a voice playing option, and the voice playing option is used for triggering a voice playing instruction so as to play the video in a voice form. The playing interface can further comprise a progress bar, a pause option, a sharing option, a cache option, an exit option and the like, wherein the progress bar is used for switching the playing progress of the video, the pause option is used for pausing the playing of the video, the sharing option is used for sharing the currently playing video, the cache option is used for caching the currently playing video, and the exit option is used for exiting the current playing interface.
In one possible implementation manner, when the terminal displays the playing interface, the video playing is not started temporarily, and a video playing option and a voice playing option are displayed. The video playing option is used for instructing the terminal to play the video in a video playing mode. The voice playing option is used for instructing the terminal to play the video in a voice playing mode, that is, instructing the terminal to play the target voice information corresponding to the video.
In another possible implementation manner, when the terminal displays the play interface, the default play mode is a video play mode, and therefore the terminal plays a video based on the play interface. And the playing interface comprises a voice playing option which is used for indicating the terminal to switch the video playing mode into the voice playing mode.
404. And when the terminal detects the triggering operation of the voice playing option, determining that a voice playing instruction is received.
Users with visual impairment, or users currently in a scene where watching videos is inconvenient (for example, users doing housework who cannot watch the television picture), cannot easily view the image information of a video and may instead wish to listen to its voice information. In this case, the user may trigger the voice playing option to instruct the terminal to play the video in voice form. When the terminal detects the triggering operation on the voice playing option, it determines that a voice playing instruction is received. The triggering operation may be a click operation, a long-press operation, a sliding operation, and the like; the voice playing instruction carries the video identifier of the video and is used for instructing the terminal to play the target voice information of the video.
In another embodiment, step 404 may be replaced by the following steps: and when the terminal receives the input voice information and the voice information contains a voice playing keyword, determining that a voice playing instruction is received.
When a user wants to play a video in a voice form, voice information containing a voice playing keyword is input. When the terminal receives voice information input by a user, the voice information is converted into text information, the text is segmented to obtain at least one word, and whether the at least one word contains a voice playing keyword or not is detected. And when detecting that the at least one word comprises the voice playing keyword, the terminal determines to receive the voice playing instruction.
The voice playing keywords can be automatically set by the terminal or can be set by the user. For example, the voice playing keyword may be "i want to listen to video", and the like.
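As a rough sketch, the keyword check on the recognized voice input could look like the following; a real implementation would rely on a speech recognition engine and a word segmenter, both omitted here, and the keyword list is only an assumed example.

```python
# Sketch of detecting a voice playing keyword in recognized voice input.
# The keyword list is illustrative; speech recognition and word segmentation
# are assumed to have already produced the text.

VOICE_PLAY_KEYWORDS = ("i want to listen to video", "listen to the video")

def is_voice_play_instruction(recognized_text):
    # Check whether any configured voice playing keyword appears in the text.
    text = recognized_text.lower()
    return any(keyword in text for keyword in VOICE_PLAY_KEYWORDS)

print(is_voice_play_instruction("I want to listen to video, please"))  # True
print(is_voice_play_instruction("Pause the video"))                     # False
```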
When the terminal receives the voice playing instruction, the subsequent steps of the embodiment of the present invention can be executed and the voice playing process is started. In addition, when the terminal receives the voice playing instruction while the video playing interface is displayed, the terminal can also display prompt information, and when a confirmation instruction for the prompt information is received, the playing interface is closed.
Because the terminal plays the target voice information of the video in the voice playing mode, the image information of the video is no longer needed. Therefore, closing the playing interface avoids unnecessary consumption of the terminal's memory and battery power.
The mode of closing the playing interface may be to close the playing interface of the video, return to the main interface of the video playing application, or close the display screen of the terminal, and the like, and the mode may be set by default of the terminal or by a user.
Optionally, when the terminal receives a voice playing instruction for the video, a prompt window is displayed based on the playing interface of the video, and the prompt window includes prompt information, a confirmation option and a denial option. The prompt information is used for prompting the user to trigger the confirmation option or the denial option to choose whether to close the screen; for example, the prompt information may be a prompt text such as "close the screen?". The confirmation option is used for confirming that the display screen of the terminal is to be closed, and the denial option is used for confirming that the display screen of the terminal is not to be closed, that is, the video playing interface is not closed.
And the user triggers a confirmation option based on the prompt window, and when the terminal detects the triggering operation of the confirmation option, the terminal confirms that a confirmation instruction of the prompt information is received and closes the display screen. Or, the user triggers a denial option based on the prompt window, and when the terminal detects the triggering operation of the denial option, the terminal continues to keep the opening state of the display screen and continues to display the video playing interface. The trigger operation may be a click operation, a long-time press operation, a slide operation, or the like.
Optionally, when the terminal receives a voice playing instruction for a video, a prompt window is displayed based on the video playing interface. The prompt window includes prompt information for prompting the user to input voice information containing either a keyword for closing the playing interface or a keyword for not closing the playing interface. For example, the prompt information may be suggestive text such as "you can tell me whether the screen needs to be closed", the keyword for closing the playing interface may be "close the screen", and the keyword for not closing the playing interface may be "do not close the screen".
The user inputs voice information based on the prompt window, when the terminal receives the voice information input by the user, the voice information is converted into text information, the text is segmented to obtain at least one word, and whether the at least one word contains a keyword for closing the playing interface or not is detected. And when detecting that the at least one word comprises a keyword for closing the playing interface, closing the playing interface by the terminal. Or when detecting that the at least one word contains the keyword for not closing the playing interface, the terminal continues to display the playing interface.
Fig. 6 is a schematic diagram of a playing interface, in which the options indicated by arrows are voice playing options. Referring to fig. 7, when the terminal receives a voice playing instruction for a video, a first prompt window and a second prompt window are displayed based on the video playing interface; the first prompt window displays "close the screen?" together with a confirmation option "yes" and a denial option "no", and the second prompt window displays "you can tell me whether to close the screen".
For example, when a user needs to play a video in voice, the operation may be performed according to the flowchart shown in fig. 8. The video listening function is selected by triggering a voice playing option or inputting voice information, at the moment, the terminal is switched to a voice playing mode, original voice information in the video is switched to target voice information of the video so as to play the target voice information, and a user can also select whether to close the screen.
In addition, the terminal can run a voice assistant application, which is an intelligent application and can process the voice information input by the user by adopting a natural language processing technology, so that the user can have a natural conversation with the terminal to realize intelligent interaction. Therefore, the terminal can receive voice information input by a user through the voice assistant application, and when the voice information is determined to include a keyword for closing the playing interface, the playing interface is closed.
405. And the terminal sends a voice playing instruction of the video to the server.
And when the terminal receives the voice playing instruction, sending the voice playing instruction to the server. The voice playing instruction carries a video identifier, and the voice playing instruction is used for indicating the server to send second playing data corresponding to the video identifier to the terminal.
406. And when receiving the voice playing instruction, the server sends second playing data to the terminal.
And the server acquires second playing data corresponding to the stored video identifier according to the video identifier carried by the voice playing instruction, and sends the second playing data to the terminal.
407. The terminal receives the second playing data and plays the target voice information according to the second playing data.
The second playing data comprises target voice information, and the terminal plays the target voice information when receiving the second playing data.
In one possible implementation manner, the second playing data includes at least one target voice message and an occurrence time period corresponding to the at least one target voice message. In step 406, the server obtains the stored second playing data corresponding to the video identifier according to the video identifier carried in the voice playing instruction, and sequentially sends the target voice information in the second playing data to the terminal according to the sequence of the occurrence time periods corresponding to the target voice information. Correspondingly, in step 407, the terminal receives the corresponding target voice information at each occurrence time period, and then sequentially plays the received target voice information according to the sequence of the received at least one target voice information, thereby implementing online playing of the target voice information.
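A minimal sketch of this ordered playback is shown below; the tuple layout of the second playing data and the play_clip helper are assumptions made for illustration.

```python
# Sketch of playing target voice information in the order of its occurrence
# time periods. Each entry is assumed to be (start_second, end_second, clip);
# play_clip stands in for the terminal's audio playback routine.

def play_in_order(second_playing_data, play_clip):
    # Sort by the start of each occurrence time period, then play sequentially
    # so the narration follows the video timeline.
    for start, end, clip in sorted(second_playing_data, key=lambda e: e[0]):
        play_clip(clip, duration=end - start)

play_in_order(
    [(10, 15, "target voice b"), (0, 5, "target voice a")],
    play_clip=lambda clip, duration: print(f"playing {clip} for {duration}s"),
)
```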
In another possible implementation, the video includes a first video segment and a second video segment, the first video segment includes image information but not original voice information, and the second video segment includes image information and original voice information. For example, the original voice information may be character voice-over, character dialog, background music, etc. in a video clip.
Therefore, the server may perform the above steps 401 and 402 for the first video segment in the video. The acquired first playing data comprise text information used for describing image information of the first video clip and corresponding appearance time periods, and the second playing data comprise target voice information obtained by converting the text information and corresponding appearance time periods.
And aiming at the second video clip in the video, the server can directly acquire the original voice information in the second video clip, and the original voice information also has a corresponding occurrence time period. Therefore, the original voice information and the corresponding time period of occurrence are also used as the second playing data.
That is, the second playing data includes the target voice information converted from the text information and the corresponding occurrence time period, and the original voice information of the second video segment and the corresponding occurrence time period.
Correspondingly, the server sequentially sends the voice information corresponding to each occurrence time period to the terminal according to the sequence of the occurrence time periods corresponding to the target voice information and the original voice information. The terminal plays the received voice information in sequence according to the sequence of the received voice information, and smooth connection between the target voice information and the original voice information is achieved.
For example, the video includes a video clip a, a video clip B, and a video clip C arranged in order, and the three video clips respectively correspond to an occurrence time period. Wherein, the video clip B includes the original voice information y, the server obtains the target voice information x of the video clip a and the target voice information z of the video clip C through the above steps 401 and 402, and the second playing data is as shown in table 1.
TABLE 1
Video clip A Target speech information x Epoch 1
Video clip B Original speech information y Epoch 2
Video clip C Target speech information z Epoch 3
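A small sketch of assembling this mixed second playing data is given below; the dictionary layout mirrors Table 1 and is an assumption made for the example.

```python
# Sketch of merging target voice information (for segments without original
# voice) and original voice information (for segments that have it) into one
# time-ordered second playing data list, as in Table 1.

target_voice = [
    {"period": 1, "voice": "target voice information x"},    # video clip A
    {"period": 3, "voice": "target voice information z"},    # video clip C
]
original_voice = [
    {"period": 2, "voice": "original voice information y"},  # video clip B
]

# Order all entries by occurrence time period so playback switches seamlessly
# between generated narration and the original audio.
second_playing_data = sorted(target_voice + original_voice, key=lambda e: e["period"])
for entry in second_playing_data:
    print(entry["period"], entry["voice"])
```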
It should be noted that this embodiment only takes as an example the case where, when the server receives the voice playing instruction, the second playing data is sent to the terminal online so that the terminal plays the second playing data. In another embodiment, the server may also send the video and the second playing data of the video to the terminal in advance, and the terminal stores the video and the second playing data correspondingly. When the terminal receives a voice playing instruction for the video, it acquires the second playing data corresponding to the video and plays the target voice information according to the second playing data.
For example, the server determines a video to be recommended for the terminal by using a recommendation algorithm, acquires second playing data of the video, and sends the video and the second playing data to the terminal for the terminal to play.
It should be noted that, in this embodiment, the terminal executes step 403, step 404, step 405, and step 407 only as an example. In another embodiment, the terminal is installed with a video playing application of a third party, or the terminal is provided with a video playing application, and the video playing application executes the steps 403, 404, 405, and 407.
According to the method provided by the embodiment of the present invention, image information of a video is identified to obtain first playing data, and second playing data is obtained according to the first playing data, where the first playing data includes text information used for describing the image information and the second playing data includes target voice information obtained by converting the text information; when a voice playing instruction for the video is received, the target voice information is played according to the second playing data. Because the target voice information is obtained by converting text information that describes the image information, the target voice information can describe the image information of the video. When a user cannot normally watch the image information, the user can still learn its content by listening to the target voice information and thus obtain the amount of information contained in the image information, which increases the amount of information obtained by the user and improves video playing efficiency. When the video is a film or television work such as a movie or a television series, a vision-impaired user, or a user currently in a scene where watching the video pictures is inconvenient, can learn the information content of the pictures by listening to the target voice information, enjoy a rich viewing experience of the work, and gain a stronger sense of immersion.
Moreover, since the user triggers the voice playing instruction when it is inconvenient to watch the image information of the video, the image information is not needed at that moment. Therefore, closing the playing interface when the voice playing instruction is received avoids unnecessary consumption of the terminal's memory and battery power.
Furthermore, a preset tag of the video is acquired, a voice generation model corresponding to the preset tag is determined in a model database that includes a plurality of voice generation models and corresponding preset tags, and the text information is converted into the target voice information based on the determined voice generation model. Because the voice generation model is determined through the preset tag of the video, the style of the generated target voice information matches that of the video, so the target voice information can describe the image information of the video more fully and the user can more easily understand the content of the image information through the target voice information.
And by receiving the input voice information, when the voice information contains a voice playing keyword, determining that a voice playing instruction is received. The user can instruct the terminal to play the target voice information by inputting the voice information without triggering operation on a display screen, so that the operation of the user is simplified, and the operability is improved.
And when the video comprises a first video segment and a second video segment, the first video segment comprises image information but not original voice information, and the second video segment comprises image information and original voice information, the second playing data comprises target voice information obtained by converting text information and corresponding occurrence time periods, and the original voice information of the second video segment and corresponding occurrence time periods.
On the basis of the above embodiment, in one possible implementation manner, step 401 and step 402 may include:
and identifying the target object in the image information of the video to obtain an object identifier belonging to the target object and a corresponding appearance time period, using the object identifier as first playing data, converting the object identifier in the first playing data into target voice information, and using the target voice information and the corresponding appearance time period as second playing data.
The target object refers to a person, a background, an action, an article, a text and the like included in the image information, the object identifier is used for uniquely determining one target object, for example, the object identifier may be a person identifier, a background identifier, an action identifier, an article identifier, a text identifier and the like, the person identifier may be a person name, a person number and the like, the background identifier may be a background name, a background number and the like, the action identifier may be an action name, an action number and the like, the article identifier may be an article name, an article number and the like, and the text identifier may be a text name, a text number and the like.
Optionally, the target object includes at least two of a person, a background, or an action, at least two of the person, the background, or the action in the image information are respectively recognized to obtain at least two play data items, object identifiers corresponding to the same occurrence time period in the at least two play data items are combined according to a preset sentence structure to form text information, and the text information and the corresponding occurrence time period are used as first play data.
Each playing data item comprises object identifications belonging to the same target object and corresponding appearance time periods. For example, the play data items may be as shown in table 2.
TABLE 2 (provided as an image in the original document; each row lists a play data item, i.e. an object identifier and its corresponding appearance time period)
The preset sentence structure may be a structure in which the subject, predicate and object are combined in an order conforming to Chinese grammar. For example, text information may be formed by taking the person identifier among the object identifiers as the subject, the action identifier as the predicate, and the background identifier as the object, and combining them according to a sentence structure of "subject + object + predicate".
Referring to fig. 9, in a case that the image information includes three items of a person, a background, and an action, the process of identifying the image information of the video to obtain first play data and obtaining second play data according to the first play data may include:
901. the server identifies the person in the image information to obtain the person identification and the corresponding appearance time period, and the person identification and the corresponding appearance time period are used as the playing data item of the first playing data.
Optionally, the server stores a pre-trained face recognition model, and the face recognition model is used for processing, analyzing and understanding image information to recognize various faces. Then, based on the face recognition model, the person in the image information can be recognized to obtain the person identification.
Optionally, the server obtains a video to be processed, extracts a plurality of video frames from the video, and identifies people in each video frame based on the face recognition model to identify various different people in the image information, so as to obtain the people identification in the video frame.
The adjacent video frames are associated in content and can be considered together when the text information is acquired, so that the video is divided into a plurality of appearance time periods, each appearance time period comprises one or more video frames, the person in the video frame in each appearance time period is identified, and the person identification corresponding to each of the appearance time periods can be obtained and used as the playing data item of the first playing data.
Alternatively, each video frame in the video has a corresponding occurrence time point, and because the image information of a plurality of temporally consecutive video frames may be the same, the identified person identifier is also the same for those frames. In that case, the same person identifier corresponds to a plurality of video frames, that is, to the appearance time period covering those video frames, and the person identifier and the corresponding appearance time period obtained by identification are used as a play data item of the first play data.
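The merging of consecutive identical recognition results into one appearance time period can be sketched as follows; the per-frame timestamps and identifiers are illustrative.

```python
# Sketch of collapsing per-frame recognition results into appearance time
# periods: consecutive frames with the same identifier form one period spanning
# from the first to the last of those frames. Timestamps are in seconds.

def merge_into_periods(frame_results):
    """frame_results: list of (timestamp, identifier), ordered by time."""
    periods = []
    for timestamp, identifier in frame_results:
        if periods and periods[-1]["identifier"] == identifier:
            periods[-1]["end"] = timestamp  # extend the current appearance time period
        else:
            periods.append({"identifier": identifier, "start": timestamp, "end": timestamp})
    return periods

frames = [(0.0, "character A"), (0.5, "character A"), (1.0, "character B"), (1.5, "character B")]
print(merge_into_periods(frames))
# [{'identifier': 'character A', 'start': 0.0, 'end': 0.5},
#  {'identifier': 'character B', 'start': 1.0, 'end': 1.5}]
```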
902. The server identifies the background in the image information to obtain a background identifier and a corresponding appearance time period, and the background identifier and the corresponding appearance time period are also used as the playing data item of the first playing data.
Optionally, the server stores a pre-trained background recognition model, and the background recognition model is used for processing, analyzing and understanding the image information to recognize various different backgrounds. Wherein, the background can be mountains, fields, rooms, railway stations, and the like.
Optionally, the server obtains a video to be processed, extracts a plurality of video frames from the video, inputs the video frames into a background recognition model, and recognizes image information of the video frames based on the background recognition model to obtain a background identifier in the video frames, wherein the background identifier is used for uniquely determining a background. The context identifier may be a context name, a context code, etc.
Similar to the step 901, the server will also use the background identifier and the corresponding occurrence time period as the playing data item of the first playing data, and the specific process is not described herein again.
903. The server identifies the action in the image information to obtain an action identifier and a corresponding occurrence time period, and the action identifier and the corresponding occurrence time period are also used as the playing data item of the first playing data.
Optionally, the server stores a pre-trained motion recognition model, and the motion recognition model is used for processing, analyzing and understanding the image information to recognize various different motions. Wherein the action may be running, sitting, jumping, eating, etc.
Optionally, the server obtains a video to be processed, extracts a plurality of video frames from the video, inputs the video frames into the action recognition model, and recognizes image information of the video frames based on the action recognition model to obtain an action identifier in the video frames, where the action identifier is used to uniquely determine an action. The action identifier may be an action name, an action code number, etc.
Fig. 10 is a flowchart of action recognition according to an embodiment of the present invention, where a server acquires a video to be processed, performs feature extraction on each video frame in the video to obtain an action feature of each video frame, and recognizes the action feature of each video frame to obtain an action identifier of each video frame.
Similar to the step 901, the server will also use the action identifier and the corresponding occurrence time period as the playing data item of the first playing data, and the specific process is not described herein again.
It should be noted that, this embodiment is only taken as an example to be executed in the order of step 901, step 902, and step 903, and in another embodiment, the server may execute at least two of step 901, step 902, and step 903, and the execution order is not limited.
904. The server combines the object identifications corresponding to the same appearing time period in the three playing data items according to a preset sentence pattern structure to form text information, and takes the text information and the corresponding appearing time period as first playing data.
In the three playing data items, each playing data item includes an object identifier and a corresponding time period of occurrence, and the object identifier may include a character identifier, a background identifier, an action identifier, and the like.
The server acquires the character identification, the background identification and the action identification corresponding to the same appearing time period, combines the character identification, the background identification and the action identification according to a preset sentence pattern structure to form text information, and takes the text information and the corresponding appearing time period as first playing data. The text information may describe a person, a background, and an action of the image information.
The preset sentence structure may be a structure in which the subject, the predicate, and the object are combined in an order that conforms to the grammar of chinese.
For example, the preset sentence structure takes the person identifier among the object identifiers as the subject, the action identifier as the predicate, and the background identifier as the object. Taking a person identifier of "a girl in red clothes", a background identifier of "room" and an action identifier of "sitting and gnawing a drumstick" as an example, the person identifier, the background identifier and the action identifier are combined according to the sentence structure of "subject + object + predicate", and the obtained text information is "a girl in red clothes is sitting in the room gnawing a drumstick".
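The combination step can be sketched as below; the sentence template and the English wording of the identifiers are assumptions, since the patent describes the structure in terms of Chinese grammar.

```python
# Sketch of composing text information from the object identifiers that share
# the same appearance time period. The English template is only illustrative of
# the "subject + object + predicate" structure described in the text.

def compose_text(person_id, background_id, action_id):
    # person identifier as subject, background identifier as the place (object),
    # action identifier as the predicate.
    return f"{person_id} is {action_id} in the {background_id}"

text_information = compose_text(
    person_id="a girl in red clothes",
    background_id="room",
    action_id="sitting and gnawing a drumstick",
)
print(text_information)
# a girl in red clothes is sitting and gnawing a drumstick in the room
```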
Note that, when only one of step 901, step 902, and step 903 is executed, the obtained play data item is directly used as the first play data. For example, only step 901 is executed, that is, only when the person is identified, the person identifier and the corresponding appearing time period are obtained as the first playing data.
905. The server converts the text information into target voice information, and takes the target voice information and the corresponding occurrence time period as second playing data.
Step 905 is similar to step 402, and is not described in detail herein.
Steps 403 to 407 are then performed in the same manner as described above, and are not described herein again.
In the method provided by the embodiment of the present invention, the text information is formed by combining, according to the preset sentence structure, the object identifiers corresponding to the same appearance time period in the three playing data items, and the text information and the corresponding appearance time period are used as the first playing data. Because the text information combines different object identifiers, it can describe the image information more fully and provide a richer amount of information, which further improves video playing efficiency.
Fig. 11 is a flowchart of character recognition according to an embodiment of the present invention, fig. 12 is a flowchart of character recognition based on a character segmentation model according to an embodiment of the present invention, and fig. 13 is a flowchart of character recognition taking a target object as a character a as an example. Referring to fig. 11-13, in one possible implementation, the above step 901 may include:
1101. the server acquires a plurality of pieces of character characteristic information of the video, wherein each piece of character characteristic information comprises a character identification and a plurality of face images matched with the character identification.
The image information of the video includes a plurality of characters, and in order to recognize each character and acquire the text information, a character classification model needs to be trained in advance. Thus, a video provider may upload pieces of character feature information for the video to the server, and the server trains the character classification model based on the pieces of character feature information. The video provider may be any user who publishes the video, a user who views the video, a maintenance person of the server, or the like.
For example, the video is a movie, and the video producer obtains the character name of each character in the video and a plurality of face images of each character and uploads the face images to the server.
1102. The server trains a character classification submodel according to the plurality of face images in each piece of character feature information.
For each piece of character feature information, the server obtains the plurality of face images in that character feature information, performs face feature extraction on the face images to obtain sample face features, and trains a character classification submodel according to the sample face features and the character identification corresponding to the sample face features. One character classification submodel corresponds to one piece of character feature information, that is, to one character identification. Finally, a plurality of character classification submodels are trained according to the feature information of each character.
The face feature extraction is used for extracting the face features of the face image, and can be regarded as a process for performing feature modeling on a face. The face features are generally classified into visual features, pixel statistical features, face image transform coefficient features, face image algebraic features, and the like. The algorithm for extracting the face features can be as follows: face feature point-based algorithms, illumination estimation model-based algorithms, deep learning-based algorithms, template-based algorithms, and the like.
For example, the face feature extraction algorithm may be a CNN (Convolutional Neural Network) based on deep learning. A CNN is an artificial neural network based on deep learning theory that uses weight sharing to mitigate the parameter explosion problem of ordinary neural networks. In the forward computation, a convolution kernel performs a convolution operation on the input data, and the result is passed through a nonlinear function as the output of the layer; such a layer is called a convolutional layer. Downsampling layers are inserted between convolutional layers to obtain invariance of local features and to reduce the scale of the feature space. Typically, the convolutional and downsampling layers are followed by a fully connected neural network for face recognition. Other feature extraction algorithms may also be used.
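A toy PyTorch sketch of such a structure (convolutional layers interleaved with downsampling layers, followed by a fully connected layer) is shown below; the layer sizes and the 64x64 input resolution are arbitrary assumptions, not values from the patent.

```python
# Illustrative CNN for face feature extraction: convolutional layers with
# downsampling (pooling) layers in between, followed by a fully connected
# layer. All dimensions are placeholder assumptions.

import torch
import torch.nn as nn

class FaceFeatureCNN(nn.Module):
    def __init__(self, feature_dim=128):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),   # downsampling layer
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),   # downsampling layer
        )
        self.fc = nn.Linear(32 * 16 * 16, feature_dim)  # assumes 64x64 input crops

    def forward(self, x):
        return self.fc(self.features(x).flatten(1))

model = FaceFeatureCNN()
dummy_faces = torch.randn(2, 3, 64, 64)  # two dummy 64x64 RGB face crops
print(model(dummy_faces).shape)          # torch.Size([2, 128])
```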
In addition, when the character classification submodel is trained according to the sample face features and the character identifications corresponding to the sample face features, the adopted training algorithm can be a CNN algorithm, a recurrent neural network algorithm, a deep learning network algorithm and other algorithms.
Optionally, the server obtains the plurality of face images in a piece of person feature information, preprocesses the face images, identifies face regions in the face images, cuts out the face regions, and then performs feature extraction on the cut-out face regions, so as to obtain the sample face features of each person.
Preprocessing the face images facilitates the subsequent feature extraction. Because the face images acquired by the server are limited by various conditions and subject to random interference, they often cannot be used directly; preprocessing may therefore be performed first. The preprocessing includes light compensation, gray level transformation, histogram equalization, normalization, geometric correction, filtering, sharpening and the like of the face image, which improves the quality of the image and facilitates subsequent processing.
Face image cutting means cutting the face region out of a face image. Because a face image may include both a face region and a non-face region, the face region needs to be cut out to facilitate subsequent feature extraction and reduce the amount of computation.
Referring to the flowchart shown in fig. 12, the process of training the character classification model may be as follows: the server performs picture preprocessing, face recognition and cutting, and feature extraction on the face images corresponding to a plurality of character identifications to obtain the sample face features corresponding to each face image, and trains the character classification submodel corresponding to each character identification according to the obtained sample face features, thereby obtaining character classification submodels corresponding to the plurality of character identifications; the obtained character classification submodels constitute the character classification model.
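The per-character training step can be sketched as follows. For brevity, a logistic-regression classifier stands in for the CNN-based submodel described above, the face features are random placeholder vectors, and the use of other features as negative samples is an added assumption needed to train a binary classifier.

```python
# Sketch of training one binary classification submodel per character
# identification from sample face features. Logistic regression stands in for
# the CNN-based submodel; the features are random placeholders.

import numpy as np
from sklearn.linear_model import LogisticRegression

def train_character_submodels(character_features, negative_features):
    """character_features: {character identification: (n, d) array of sample face features}."""
    submodels = {}
    for character_id, positives in character_features.items():
        X = np.vstack([positives, negative_features])
        y = np.hstack([np.ones(len(positives)), np.zeros(len(negative_features))])
        # One submodel per character identification: output 1 = match, 0 = no match.
        submodels[character_id] = LogisticRegression(max_iter=1000).fit(X, y)
    return submodels

rng = np.random.default_rng(0)
features = {
    "character A": rng.normal(1.0, 0.1, (20, 8)),
    "character B": rng.normal(-1.0, 0.1, (20, 8)),
}
submodels = train_character_submodels(features, negative_features=rng.normal(0.0, 0.1, (20, 8)))
print(sorted(submodels))  # ['character A', 'character B']
```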
1103. And the server identifies the face in the image information of the video to obtain the face characteristics and the corresponding appearance time period.
Optionally, the server obtains a video to be processed, extracts a plurality of video frames from the video, and performs feature extraction on the face in each video frame to obtain the face feature of each person in the video frame. Correspondingly, the appearance time period corresponding to the face feature may also be obtained, and the obtaining mode is similar to the obtaining mode of the appearance time period corresponding to the target voice information, and is not described herein again.
Optionally, when the server obtains the face features of each person in each video frame, the server may perform operations such as preprocessing, face recognition and segmentation, feature extraction, and the like on the video frames, and the specific steps are similar to the step 1102 and are not described herein again.
1104. The server respectively obtains the classification identification of the face features based on the plurality of character classification submodels.
And inputting the human face characteristics into any character classification submodel to obtain the classification identification of the human face characteristics under the character classification submodel.
Each character classification submodel corresponds to a character identifier, and the classification identifier comprises a first identifier or a second identifier. The first identification represents that the face features are matched with the figures corresponding to the figure classification submodels, and the second identification represents that the face features are not matched with the figures corresponding to the figure classification submodels.
For example, the first flag is "1" and the second flag is "0". When the human face features are input into the character classification submodel to obtain '1', the human face features are matched with the character identification corresponding to the character classification submodel; and when the human face features are input into the character classification submodel to obtain '0', the human face features are indicated to be not matched with the character identification corresponding to the character classification submodel, and the human face features are input into other character classification submodels for detection.
1105. And when the classification identification acquired based on any character classification submodel is a first identification, the server takes the character identification corresponding to any character classification submodel as the character identification corresponding to the human face characteristic.
The face features are input into each character classification submodel to obtain the classification identification of the face features. When the classification identification acquired by any character classification submodel is the first identification, the face features match the character corresponding to that character classification submodel. Therefore, the character identification corresponding to that character classification submodel is used as the character identification corresponding to the face features.
The process of identifying the person can refer to a flow chart shown in fig. 12, and the server performs operations of picture preprocessing, face identification and cutting, and feature extraction on each video frame in the video to obtain the face feature of each person in each video frame. And inputting the acquired face features into a character classification model, namely respectively inputting the acquired face features into each character classification submodel to obtain classification identifications of the face features. And when the classification identification acquired by any character classification submodel is the first identification, taking the character identification corresponding to the character classification submodel as the character identification corresponding to the face characteristic, and finishing the matching process of the face characteristic and the character classification submodel.
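Steps 1104 and 1105 can be sketched as below; the threshold-based stand-in submodel is purely illustrative and takes the place of the trained classification submodels.

```python
# Sketch of steps 1104-1105: run a face feature through each character
# classification submodel and take the character identification of the first
# submodel whose classification identification is the first identification ("1").

class StubSubmodel:
    """Stand-in for a trained character classification submodel."""
    def __init__(self, center):
        self.center = center

    def predict(self, feature):
        # 1 = first identification (match), 0 = second identification (no match).
        return 1 if abs(feature - self.center) < 0.5 else 0

def identify_character(face_feature, submodels):
    for character_id, submodel in submodels.items():
        if submodel.predict(face_feature) == 1:
            return character_id
    return None  # no character classification submodel matched this face feature

submodels = {"character A": StubSubmodel(1.0), "character B": StubSubmodel(-1.0)}
print(identify_character(0.9, submodels))   # character A
print(identify_character(-1.1, submodels))  # character B
```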
It should be noted that, in the embodiment of the present invention, the process of the person identification is only described in the above steps 1101-1105, but in another embodiment, the adopted person classification model may also be other types of models, such as a deep learning model, a convolutional neural network model, and the like, and it is only required to ensure that the person classification model can determine the person identifier corresponding to the face feature according to the face feature.
1106. And the server takes the figure identification corresponding to the face features and the corresponding appearance time period as first playing data.
Because the character identification corresponds to the face feature, the appearance time period corresponding to the character identification is the same as the appearance time period corresponding to the face feature. And taking the character identification and the corresponding appearance time period as first playing data.
Taking the case where the target object is a character A in a movie as an example, and referring to the character recognition flowchart shown in fig. 13, an operator submits the movie to be detected and a plurality of pictures of character A in the movie to the server, and the server trains a character classification submodel of character A according to the plurality of pictures of character A. The server then acquires the facial features and the corresponding appearance time periods in the movie, and judges whether each facial feature in the movie belongs to character A based on the character classification submodel of character A, thereby finding out the appearance time periods of character A in the movie and storing them for subsequent use.
In this embodiment, the process of recognizing the target object in the image information will be described by taking only the target object as a person as an example. When the target object is a background, an action, etc., the process of identifying the target object in the image information is similar to the above-mentioned step 1101-1106, and is not described herein again. The difference is that the sample images used are different. The sample images used when the character classification model is trained include faces, the sample images used when the background recognition model is trained include backgrounds, and the sample images used when the motion recognition model is trained include motions.
The method provided by the embodiment of the invention acquires the face characteristics by identifying the face in the image information, acquires the figure identification corresponding to the face characteristics by the figure division submodel, and takes the figure identification and the corresponding occurrence time period as the first playing data. Therefore, the person included in the corresponding image information can be known through the person identifier in the first playing data.
Moreover, because the video frames are subjected to operations such as preprocessing, face recognition and cutting, and feature extraction, the obtained face features are more accurate, which improves the accuracy of subsequent face recognition, removes irrelevant information from the video frames, and reduces the amount of computation.
Fig. 14 is a schematic structural diagram of a voice information playing apparatus according to an embodiment of the present invention. Referring to fig. 14, the apparatus includes:
an identifying module 1401, configured to identify image information of a video to obtain first playing data, where the first playing data includes text information used for describing the image information;
an obtaining module 1402, configured to obtain second playing data according to the first playing data, where the second playing data includes target voice information obtained by converting text information;
the playing module 1403 is configured to play the target voice information according to the second playing data when receiving the voice playing instruction for the video.
Optionally, referring to fig. 15, the apparatus further comprises:
the display module 1404 is configured to display a video playing interface, where the playing interface includes a voice playing option, and when a triggering operation on the voice playing option is detected, it is determined that a voice playing instruction is received; or,
the receiving module 1405 is configured to receive the input voice information, and determine that a voice playing instruction is received when the voice information includes a voice playing keyword.
Optionally, referring to fig. 15, the apparatus further comprises:
the display module 1406 is configured to display prompt information when receiving a voice playing instruction in a case that a video playing interface is displayed, where the prompt information is used to prompt a user to close the video playing interface;
a closing module 1407, configured to close the playing interface when the confirmation instruction for the prompt information is received.
Alternatively, referring to fig. 15, the identification module 1401 comprises:
the identification unit 1411 is configured to identify a target object in the image information, obtain an object identifier belonging to the target object and a corresponding occurrence time period, and use the object identifier and the corresponding occurrence time period as first playing data;
an acquisition module 1402, comprising:
a converting unit 1412, configured to convert the object identifier into target voice information, and use the target voice information and the corresponding occurrence time period as second playing data.
Alternatively, referring to fig. 15, the target object includes at least two of a person, a background, or an action, and the identifying unit 1411 includes:
an object identification subunit 14111, configured to respectively identify at least two of a person, a background, or an action in the image information, to obtain at least two playing data items, where each playing data item includes an object identifier and a corresponding occurrence time period that belong to a same target object;
a combining subunit 14112, configured to combine, in at least two playing data items, object identifiers corresponding to the same occurrence time period according to a preset sentence structure to form text information, and use the text information and the corresponding occurrence time period as the first playing data.
Alternatively, referring to fig. 15, the target object includes a person, and the identifying unit 1411 includes:
a face recognition subunit 14113, configured to recognize a face in the image information to obtain a face feature and a corresponding occurrence time period;
an obtaining subunit 14114, configured to obtain, based on the person classification model, a person identifier corresponding to a face feature;
the determining subunit 14115 is configured to use the person identifier corresponding to the facial feature and the corresponding occurrence time period as the first playing data.
Optionally, referring to fig. 15, the character classification model comprises a plurality of character classification submodels, each character classification submodel having a corresponding character identification;
an acquisition subunit 14114 to:
respectively acquiring classification identifications of the face features based on a plurality of character classification submodels, wherein the classification identifications comprise first identifications or second identifications, the first identifications represent that the face features are matched with characters corresponding to the character classification submodels, and the second identifications represent that the face features are not matched with the characters corresponding to the character classification submodels;
and when the classification identification obtained based on any character classification submodel is the first identification, taking the character identification corresponding to any character classification submodel as the character identification corresponding to the human face characteristic.
Optionally, referring to fig. 15, the apparatus further comprises:
the feature acquisition module 1408 is configured to acquire a plurality of pieces of person feature information of the video, where each piece of person feature information includes a person identifier and a plurality of face images matched with the person identifier;
the training module 1409 is configured to train a character classification submodel according to the plurality of face images in each piece of character feature information.
Optionally, referring to fig. 15, the obtaining module 1402 includes:
a tag obtaining unit 1422, configured to obtain a preset tag of the video;
a model determining unit 1432, configured to determine a speech generation model corresponding to a preset tag in a model database, where the model database includes a plurality of speech generation models and corresponding tags;
a converting unit 1412, configured to convert the text information into the target speech information based on the determined speech generation model.
Optionally, referring to fig. 15, the second playing data includes the target voice information and the corresponding time period of occurrence; the play module 1403 includes:
the playing unit 1413 is configured to sequentially play the target voice information corresponding to each occurrence time period according to the sequence of each occurrence time period.
Alternatively, referring to fig. 15, the video includes a first video clip and a second video clip, the first video clip includes image information but does not include original voice information, and the second video clip includes image information and original voice information;
the first playing data comprises text information used for describing image information of the first video clip and a corresponding occurrence time period;
the second playing data comprises target voice information obtained by text information conversion and a corresponding occurrence time period, and original voice information of the second video clip and a corresponding occurrence time period;
the play module 1403 includes:
the playing unit 1413 is configured to sequentially play the voice information corresponding to each occurrence time period according to the sequence of each occurrence time period.
Fig. 16 is a schematic structural diagram of another audio information playback device according to an embodiment of the present invention. Referring to fig. 16, the apparatus includes:
an identifying module 1601, configured to identify image information of a video to obtain first playing data, where the first playing data includes text information used for describing the image information;
an obtaining module 1602, configured to obtain second playing data according to the first playing data, where the second playing data includes target voice information obtained by converting text information;
a receiving module 1603, configured to receive a voice playing instruction of the terminal to the video;
the sending module 1604 is configured to send the target voice information to the terminal according to the second playing data, so that the terminal plays the target voice information.
Optionally, referring to fig. 17, the second playing data includes the target voice information and the corresponding time period of occurrence; a sending module 1604 comprising:
a sending unit 1614, configured to send, to the terminal in sequence according to the sequence of each occurrence time period, the target voice information corresponding to each occurrence time period, so that the terminal plays the target voice information corresponding to each occurrence time period in sequence.
It should be noted that: in the voice information playing apparatus provided in the above embodiment, when playing the voice information, only the division of the above functional modules is exemplified, and in practical applications, the above function distribution may be completed by different functional modules according to needs, that is, the internal structures of the terminal and the server are divided into different functional modules, so as to complete all or part of the above described functions. In addition, the voice information playing apparatus and the voice information playing method provided by the above embodiments belong to the same concept, and specific implementation processes thereof are detailed in the method embodiments and are not described herein again.
Fig. 18 is a block diagram illustrating a terminal 1800 according to an exemplary embodiment of the present invention. The terminal 1800 may be a portable mobile terminal such as a smart phone, a tablet computer, an MP3 player (Moving Picture Experts Group Audio Layer III), an MP4 player (Moving Picture Experts Group Audio Layer IV), a notebook computer, a desktop computer, a head-mounted device, a smart television, a smart sound box, a smart remote controller, a smart microphone, or any other smart terminal. The terminal 1800 may also be referred to by other names such as user equipment, portable terminal, laptop terminal, desktop terminal, and the like.
Generally, the terminal 1800 includes: a processor 1801 and a memory 1802.
The processor 1801 may include one or more processing cores, such as a 4-core processor, an 8-core processor, and so on. Memory 1802 may include one or more computer-readable storage media, which may be non-transitory, for storing at least one instruction for processor 1801 to implement a method for playing voice information provided by method embodiments herein.
In some embodiments, the terminal 1800 may further optionally include: a peripheral interface 1803 and at least one peripheral. The processor 1801, memory 1802, and peripheral interface 1803 may be connected by a bus or signal line. Each peripheral device may be connected to the peripheral device interface 1803 by a bus, signal line, or circuit board. Specifically, the peripheral device includes: at least one of radio frequency circuitry 1804, display 1805 and audio circuitry 1806.
The Radio Frequency circuit 1804 is used for receiving and transmitting RF (Radio Frequency) signals, also called electromagnetic signals. The radio frequency circuitry 1804 communicates with communication networks and other communication devices via electromagnetic signals.
The display screen 1805 is used to display a UI (User Interface). The UI may include graphics, text, icons, video, and any combination thereof. The display 1805 may be a touch display and may also be used to provide virtual buttons and/or a virtual keyboard.
The audio circuitry 1806 may include a microphone and a speaker. The microphone is used for collecting audio signals of a user and the environment, converting the audio signals into electric signals, and inputting the electric signals to the processor 1801 for processing or inputting the electric signals to the radio frequency circuit 1804 to achieve voice communication. The microphones may be provided in a plurality, respectively, at different positions of the terminal 1800 for the purpose of stereo sound collection or noise reduction. The microphone may also be an array microphone or an omni-directional pick-up microphone. The speaker is used to convert electrical signals from the processor 1801 or the radio frequency circuitry 1804 to audio signals.
Those skilled in the art will appreciate that the configuration shown in fig. 18 is not intended to be limiting of terminal 1800 and may include more or fewer components than those shown, or some components may be combined, or a different arrangement of components may be used.
Fig. 19 is a schematic structural diagram of a server according to an embodiment of the present invention. The server 1900 may vary considerably in configuration or performance, and may include one or more processors (CPUs) 1901 and one or more memories 1902, where the memory 1902 stores at least one instruction, and the at least one instruction is loaded and executed by the processor 1901 to implement the methods provided by the foregoing method embodiments. Of course, the server may also have components such as a wired or wireless network interface, a keyboard, and an input/output interface for performing input/output, and the server may also include other components for implementing the functions of the device, which are not described herein again.
Server 1900 may be configured to perform the steps performed by the server in the voice message playing method.
An embodiment of the present invention further provides a computer device, where the computer device includes a processor and a memory, where the memory stores at least one instruction, at least one program, a code set, or an instruction set, and the instruction, the program, the code set, or the instruction set is loaded and executed by the processor to implement the voice information playing method of the foregoing embodiments.
An embodiment of the present invention further provides a computer-readable storage medium, where at least one instruction, at least one program, a code set, or an instruction set is stored in the computer-readable storage medium, and the instruction, the program, the code set, or the instruction set is loaded and executed by a processor to implement the voice information playing method of the foregoing embodiments.
It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program instructing relevant hardware, where the program may be stored in a computer-readable storage medium, and the above-mentioned storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.
The above description is only a preferred embodiment of the present invention, and should not be taken as limiting the invention, and any modifications, equivalents, improvements, etc. made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (12)

1. A method for playing voice information, the method comprising:
identifying image information of a video to obtain first playing data, wherein the first playing data comprises text information used for describing the image information; acquiring a preset tag of the video, wherein the preset tag of the video is used for representing the characteristics of the video;
determining a voice generation model corresponding to the preset tag in a model database, wherein the model database comprises a plurality of voice generation models and corresponding preset tags, and the preset tag of each voice generation model is used for representing the characteristics of voice information generated by each voice generation model;
converting the text information in the first playing data into target voice information based on the determined voice generation model, and taking the target voice information as second playing data; when a voice playing instruction for the video is received, according to the second playing data and the sequence of each occurrence time period, sequentially playing voice information corresponding to each occurrence time period;
the video comprises a first video segment and a second video segment, the first video segment comprises image information but no original voice information, and the second video segment comprises both image information and original voice information;
the first playing data comprises text information used for describing the image information of the first video segment and a corresponding occurrence time period; the second playing data comprises target voice information obtained by converting the text information and a corresponding occurrence time period, and original voice information of the second video segment and a corresponding occurrence time period;
the identifying the image information of the video to obtain first playing data comprises:
identifying a target object in the image information to obtain an object identifier belonging to the target object and a corresponding occurrence time period, and taking the object identifier and the corresponding occurrence time period as the first playing data, wherein the object identifier is used for uniquely determining one target object;
alternatively,
dividing the first video segment into a plurality of occurrence time periods, each occurrence time period comprising one or more video frames; identifying the video frames in each occurrence time period to obtain a plurality of pieces of text information respectively corresponding to the plurality of occurrence time periods, and taking the plurality of pieces of text information and the corresponding occurrence time periods as the first playing data; or identifying the video frames in each occurrence time period, and merging a plurality of consecutive identical pieces of text information obtained by the identification into one piece of text information, wherein the piece of text information corresponds to one occurrence time period, the occurrence time period being determined by the occurrence time points of the first and last video frames among the plurality of video frames, and taking the piece of text information and the corresponding occurrence time period as the first playing data.
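By way of a non-limiting illustration of the division-and-merging alternative recited in claim 1, the following minimal Python sketch merges consecutive identical recognition results into single entries and converts them with a speech generation model selected by the preset tag; all names (PlayItem, recognize_frame, model_database, and the like) are illustrative assumptions and are not recited in the claims.

    from dataclasses import dataclass
    from typing import Any, Callable, Dict, List, Tuple

    @dataclass
    class PlayItem:
        start: float  # start of the occurrence time period, in seconds
        end: float    # end of the occurrence time period, in seconds
        text: str     # text information describing the image information

    def build_first_playing_data(frames: List[Tuple[float, Any]],
                                 recognize_frame: Callable[[Any], str]) -> List[PlayItem]:
        """Recognize each video frame of the first (silent) video segment and merge
        consecutive identical recognition results into one text/time-period entry."""
        items: List[PlayItem] = []
        for timestamp, frame in frames:
            text = recognize_frame(frame)
            if items and items[-1].text == text:
                # Same description as the previous frame: extend the occurrence
                # time period up to the current frame's occurrence time point.
                items[-1].end = timestamp
            else:
                items.append(PlayItem(start=timestamp, end=timestamp, text=text))
        return items

    def build_second_playing_data(first_data: List[PlayItem],
                                  preset_tag: str,
                                  model_database: Dict[str, Callable[[str], bytes]]
                                  ) -> List[Tuple[float, float, bytes]]:
        """Select the speech generation model that matches the video's preset tag and
        convert each piece of text information into target voice information."""
        tts = model_database[preset_tag]  # speech generation model chosen by the preset tag
        return [(item.start, item.end, tts(item.text)) for item in first_data]

At playback time, upon receipt of the voice playing instruction, the resulting (start, end, audio) entries would simply be played in ascending order of their occurrence time periods.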
2. The method according to claim 1, wherein when the voice playing instruction for the video is received, before the voice information corresponding to each occurrence time period is sequentially played according to the sequence of each occurrence time period, the method further comprises:
displaying a playing interface of the video, wherein the playing interface comprises a voice playing option, and when a triggering operation on the voice playing option is detected, determining that the voice playing instruction is received; alternatively,
receiving input voice information, and determining to receive the voice playing instruction when the voice information contains a voice playing keyword.
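As a minimal sketch of the two trigger conditions of claim 2 (interface option versus spoken keyword), the following Python function is illustrative only; the keyword list and parameter names are assumptions and are not part of the claim.

    from typing import Optional

    # Illustrative voice playing keywords; the actual keywords are not specified in the claim.
    PLAY_KEYWORDS = ("play voice", "voice play", "start narration")

    def voice_play_instruction_received(option_triggered: bool,
                                        recognized_speech: Optional[str]) -> bool:
        """The instruction is deemed received when the voice playing option on the
        playing interface is triggered, or when the input voice information
        contains a voice playing keyword."""
        if option_triggered:
            return True
        if recognized_speech:
            lowered = recognized_speech.lower()
            return any(keyword in lowered for keyword in PLAY_KEYWORDS)
        return False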
3. The method of claim 1, further comprising:
when the voice playing instruction is received under the condition that the playing interface of the video is displayed, displaying prompt information, wherein the prompt information is used for prompting a user to close the playing interface of the video;
and when a confirmation instruction for the prompt information is received, closing the playing interface.
4. The method according to claim 1, wherein, if the object identifier and the corresponding occurrence time period are used as the first playing data, the acquiring of the second playing data according to the first playing data includes:
and converting the object identifier into target voice information, and taking the target voice information and the corresponding occurrence time period as the second playing data.
5. The method according to claim 4, wherein the target object includes at least two of a person, a background, or an action, and the identifying of the target object in the image information to obtain an object identifier belonging to the target object and a corresponding occurrence time period, and taking the object identifier and the corresponding occurrence time period as the first playing data includes:
identifying at least two of the persons, the backgrounds, or the actions in the image information respectively to obtain at least two playing data items, wherein each playing data item comprises an object identifier belonging to the same target object and a corresponding occurrence time period;
and combining the object identifiers corresponding to the same occurrence time period in the at least two playing data items according to a preset sentence pattern structure to form text information, and taking the text information and the corresponding occurrence time period as the first playing data.
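To illustrate the combination step of claim 5, the sketch below merges the object identifiers that share an occurrence time period into one piece of text information using an assumed sentence pattern structure; the pattern string and all names are illustrative assumptions rather than elements of the claim.

    from collections import defaultdict
    from typing import Dict, Iterable, List, Tuple

    # An assumed preset sentence pattern structure (not recited in the claim).
    SENTENCE_PATTERN = "{person} is {action} in {background}"

    Period = Tuple[float, float]  # occurrence time period (start, end)

    def combine_playing_data_items(items: Iterable[Tuple[str, str, Period]]
                                   ) -> List[Tuple[Period, str]]:
        """Each item is (object_type, object_identifier, occurrence_time_period).
        Object identifiers sharing the same occurrence time period are combined
        into text information according to the sentence pattern structure."""
        by_period: Dict[Period, Dict[str, str]] = defaultdict(dict)
        for object_type, object_id, period in items:
            by_period[period][object_type] = object_id
        first_playing_data: List[Tuple[Period, str]] = []
        for period in sorted(by_period):
            objects = by_period[period]
            text = SENTENCE_PATTERN.format(
                person=objects.get("person", "someone"),
                action=objects.get("action", "appearing"),
                background=objects.get("background", "the scene"),
            )
            first_playing_data.append((period, text))
        return first_playing_data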
6. The method according to claim 4, wherein the target object includes a person, and the identifying of the target object in the image information to obtain an object identifier belonging to the target object and a corresponding occurrence time period, and taking the object identifier and the corresponding occurrence time period as the first playing data comprises:
identifying the face in the image information to obtain face features and corresponding occurrence time periods;
acquiring a person identifier corresponding to the face features based on a person classification model;
and taking the person identifier corresponding to the face features and the corresponding occurrence time period as the first playing data.
7. The method according to claim 6, wherein the person classification model comprises a plurality of person classification submodels, each person classification submodel having a corresponding person identifier;
the acquiring of the person identifier corresponding to the face features based on the person classification model includes:
respectively acquiring classification identifiers of the face features based on the plurality of person classification submodels, wherein each classification identifier is a first identifier or a second identifier, the first identifier representing that the face features match the person corresponding to the person classification submodel, and the second identifier representing that the face features do not match the person corresponding to the person classification submodel;
and when the classification identifier obtained based on any person classification submodel is the first identifier, taking the person identifier corresponding to that person classification submodel as the person identifier corresponding to the face features.
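A minimal sketch of the one-classifier-per-person arrangement of claims 6 and 7 follows; the submodel interface (a callable returning 1 for matched and 0 for not matched) and the dictionary keyed by person identifier are assumptions made for illustration.

    from typing import Callable, Dict, Optional, Sequence

    FIRST_IDENTIFIER = 1   # classification identifier meaning "matched"
    SECOND_IDENTIFIER = 0  # classification identifier meaning "not matched"

    def person_identifier_for_face(face_features: Sequence[float],
                                   submodels: Dict[str, Callable[[Sequence[float]], int]]
                                   ) -> Optional[str]:
        """Run every person classification submodel on the face features; the person
        identifier of any submodel returning the first (matched) identifier is taken
        as the person identifier corresponding to the face features."""
        for person_id, submodel in submodels.items():
            if submodel(face_features) == FIRST_IDENTIFIER:
                return person_id
        return None  # no person classification submodel matched the face features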
8. A method for playing voice information, the method comprising:
identifying image information of a video to obtain first playing data, wherein the first playing data comprises text information used for describing the image information; acquiring a preset tag of the video, wherein the preset tag of the video is used for representing the characteristics of the video;
determining a voice generation model corresponding to the preset tag in a model database, wherein the model database comprises a plurality of voice generation models and corresponding preset tags, and the preset tag of each voice generation model is used for representing the characteristics of voice information generated by each voice generation model;
converting the text information in the first playing data into target voice information based on the determined voice generation model, and taking the target voice information as second playing data; receiving a voice playing instruction of a terminal for the video; sending the target voice information to the terminal according to the second playing data, so that the terminal plays the target voice information;
the video comprises a first video segment and a second video segment, the first video segment comprises image information but no original voice information, and the second video segment comprises both image information and original voice information;
the first playing data comprises text information used for describing the image information of the first video segment and a corresponding occurrence time period; the second playing data comprises target voice information obtained by converting the text information and a corresponding occurrence time period, and original voice information of the second video segment and a corresponding occurrence time period;
the sending the target voice information to the terminal according to the second playing data so that the terminal plays the target voice information comprises: according to the sequence of the occurrence time periods corresponding to the target voice information and the original voice information, sequentially sending the voice information corresponding to each occurrence time period to the terminal, so that the terminal sequentially plays the received voice information in the order in which it is received;
the identifying the image information of the video to obtain first playing data comprises:
identifying a target object in the image information to obtain an object identifier belonging to the target object and a corresponding occurrence time period, and taking the object identifier and the corresponding occurrence time period as the first playing data, wherein the object identifier is used for uniquely determining one target object;
alternatively,
dividing the first video segment into a plurality of occurrence time periods, each occurrence time period comprising one or more video frames; identifying the video frames in each occurrence time period to obtain a plurality of pieces of text information respectively corresponding to the plurality of occurrence time periods, and taking the plurality of pieces of text information and the corresponding occurrence time periods as the first playing data; or identifying the video frames in each occurrence time period, and merging a plurality of consecutive identical pieces of text information obtained by the identification into one piece of text information, wherein the piece of text information corresponds to one occurrence time period, the occurrence time period being determined by the occurrence time points of the first and last video frames among the plurality of video frames; and taking the piece of text information and the corresponding occurrence time period as the first playing data.
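The server-side sending step of claim 8 can be pictured with the sketch below, which merges the generated target voice information with the original voice information of the second video segment and sends the entries to the terminal in ascending order of their occurrence time periods; the entry format and the send_to_terminal callback are illustrative assumptions, not features recited in the claim.

    from typing import Callable, Iterable, List, Tuple

    VoiceEntry = Tuple[float, float, bytes]  # (period start, period end, voice payload)

    def send_voice_information_in_order(target_voice: Iterable[VoiceEntry],
                                        original_voice: Iterable[VoiceEntry],
                                        send_to_terminal: Callable[[VoiceEntry], None]) -> None:
        """The second playing data is the union of the generated target voice
        information (covering the silent first video segment) and the original
        voice information (of the second video segment). Entries are sent in
        ascending order of their occurrence time periods, so the terminal can
        simply play them in the order in which they arrive."""
        merged: List[VoiceEntry] = sorted(list(target_voice) + list(original_voice),
                                          key=lambda entry: entry[0])
        for entry in merged:
            send_to_terminal(entry)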
9. A voice information playing apparatus, characterized in that the apparatus comprises:
the identification module is used for identifying image information of a video to obtain first playing data, and the first playing data comprises text information used for describing the image information;
the acquisition module is used for acquiring a preset tag of the video, wherein the preset tag of the video is used for representing the characteristics of the video;
the acquisition module is further configured to determine a voice generation model corresponding to the preset tag in a model database, where the model database includes a plurality of voice generation models and corresponding preset tags, and the preset tag of each voice generation model is used to represent characteristics of the voice information generated by each voice generation model;
the acquisition module is further configured to convert the text information in the first playing data into target voice information based on the determined voice generation model, and use the target voice information as second playing data;
the playing module is used for sequentially playing the voice information corresponding to each occurrence time period according to the second playing data and the sequence of each occurrence time period when receiving the voice playing instruction of the video;
the video comprises a first video segment and a second video segment, the first video segment comprises image information but no original voice information, and the second video segment comprises both image information and original voice information;
the first playing data comprises text information used for describing the image information of the first video segment and a corresponding occurrence time period; the second playing data comprises target voice information obtained by converting the text information and a corresponding occurrence time period, and original voice information of the second video segment and a corresponding occurrence time period;
the identification module is configured to:
identifying a target object in the image information to obtain an object identifier belonging to the target object and a corresponding occurrence time period, and taking the object identifier and the corresponding occurrence time period as the first playing data, wherein the object identifier is used for uniquely determining one target object;
alternatively,
dividing the first video segment into a plurality of occurrence time periods, each occurrence time period comprising one or more video frames; identifying the video frames in each occurrence time period to obtain a plurality of pieces of text information respectively corresponding to the plurality of occurrence time periods, and taking the plurality of pieces of text information and the corresponding occurrence time periods as the first playing data; or identifying the video frames in each occurrence time period, and merging a plurality of consecutive identical pieces of text information obtained by the identification into one piece of text information, wherein the piece of text information corresponds to one occurrence time period, the occurrence time period being determined by the occurrence time points of the first and last video frames among the plurality of video frames; and taking the piece of text information and the corresponding occurrence time period as the first playing data.
10. A voice information playing apparatus, characterized in that the apparatus comprises:
the identification module is used for identifying image information of a video to obtain first playing data, and the first playing data comprises text information used for describing the image information;
the acquisition module is used for acquiring a preset tag of the video, wherein the preset tag of the video is used for representing the characteristics of the video;
the acquisition module is further configured to determine a voice generation model corresponding to the preset tag in a model database, where the model database includes a plurality of voice generation models and corresponding preset tags, and the preset tag of each voice generation model is used to represent characteristics of the voice information generated by each voice generation model;
the acquisition module is further configured to convert the text information in the first playing data into target voice information based on the determined voice generation model, and use the target voice information as second playing data;
the receiving module is used for receiving a voice playing instruction of the terminal for the video;
a sending module, configured to send the target voice information to the terminal according to the second playing data, so that the terminal plays the target voice information;
the video comprises a first video segment and a second video segment, the first video segment comprises image information but no original voice information, and the second video segment comprises both image information and original voice information;
the first playing data comprises text information used for describing the image information of the first video segment and a corresponding occurrence time period;
the second playing data comprises target voice information obtained by converting the text information and a corresponding occurrence time period, and original voice information of the second video segment and a corresponding occurrence time period;
the sending module comprises: a sending unit, used for sequentially sending the voice information corresponding to each occurrence time period to the terminal according to the sequence of each occurrence time period;
the identification module is configured to:
identifying a target object in the image information to obtain an object identifier belonging to the target object and a corresponding occurrence time period, and taking the object identifier and the corresponding occurrence time period as the first playing data, wherein the object identifier is used for uniquely determining one target object;
alternatively,
dividing the first video segment into a plurality of occurrence time periods, each occurrence time period comprising one or more video frames; identifying the video frames in each occurrence time period to obtain a plurality of pieces of text information respectively corresponding to the plurality of occurrence time periods, and taking the plurality of pieces of text information and the corresponding occurrence time periods as the first playing data; or identifying the video frames in each occurrence time period, and merging a plurality of consecutive identical pieces of text information obtained by the identification into one piece of text information, wherein the piece of text information corresponds to one occurrence time period, the occurrence time period being determined by the occurrence time points of the first and last video frames among the plurality of video frames; and taking the piece of text information and the corresponding occurrence time period as the first playing data.
11. A computer device, comprising a processor and a memory, wherein the memory stores at least one instruction, at least one program, a code set, or an instruction set, which is loaded and executed by the processor to implement the operations performed in the voice information playing method according to any one of claims 1 to 7, or the operations performed in the voice information playing method according to claim 8.
12. A computer-readable storage medium, in which at least one instruction, at least one program, a code set, or an instruction set is stored, the instruction, program, code set, or instruction set being loaded and executed by a processor to implement the operations performed in the voice information playing method according to any one of claims 1 to 7, or the operations performed in the voice information playing method according to claim 8.
CN201910831934.2A 2019-09-04 2019-09-04 Voice information playing method and device, computer equipment and storage medium Active CN110519636B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910831934.2A CN110519636B (en) 2019-09-04 2019-09-04 Voice information playing method and device, computer equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910831934.2A CN110519636B (en) 2019-09-04 2019-09-04 Voice information playing method and device, computer equipment and storage medium

Publications (2)

Publication Number Publication Date
CN110519636A (en) 2019-11-29
CN110519636B (en) 2021-12-21

Family

ID=68630796

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910831934.2A Active CN110519636B (en) 2019-09-04 2019-09-04 Voice information playing method and device, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN110519636B (en)

Families Citing this family (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110868635B (en) * 2019-12-04 2021-01-12 深圳追一科技有限公司 Video processing method and device, electronic equipment and storage medium
CN111163138B (en) * 2019-12-18 2022-04-12 北京智明星通科技股份有限公司 Method, device and server for reducing network load during game
CN111158630B (en) * 2019-12-25 2023-06-23 网易(杭州)网络有限公司 Playing control method and device
CN111916050A (en) * 2020-08-03 2020-11-10 北京字节跳动网络技术有限公司 Speech synthesis method, speech synthesis device, storage medium and electronic equipment
CN112911384A (en) * 2021-01-20 2021-06-04 三星电子(中国)研发中心 Video playing method and video playing device
CN113225615B (en) * 2021-04-20 2023-08-08 深圳市九洲电器有限公司 Television program playing method, terminal equipment, server and storage medium
CN113190697A (en) * 2021-06-02 2021-07-30 口碑(上海)信息技术有限公司 Image information playing method and device
CN113709548B (en) * 2021-08-09 2023-08-25 北京达佳互联信息技术有限公司 Image-based multimedia data synthesis method, device, equipment and storage medium
CN113672193B (en) * 2021-08-23 2024-05-14 维沃移动通信有限公司 Audio data playing method and device
CN113891133B (en) * 2021-12-06 2022-04-22 阿里巴巴达摩院(杭州)科技有限公司 Multimedia information playing method, device, equipment and storage medium
CN114566060B (en) * 2022-02-23 2023-03-24 成都智元汇信息技术股份有限公司 Public transport message notification processing method, device, system, electronic device and medium
CN114945108A (en) * 2022-05-14 2022-08-26 云知声智能科技股份有限公司 Method and device for assisting vision-impaired person in understanding picture

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP5201050B2 (en) * 2009-03-27 2013-06-05 ブラザー工業株式会社 Conference support device, conference support method, conference system, conference support program
CN105827516B (en) * 2016-05-09 2019-06-21 腾讯科技(深圳)有限公司 Message treatment method and device
CN106446782A (en) * 2016-08-29 2017-02-22 北京小米移动软件有限公司 Image identification method and device
CN110110145B (en) * 2018-01-29 2023-08-22 腾讯科技(深圳)有限公司 Descriptive text generation method and device
CN108965920A (en) * 2018-08-08 2018-12-07 北京未来媒体科技股份有限公司 A kind of video content demolition method and device

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107632814A (en) * 2017-09-25 2018-01-26 珠海格力电器股份有限公司 Player method, device and system, storage medium, the processor of audio-frequency information
CN109275027A (en) * 2018-09-26 2019-01-25 Tcl海外电子(惠州)有限公司 Speech output method, electronic playback devices and the storage medium of video

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Information retrieval using speech recognition; Chen Haiying et al.; Journal of the China Society for Scientific and Technical Information (情报学报); 2003-02-24 (Issue 01); full text *

Also Published As

Publication number Publication date
CN110519636A (en) 2019-11-29

Similar Documents

Publication Publication Date Title
CN110519636B (en) Voice information playing method and device, computer equipment and storage medium
CN108304441B (en) Network resource recommendation method and device, electronic equipment, server and storage medium
US11670015B2 (en) Method and apparatus for generating video
CN110517689B (en) Voice data processing method, device and storage medium
CN112616063A (en) Live broadcast interaction method, device, equipment and medium
CN112040263A (en) Video processing method, video playing method, video processing device, video playing device, storage medium and equipment
EP4099709A1 (en) Data processing method and apparatus, device, and readable storage medium
CN109992237B (en) Intelligent voice equipment control method and device, computer equipment and storage medium
CN116484318B (en) Lecture training feedback method, lecture training feedback device and storage medium
CN112235635B (en) Animation display method, animation display device, electronic equipment and storage medium
CN107507620A (en) A kind of voice broadcast sound method to set up, device, mobile terminal and storage medium
US20230368461A1 (en) Method and apparatus for processing action of virtual object, and storage medium
CN111368127B (en) Image processing method, image processing device, computer equipment and storage medium
CN112653902A (en) Speaker recognition method and device and electronic equipment
CN112738557A (en) Video processing method and device
CN113766299A (en) Video data playing method, device, equipment and medium
CN110990632B (en) Video processing method and device
CN115798459A (en) Audio processing method and device, storage medium and electronic equipment
CN112188116B (en) Video synthesis method, client and system based on object
CN112261321B (en) Subtitle processing method and device and electronic equipment
CN116561294A (en) Sign language video generation method and device, computer equipment and storage medium
CN116453005A (en) Video cover extraction method and related device
CN113762056A (en) Singing video recognition method, device, equipment and storage medium
CN113099305A (en) Play control method and device
CN113704544A (en) Video classification method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant