CN112052358A - Method, apparatus, electronic device and computer readable medium for displaying image

Info

Publication number: CN112052358A
Authority: CN (China)
Prior art keywords: image, target, processed, pronunciation information, information
Legal status: Pending
Application number: CN202010929638.9A
Other languages: Chinese (zh)
Inventor: 邓启力
Current Assignee: Beijing ByteDance Network Technology Co Ltd
Original Assignee: Beijing ByteDance Network Technology Co Ltd
Application filed by Beijing ByteDance Network Technology Co Ltd
Priority to application CN202010929638.9A
Publication of CN112052358A

Classifications

    • G06F16/7834 Information retrieval of video data; retrieval characterised by using metadata automatically derived from the content, using audio features
    • G06F16/73 Information retrieval of video data; querying
    • G06F16/783 Information retrieval of video data; retrieval characterised by using metadata automatically derived from the content

Abstract

Embodiments of the present disclosure disclose a method, an apparatus, an electronic device, and a computer-readable medium for displaying an image. One embodiment of the method comprises: extracting audio information and an image to be processed from a video to be processed; establishing a corresponding relation between target pronunciation information contained in the audio information and a target object image contained in the image to be processed; setting a target image corresponding to each target object image contained in the image to be processed, wherein the target image is an image obtained by performing image processing on the target object image; and in response to the triggering of the target pronunciation information in the audio information, displaying a target image corresponding to the target pronunciation information in the video to be processed. This implementation makes the information transfer relationship between the target pronunciation information and the target object image clear to the user, enhances the visual effect, and helps the user obtain information from the video to be processed more effectively.

Description

Method, apparatus, electronic device and computer readable medium for displaying image
Technical Field
Embodiments of the present disclosure relate to the field of computer technologies, and in particular, to a method and an apparatus for displaying an image, an electronic device, and a computer-readable medium.
Background
A video contains various kinds of information, such as audio information and image information, and can provide the user with a vivid visual experience. In practice, however, the relationship between the audio information and the image information in a video is often not obvious, and the user cannot easily recognize it, which reduces the effectiveness with which the user obtains information from the video.
Disclosure of Invention
This summary is provided to introduce a selection of concepts in a simplified form that are further described below in the detailed description. This summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter.
Some embodiments of the present disclosure propose a method, an apparatus, an electronic device, and a computer-readable medium for displaying an image to solve the technical problems mentioned in the background section above.
In a first aspect, some embodiments of the present disclosure provide a method of displaying an image, the method comprising: extracting audio information and an image to be processed from a video to be processed, wherein the audio information comprises at least one piece of pronunciation information, and the image to be processed comprises an object image corresponding to the pronunciation information in the at least one piece of pronunciation information; establishing a corresponding relation between target pronunciation information contained in the audio information and a target object image contained in the image to be processed; setting a target image corresponding to each target object image contained in the image to be processed, wherein the target image is an image obtained by performing image processing on the target object image; and in response to the triggering of the target pronunciation information in the audio information, displaying a target image corresponding to the target pronunciation information in the video to be processed.
In a second aspect, some embodiments of the present disclosure provide an apparatus for displaying an image, the apparatus comprising: an information extraction unit configured to extract audio information and an image to be processed from a video to be processed, the audio information including at least one piece of pronunciation information, and the image to be processed including an object image corresponding to the pronunciation information in the at least one piece of pronunciation information; a relation establishing unit configured to establish a corresponding relation between target pronunciation information contained in the audio information and a target object image contained in the image to be processed; a target object setting unit configured to set a target image corresponding to each target object image contained in the image to be processed, the target image being an image obtained by performing image processing on the target object image; and an image display unit configured to display, in response to the target pronunciation information in the audio information being triggered, a target image corresponding to the target pronunciation information in the video to be processed.
In a third aspect, some embodiments of the present disclosure provide an electronic device, comprising: one or more processors; a memory having one or more programs stored thereon, which when executed by the one or more processors, cause the one or more processors to perform the method of displaying an image of the first aspect.
In a fourth aspect, some embodiments of the present disclosure provide a computer readable medium on which a computer program is stored, wherein the program, when executed by a processor, implements the method of displaying images of the first aspect.
One of the above embodiments of the present disclosure has the following advantageous effects: firstly, audio information and an image to be processed are extracted from a video to be processed; then, the corresponding relation between target pronunciation information contained in the audio information and a target object image contained in the image to be processed is established, which clarifies how the information is transferred. Next, a target image is set for each target object image contained in the image to be processed, which deepens the user's experience of watching the video. Finally, when the target pronunciation information in the audio information is triggered, the target image corresponding to the target pronunciation information is displayed in the video to be processed. In this way, the user can clearly see the information transfer relationship between the target pronunciation information and the target object image, the visual effect is enhanced, and the user can obtain information from the video to be processed more effectively.
Drawings
The above and other features, advantages and aspects of various embodiments of the present disclosure will become more apparent by referring to the following detailed description when taken in conjunction with the accompanying drawings. Throughout the drawings, the same or similar reference numbers refer to the same or similar elements. It should be understood that the drawings are schematic and that elements are not necessarily drawn to scale.
FIG. 1 is a schematic illustration of an application scenario of a method of displaying an image of some embodiments of the present disclosure;
FIG. 2 is a flow diagram of some embodiments of a method of displaying an image according to the present disclosure;
FIG. 3 is a flow diagram of further embodiments of a method of displaying an image according to the present disclosure;
FIG. 4 is a flow diagram of still further embodiments of methods of displaying an image according to the present disclosure;
FIG. 5 is a schematic block diagram of some embodiments of an apparatus for displaying an image according to the present disclosure;
FIG. 6 is a schematic structural diagram of an electronic device suitable for use in implementing some embodiments of the present disclosure.
Detailed Description
Embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While certain embodiments of the present disclosure are shown in the drawings, it is to be understood that the disclosure may be embodied in various forms and should not be construed as limited to the embodiments set forth herein. Rather, these embodiments are provided for a more thorough and complete understanding of the present disclosure. It should be understood that the drawings and embodiments of the disclosure are for illustration purposes only and are not intended to limit the scope of the disclosure.
It should be noted that, for convenience of description, only the portions related to the related invention are shown in the drawings. The embodiments and features of the embodiments in the present disclosure may be combined with each other without conflict.
It should be noted that the terms "first", "second", and the like in the present disclosure are only used for distinguishing different devices, modules or units, and are not used for limiting the order or interdependence relationship of the functions performed by the devices, modules or units.
It is noted that the modifiers "a", "an", and "the" in this disclosure are intended to be illustrative rather than limiting; those skilled in the art will understand that they should be read as "one or more" unless the context clearly indicates otherwise.
The names of messages or information exchanged between devices in the embodiments of the present disclosure are for illustrative purposes only, and are not intended to limit the scope of the messages or information.
The present disclosure will be described in detail below with reference to the accompanying drawings in conjunction with embodiments.
Fig. 1 is a schematic diagram of one application scenario of a method of displaying an image of some embodiments of the present disclosure.
In the application scenario of fig. 1, the electronic device 101 extracts audio information and a to-be-processed image from a to-be-processed video. The image to be processed contains two persons, and the audio information is voice information spoken by a first person to a second person. The electronic device 101 may query the target pronunciation information 102 from the audio information and the target object image 103 corresponding to the target pronunciation information 102. The target pronunciation information 102 may be, for example, "watch a movie" or "go shopping". The target object image 103 may be an ear image of the second person. Then, the electronic device 101 may establish a correspondence relationship between the target pronunciation information 102 and the target object image 103, which may be used to represent that "watch a movie" and "go shopping" act on the ear of the second person. To increase the visual effect when the user views the video to be processed, the electronic device 101 may set the target image 104 of the target object image 103. The target image 104 may be a pair of rabbit ears, which exaggeratedly indicates that the second person is listening to the voice information of the first person, thereby enhancing the visual impression of the user. When the target pronunciation information 102 is played (i.e., triggered) in the video to be processed, the electronic device 101 may display the target image 104.
It should be understood that the number of electronic devices 101 in fig. 1 is merely illustrative. There may be any number of electronic devices 101, as desired for implementation.
With continued reference to fig. 2, fig. 2 illustrates a flow 200 of some embodiments of a method of displaying an image according to the present disclosure. The method for displaying the image comprises the following steps:
step 201, extracting audio information and an image to be processed from a video to be processed.
In some embodiments, the execution subject of the method of displaying an image (e.g., the electronic device 101 shown in fig. 1) may obtain the video to be processed through a wired connection or a wireless connection and extract the audio information and the image to be processed from it. It should be noted that the wireless connection means may include, but are not limited to, a 3G/4G/5G connection, a WiFi connection, a Bluetooth connection, a WiMAX connection, a Zigbee connection, a UWB (Ultra Wide Band) connection, and other wireless connection means now known or developed in the future.
The execution subject may first obtain the video to be processed. The video to be processed may be a real-time video containing a face image, or a non-real-time video containing a face image. The face image may be a human face image, an animal face image, a statue face image, or the like. The execution subject may then extract the audio information and the image to be processed from the video to be processed. The audio information may include at least one piece of pronunciation information. The pronunciation information may be any kind of audio, for example, human voice audio, bird audio, automobile engine audio, or ambient noise. The image to be processed may include an object image corresponding to pronunciation information in the at least one piece of pronunciation information. The execution subject may perform image recognition on the image to be processed, first recognizing the head image of a person, an animal or a statue, and then specifically recognizing the various parts (such as eyes, nose, mouth, etc.) contained in the head image. There may be many kinds of pronunciation information, some of which have a corresponding object image and some of which do not. For example, if the pronunciation information is speech uttered by a first person to a second person, the object image may be an ear of the second person; if the pronunciation information is ambient noise, there may be no object image.
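By way of illustration only, and not as part of the disclosed embodiments, the following Python sketch shows one possible way to carry out this extraction step. It assumes the ffmpeg command-line tool and the OpenCV library are available; the function name, file names, and sampling interval are assumptions of the example.

```python
import subprocess
import cv2

def extract_audio_and_frames(video_path, audio_path="audio.wav", frame_step=5):
    """Split a video to be processed into an audio track and sampled frames."""
    # Extract the audio information as 16 kHz mono PCM using ffmpeg.
    subprocess.run(
        ["ffmpeg", "-y", "-i", video_path, "-vn",
         "-acodec", "pcm_s16le", "-ar", "16000", "-ac", "1", audio_path],
        check=True,
    )

    # Extract images to be processed, keeping every frame_step-th frame.
    frames = []
    cap = cv2.VideoCapture(video_path)
    index = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if index % frame_step == 0:
            frames.append(frame)
        index += 1
    cap.release()
    return audio_path, frames
```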
Step 202, establishing a corresponding relationship between the target pronunciation information contained in the audio information and the target object image contained in the image to be processed.
As can be seen from the above description, some pronunciation information may have a corresponding object image. For convenience of analysis, the execution subject may take pronunciation information that has an object image as target pronunciation information, and take that object image as the target object image corresponding to the target pronunciation information. The execution subject may then establish a correspondence between the target pronunciation information and the target object image. It should be noted that one piece of target pronunciation information may correspond to a plurality of target object images. For example, if the video to be processed is a classroom video in which one teacher lectures and a plurality of students listen, the target pronunciation information may be the teacher's voice, and the target object images may be the ear images of the students. A plurality of pieces of target pronunciation information may also correspond to one target object image. For example, if a plurality of students speak at the same time and the teacher listens, the target pronunciation information is the voice of each student, and the target object image may be an image of the teacher's ear. A plurality of pieces of target pronunciation information may likewise correspond to a plurality of target object images. For example, if several people sing a song at the same time and several listeners listen, the target pronunciation information is each singer's voice, and the target object images are the ear images of the listeners. In this way the information transfer relationship is determined, which improves the efficiency with which the user obtains information.
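As a non-authoritative illustration, such a many-to-many correspondence could be recorded with a simple mapping keyed by a pronunciation-segment identifier; the data-class fields below are assumptions of the example, not terms used by the disclosure.

```python
from dataclasses import dataclass

@dataclass
class PronunciationInfo:
    segment_id: str      # e.g. "teacher_speech_1"
    start_time: float    # trigger time in seconds within the video
    end_time: float

@dataclass
class ObjectImage:
    object_id: str       # e.g. "student_3_ear"
    bbox: tuple          # (x, y, w, h) region inside the image to be processed

# One piece of target pronunciation information may map to several target
# object images, and one object image may appear under several utterances.
correspondence: dict[str, list[ObjectImage]] = {}

def add_correspondence(info: PronunciationInfo, image: ObjectImage) -> None:
    """Record that `image` corresponds to the target pronunciation `info`."""
    correspondence.setdefault(info.segment_id, []).append(image)
```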
Step 203, setting a target image corresponding to each target object image contained in the image to be processed.
In order to deepen the user's experience of watching the video, the execution subject may set a target image for each target object image. The target image may be an image obtained by performing image processing on the target object image. For example, the target image may be an enlarged or deformed version of the target object image. The target image may also be a substitute image for the corresponding target object image; in fig. 1, for example, the "ear" of the second person is represented by a pair of "rabbit ears". Further, the target image may also be an image to which other special effects (e.g., a train whistle, flashing stars, etc.) have been added.
And step 204, in response to that the target pronunciation information in the audio information is triggered, displaying a target image corresponding to the target pronunciation information in the video to be processed.
When the target pronunciation information in the audio information is triggered (for example, when the target pronunciation information is played), the execution subject may display the target image corresponding to the target pronunciation information in the video to be processed. In this way, the user can clearly determine the information transfer relationship between the target pronunciation information and the target object image, the visual effect is enhanced, and the accuracy and effectiveness with which the user obtains information from the video to be processed are improved.
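A minimal, purely illustrative sketch of this trigger-and-display step is shown below, reusing the hypothetical correspondence structure from above; the overlay helper and its parameters are assumptions of the example.

```python
import cv2

def overlay(frame, target_image, bbox):
    """Paste the target image over the region given by bbox = (x, y, w, h)."""
    x, y, w, h = bbox
    frame[y:y + h, x:x + w] = cv2.resize(target_image, (w, h))
    return frame

def render_frame(frame, playback_time, segments, correspondence, target_images):
    """When a pronunciation segment is being played, display its target images."""
    for seg in segments:                                   # list[PronunciationInfo]
        if seg.start_time <= playback_time <= seg.end_time:  # segment triggered
            for obj in correspondence.get(seg.segment_id, []):
                frame = overlay(frame, target_images[obj.object_id], obj.bbox)
    return frame
```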
According to the method for displaying an image disclosed by some embodiments of the disclosure, audio information and an image to be processed are first extracted from a video to be processed, and the corresponding relation between target pronunciation information contained in the audio information and a target object image contained in the image to be processed is then established, which clarifies how the information is transferred. Next, a target image is set for each target object image contained in the image to be processed, which deepens the user's experience of watching the video. Finally, when the target pronunciation information in the audio information is triggered, the target image corresponding to the target pronunciation information is displayed in the video to be processed. In this way, the user can clearly see the information transfer relationship between the target pronunciation information and the target object image, the visual effect is enhanced, and the user can obtain information from the video to be processed more effectively.
With continued reference to fig. 3, fig. 3 illustrates a flow 300 of further embodiments of a method of displaying an image according to the present disclosure. The method for displaying the image comprises the following steps:
step 301, extracting audio information and an image to be processed from a video to be processed.
The content of step 301 is the same as that of step 201, and is not described in detail here.
In some optional implementations of some embodiments, the extracting the audio information and the image to be processed from the video to be processed may include the following steps:
the method comprises the following steps of firstly, identifying at least one object content in the image to be processed.
The execution subject may recognize at least one object content in the image to be processed through a plurality of image recognition methods. Wherein the object content includes at least one of: face image, mouth image, ear image.
In the second step, a contour line is set for each object content in the at least one object content to obtain an object image.
In order to be able to display object contents accurately, the execution subject may set a contour line for each object content. The contour line is used to identify the boundary of the object content. That is, the object image is an image obtained by adding a contour line to the object content.
Step 302, for each pronunciation information included in the audio information, determining a trigger time of the pronunciation information based on a time stamp.
Generally, the video to be processed may include timestamps, which may be used to mark the playing time of the video to be processed and may also be used to mark the correspondence between the audio information and the image to be processed. The pronunciation information contained in the audio information occurs at a corresponding time. Accordingly, the execution subject can determine the trigger time of the pronunciation information from the timestamp. The trigger time may be used to represent the time at which the pronunciation information is played. For example, if the timestamp corresponding to the trigger time of a piece of pronunciation information is 1 minute 20 seconds, that pronunciation information is played at 1 minute 20 seconds.
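For illustration only, the sketch below shows one way trigger times could be estimated from the extracted audio track, by finding where the short-time energy first rises above a threshold; the window size and threshold are assumptions of the example, and a real implementation could equally read the trigger times directly from the container's timestamps.

```python
import wave
import numpy as np

def pronunciation_trigger_times(audio_path, window_s=0.05, threshold=500):
    """Return approximate trigger times (in seconds) at which pieces of
    pronunciation information start being played in a 16-bit mono WAV file."""
    with wave.open(audio_path, "rb") as wav:
        rate = wav.getframerate()
        samples = np.frombuffer(wav.readframes(wav.getnframes()), dtype=np.int16)

    window = int(rate * window_s)
    triggers = []
    active = False
    for start in range(0, len(samples), window):
        chunk = samples[start:start + window].astype(np.int32)
        loud = np.abs(chunk).mean() > threshold
        if loud and not active:      # rising edge marks a new trigger time
            triggers.append(start / rate)
        active = loud
    return triggers
```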
Step 303, in response to the object image being deformed within a set time period after the trigger time, marking the pronunciation information as target pronunciation information and marking the object image as a target object image.
In order to establish the correspondence between the pronunciation information and the object image, the execution subject may examine the correlation between the pronunciation information and the object image over time. Generally, after the pronunciation information is played, the object image corresponding to it changes accordingly. For example, when a first person tells a joke and a second person hears it, the target pronunciation information is the audio of the joke, and the target object images may be the mouth image and the ear image of the second person. Although the ear image shows no obvious deformation, the audio first enters the ears and is then understood by the brain, which controls the mouth to deform; therefore, the ear image can also be regarded as a target object image.
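As one possible, purely illustrative stand-in for the deformation test, the sketch below compares the object's image region across the frames that fall within the set time period after the trigger time; the frame-differencing approach, window length, and threshold are assumptions of the example.

```python
import cv2

def object_deformed(frames, fps, bbox, trigger_time, window_s=1.0, threshold=12.0):
    """Return True if the region bbox = (x, y, w, h) changes noticeably within
    window_s seconds after trigger_time (a crude deformation test).
    `frames` is assumed to contain every frame of the video, in order."""
    x, y, w, h = bbox
    start = int(trigger_time * fps)
    end = min(int((trigger_time + window_s) * fps), len(frames) - 1)
    if start >= len(frames) or end <= start:
        return False
    first = cv2.cvtColor(frames[start][y:y + h, x:x + w], cv2.COLOR_BGR2GRAY)
    for i in range(start + 1, end + 1):
        crop = cv2.cvtColor(frames[i][y:y + h, x:x + w], cv2.COLOR_BGR2GRAY)
        if cv2.absdiff(first, crop).mean() > threshold:
            return True
    return False
```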
Step 304, establishing the corresponding relation between the target pronunciation information and the target object image.
After the target pronunciation information and the target object image are determined, the execution subject may establish a correspondence between the target pronunciation information and the target object image. For example, when the target pronunciation information is played for a set time (for example, 2 seconds), the target object image may be displayed.
Step 305, setting a target image corresponding to each target object image contained in the image to be processed.
Step 306, in response to the triggering of the target pronunciation information in the audio information, displaying a target image corresponding to the target pronunciation information in the video to be processed.
The contents of steps 305 to 306 are the same as those of steps 203 to 204, and are not described in detail here.
Fig. 3 illustrates an embodiment, in which a target object image corresponding to target pronunciation information is determined by a timestamp of a video to be processed. Therefore, the corresponding relation between the target pronunciation information and the target object image is established, and the accuracy of displaying the target image when the target pronunciation information is triggered is improved.
With continued reference to fig. 4, fig. 4 illustrates a flow 400 of still further embodiments of a method of displaying an image according to the present disclosure. The method for displaying the image comprises the following steps:
step 401, extracting audio information and an image to be processed from a video to be processed.
Step 402, establishing a corresponding relationship between the target pronunciation information contained in the audio information and the target object image contained in the image to be processed.
Steps 401 to 402 are the same as steps 201 to 202, and are not described in detail here.
And 403, amplifying the target object image according to each amplification factor in the preset amplification factor sequence to obtain a target image sequence.
The execution subject may enlarge the target object image to obtain a target image corresponding to the target object image. When the target pronunciation information is played, the user sees the target image, i.e. an enlarged version of the target object image, which improves the accuracy and effectiveness of information acquisition when the user views the video to be processed. Specifically, the execution subject may enlarge the target object image according to each magnification in a preset magnification sequence, obtaining a target image sequence. The magnifications in the magnification sequence may increase in order. Further, each magnification in the magnification sequence may correspond to a volume amplitude of the pronunciation information.
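A minimal illustrative sketch of building such a target image sequence is given below; the particular magnification values and the use of OpenCV's resize are assumptions of the example.

```python
import cv2

def build_target_image_sequence(target_object_image,
                                magnifications=(1.2, 1.5, 1.8, 2.2)):
    """Enlarge the target object image once for each magnification in a preset,
    increasing magnification sequence, yielding the target image sequence."""
    return [
        cv2.resize(target_object_image, None, fx=m, fy=m,
                   interpolation=cv2.INTER_LINEAR)
        for m in magnifications
    ]
```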
Step 404, determining a target volume amplitude of the target pronunciation information, and selecting a corresponding target image from the target image sequence for display based on the target volume amplitude.
As is apparent from the above description, the target image sequence contains the target images obtained by enlarging the target object image at each magnification in the preset magnification sequence. On this basis, the execution subject may establish a correspondence between each magnification in the magnification sequence and a volume amplitude of the pronunciation information. The execution subject may determine the target volume amplitude of the target pronunciation information and then select the target image corresponding to that target volume amplitude from the target image sequence for display. In this way, the correspondence between the target pronunciation information and the target image is established through the volume amplitude: when the target volume amplitude is smaller, the displayed target image is smaller; when the target volume amplitude is larger, the displayed target image is larger. That is, the size of the target image differs for different volume amplitudes, and so does the visual experience of the user viewing it. The correspondence between the target pronunciation information and the target image can therefore be displayed vividly, improving the accuracy and effectiveness of information acquisition when the user watches the video to be processed.
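The following illustrative sketch maps a measured target volume amplitude onto an index into the target image sequence built above; the volume bounds and the linear mapping are assumptions of the example.

```python
import numpy as np

def select_target_image(target_volume, target_images, volume_bounds=(500, 8000)):
    """Pick the target image whose magnification corresponds to the target
    volume amplitude: a quiet utterance selects a small image, a loud one a
    large image."""
    low, high = volume_bounds
    ratio = float(np.clip((target_volume - low) / (high - low), 0.0, 1.0))
    index = int(round(ratio * (len(target_images) - 1)))
    return target_images[index]
```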
In some optional implementations of some embodiments, the selecting, based on the target volume amplitude, a corresponding target image from the target image sequence for display may include:
in the first step, an image axis and a rotation angle are set for the target image in response to the target volume amplitude being greater than a set volume threshold.
When the target volume amplitude is greater than the set volume threshold, the target volume amplitude may be considered to exceed the normal volume range in terms of the user's auditory perception. In this case, the execution subject may set an image axis and a rotation angle for the target image. The image axis may be a line through the target image. The rotation angle may be any angle; typically, it may take a value between 10° and 30°. The rotation angle may be set with respect to the image axis: for example, depending on actual needs, the image axis may be one side of the rotation angle, or the bisector of the rotation angle.
In the second step, a first image and a second image are generated based on the target image.
The executing subject may generate a first image and a second image from the target image. The first image and the second image may be the same as the target image. A first image axis included in the first image and a second image axis included in the second image may correspond to an image axis of the target image, respectively.
In the third step, the first image axis of the first image is set to coincide with one side of the rotation angle, and the second image axis of the second image is set to coincide with the other side of the rotation angle.
The execution subject may make the first image axis and the second image axis coincide with the two sides of the rotation angle, respectively. In this way, the positional and angular relationship between the first image and the second image is determined. Further, the first image and the second image may be set to coincide in angle, meaning that if the two sides of the rotation angle were made to coincide, the first image and the second image would also coincide.
In the fourth step, the first image and the second image are displayed alternately at a set frequency.
The execution subject may set a frequency at which the first image and the second image are displayed alternately. As a result, when the target volume amplitude is greater than the set volume threshold, the target image shows a dynamic effect. For example, if the target pronunciation information is a joke and the target image is an ear image, then when the target volume amplitude of the target pronunciation information is greater than the set volume threshold, the ear images are displayed alternately so that the ear appears to vibrate. This deepens the user's experience and improves the accuracy and effectiveness with which the user obtains information.
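An illustrative sketch of this alternating display is given below: two copies of the target image are rotated to either side of an assumed vertical image axis through the image centre, and one of them is chosen according to the playback time and a set frequency. The angle, frequency, and centre of rotation are assumptions of the example.

```python
import cv2

def make_shaking_pair(target_image, angle_deg=20.0):
    """Create a first and a second image, rotated to either side of an image
    axis taken here as a vertical line through the image centre."""
    h, w = target_image.shape[:2]
    center = (w / 2, h / 2)
    first = cv2.warpAffine(
        target_image, cv2.getRotationMatrix2D(center, angle_deg / 2, 1.0), (w, h))
    second = cv2.warpAffine(
        target_image, cv2.getRotationMatrix2D(center, -angle_deg / 2, 1.0), (w, h))
    return first, second

def pick_alternating_image(first, second, playback_time, frequency_hz=8.0):
    """Alternate between the first and second image at the set frequency,
    producing the vibration effect described above."""
    phase = int(playback_time * frequency_hz * 2) % 2
    return first if phase == 0 else second
```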
In some optional implementations of some embodiments, the displaying, in response to triggering of target pronunciation information in the audio information, a target image corresponding to the target pronunciation information in the to-be-processed video may include:
firstly, acquiring a real-time image of a target object corresponding to the target pronunciation information.
In practice, the object image in the video to be processed may change over time. In order to present the object image in real time, the execution subject may obtain a real-time image of the target object corresponding to the target pronunciation information. The execution subject may acquire such a real-time image at set time intervals. That is, the target object real-time image may be the object image within a set time period after the moment at which the target pronunciation information is triggered. Intuitively, the resulting real-time target image can be understood as an image obtained by enlarging the object image in real time over that set time period, whereas the target image in fig. 3 is comparatively static.
And secondly, dynamically generating a real-time target image of the real-time target object image according to the target volume amplitude of the target pronunciation information.
After the target object real-time image is obtained, the execution subject can dynamically generate the real-time target image of the real-time target object image according to the target volume amplitude of the target pronunciation information. In this way, the user can see the real-time target image, which improves the timeliness and effectiveness with which the user obtains information.
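A minimal illustrative sketch of generating such a real-time target image is shown below: the freshly captured object image is scaled by a factor that tracks the current target volume amplitude. The volume bounds and maximum magnification are assumptions of the example.

```python
import cv2
import numpy as np

def realtime_target_image(realtime_object_image, target_volume,
                          volume_bounds=(500, 8000), max_magnification=2.2):
    """Enlarge the real-time object image by a factor proportional to the
    current target volume amplitude, yielding the real-time target image."""
    low, high = volume_bounds
    ratio = float(np.clip((target_volume - low) / (high - low), 0.0, 1.0))
    magnification = 1.0 + ratio * (max_magnification - 1.0)
    return cv2.resize(realtime_object_image, None,
                      fx=magnification, fy=magnification,
                      interpolation=cv2.INTER_LINEAR)
```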
In some optional implementations of some embodiments, the displaying, in response to triggering of target pronunciation information in the audio information, a target image corresponding to the target pronunciation information in the to-be-processed video may include:
firstly, determining a connection area between a target object image corresponding to the target image and the image to be processed.
The target image is an enlarged view of the target object image. The execution subject may determine a connection region between the target object image corresponding to the target image and the image to be processed. For example, if the target object image is an ear image, the connection region may be an image region of a set size between the ear image and the head image in the image to be processed.
And secondly, performing transition processing on the connection area based on the image to be processed and the target image.
The execution subject can perform transition processing on the connection region using the image to be processed and the target image. The transition processing comprises at least one of the following: a color transition, a line transition, etc. In this way, the target image and the image to be processed are combined more naturally, which improves the display effect of the image to be processed, deepens the user's viewing experience, and improves the accuracy and effectiveness with which the user obtains information.
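Purely by way of illustration, the sketch below performs a simple colour transition over the connection region by alpha-blending a feathered border between the pasted target image and the surrounding image to be processed; the feather width is an assumption of the example, and the region is assumed to be larger than twice the feather width.

```python
import cv2
import numpy as np

def blend_connection_region(frame, target_image, bbox, feather=15):
    """Paste the target image at bbox = (x, y, w, h) and smooth the connection
    region by blending a feathered border with the surrounding frame."""
    x, y, w, h = bbox
    patch = cv2.resize(target_image, (w, h))

    # Alpha mask: 1 inside the patch, falling off to 0 over `feather` pixels.
    mask = np.zeros((h, w), dtype=np.float32)
    mask[feather:h - feather, feather:w - feather] = 1.0
    mask = cv2.GaussianBlur(mask, (2 * feather + 1, 2 * feather + 1), 0)
    mask = mask[..., None]                     # broadcast over colour channels

    roi = frame[y:y + h, x:x + w].astype(np.float32)
    blended = mask * patch.astype(np.float32) + (1.0 - mask) * roi
    frame[y:y + h, x:x + w] = blended.astype(np.uint8)
    return frame
```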
With further reference to fig. 5, as an implementation of the methods illustrated in the above figures, the present disclosure provides some embodiments of an apparatus for displaying an image, which correspond to those method embodiments illustrated in fig. 2, which may be particularly applicable in various electronic devices.
As shown in fig. 5, an apparatus 500 for displaying an image according to some embodiments includes: an information extraction unit 501, a relationship establishment unit 502, a target object setting unit 503, and an image display unit 504. The information extraction unit 501 is configured to extract audio information and an image to be processed from a video to be processed, wherein the audio information includes at least one piece of pronunciation information, and the image to be processed includes an object image corresponding to the pronunciation information in the at least one piece of pronunciation information; a relation establishing unit 502 configured to establish a corresponding relation between target pronunciation information included in the audio information and a target object image included in an image to be processed; a target object setting unit 503 configured to set a target image corresponding to each target object image included in the to-be-processed image, where the target image is an image obtained by image processing of the target object image; the image display unit 504, in response to the target pronunciation information in the audio information being triggered, is configured to display a target image corresponding to the target pronunciation information in the video to be processed.
In an optional implementation manner of some embodiments, the information extraction unit 501 may include: an object identifying subunit (not shown in the figure) and an object image acquiring subunit (not shown in the figure). Wherein the object identification subunit is configured to identify at least one object content in the image to be processed, and the object content includes at least one of the following: face image, mouth image, ear image; and the object image acquisition subunit is configured to set an outline for the object content to obtain an object image for the object content in the at least one object content.
In an optional implementation manner of some embodiments, the to-be-processed video includes a timestamp, and the relationship establishing unit 502 may include: a trigger time determining subunit (not shown), a marking subunit (not shown), and a relationship establishing subunit (not shown). Wherein the trigger time determination subunit is configured to determine, for each utterance information included in the audio information, a trigger time of the utterance information based on the time stamp; a marking subunit, configured to mark the pronunciation information as target pronunciation information and the object image as a target object image in response to the object image being deformed within a set time period after the trigger time; and the relation establishing subunit is configured to establish a corresponding relation between the target pronunciation information and the target object image.
In an optional implementation manner of some embodiments, the target object setting unit 503 may include: and a target object setting subunit (not shown in the figure) configured to amplify the target object image according to each magnification in a preset magnification sequence to obtain a target image sequence, wherein the magnifications in the magnification sequence are increased in order.
In an optional implementation manner of some embodiments, each magnification in the magnification sequence corresponds to a volume amplitude of the pronunciation information, and the image display unit 504 may include: and a first image display subunit (not shown in the figure) configured to determine a target volume amplitude of the target pronunciation information, and select a corresponding target image from the target image sequence for display based on the target volume amplitude.
In an optional implementation manner of some embodiments, the first image display subunit may include: a setting module (not shown), an image generating module (not shown), a position setting module (not shown), and a display module (not shown). Wherein the setting module, in response to the target volume amplitude being greater than a set volume threshold, is configured to set an image axis and a rotation angle for the target image; an image generation module configured to generate a first image and a second image based on the target image, the first image and the second image being identical to the target image, a first image axis included in the first image and a second image axis included in the second image respectively corresponding to an image axis of the target image; a position setting module configured to set a first image axis of the first image to coincide with one side of the rotation angle and a second image axis of the second image to coincide with the other side of the rotation angle; and the display module is configured to alternately display the first image or the second image according to a set frequency.
In an optional implementation manner of some embodiments, the image display unit 504 may include: an image acquisition subunit (not shown in the figure) and a second image display subunit (not shown in the figure). The image acquisition subunit is configured to acquire a target object real-time image corresponding to the target pronunciation information, wherein the target object real-time image is an object image within a set time period after the time when the target pronunciation information is triggered; and the second image display subunit is configured to dynamically generate a real-time target image of the real-time target object image according to the target volume amplitude of the target pronunciation information.
In an optional implementation manner of some embodiments, the image display unit 504 may include: a connection region determining subunit (not shown in the figure) and a transition processing subunit (not shown in the figure). The connection region determining subunit is configured to determine a connection region between a target object image corresponding to the target image and the image to be processed; the transition processing subunit is configured to perform transition processing on the connection region based on the image to be processed and the target image, wherein the transition processing includes at least one of the following: a color transition, a line transition.
It will be understood that the elements described in the apparatus 500 correspond to various steps in the method described with reference to fig. 2. Thus, the operations, features and resulting advantages described above with respect to the method are also applicable to the apparatus 500 and the units included therein, and are not described herein again.
As shown in fig. 6, electronic device 600 may include a processing means (e.g., central processing unit, graphics processor, etc.) 601 that may perform various appropriate actions and processes in accordance with a program stored in a Read Only Memory (ROM) 602 or a program loaded from a storage means 608 into a Random Access Memory (RAM) 603. In the RAM 603, various programs and data necessary for the operation of the electronic apparatus 600 are also stored. The processing device 601, the ROM 602, and the RAM 603 are connected to each other via a bus 604. An input/output (I/O) interface 605 is also connected to bus 604.
Generally, the following devices may be connected to the I/O interface 605: input devices 606 including, for example, a touch screen, touch pad, keyboard, mouse, camera, microphone, accelerometer, gyroscope, etc.; output devices 607 including, for example, a Liquid Crystal Display (LCD), a speaker, a vibrator, and the like; storage 608 including, for example, tape, hard disk, etc.; and a communication device 609. The communication means 609 may allow the electronic device 600 to communicate with other devices wirelessly or by wire to exchange data. While fig. 6 illustrates an electronic device 600 having various means, it is to be understood that not all illustrated means are required to be implemented or provided. More or fewer devices may alternatively be implemented or provided. Each block shown in fig. 6 may represent one device or may represent multiple devices as desired.
In particular, according to some embodiments of the present disclosure, the processes described above with reference to the flow diagrams may be implemented as computer software programs. For example, some embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method illustrated in the flow chart. In some such embodiments, the computer program may be downloaded and installed from a network through the communication device 609, or installed from the storage device 608, or installed from the ROM 602. The computer program, when executed by the processing device 601, performs the above-described functions defined in the methods of some embodiments of the present disclosure.
It should be noted that the computer readable medium described above in some embodiments of the present disclosure may be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In some embodiments of the disclosure, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In some embodiments of the present disclosure, however, a computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: electrical wires, optical cables, RF (radio frequency), etc., or any suitable combination of the foregoing.
In some embodiments, the clients and servers may communicate using any currently known or future developed network protocol, such as HTTP (HyperText Transfer Protocol), and may interconnect with any form or medium of digital data communication (e.g., a communications network). Examples of communication networks include a local area network ("LAN"), a wide area network ("WAN"), internetworks (e.g., the Internet), and peer-to-peer networks (e.g., ad hoc peer-to-peer networks), as well as any currently known or future developed network.
The computer readable medium may be embodied in the electronic device; or may exist separately without being assembled into the electronic device. The computer readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to: extracting audio information and an image to be processed from a video to be processed, wherein the audio information comprises at least one piece of pronunciation information, and the image to be processed comprises an object image corresponding to the pronunciation information in the at least one piece of pronunciation information; establishing a corresponding relation between target pronunciation information contained in the audio information and a target object image contained in the image to be processed; setting a target image corresponding to each target object image contained in the image to be processed, wherein the target image is an image obtained by performing image processing on the target object image; and in response to the triggering of the target pronunciation information in the audio information, displaying a target image corresponding to the target pronunciation information in the video to be processed.
Computer program code for carrying out operations for embodiments of the present disclosure may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, Smalltalk, C++, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units described in some embodiments of the present disclosure may be implemented by software, and may also be implemented by hardware. The described units may also be provided in a processor, and may be described as: a processor includes an information extraction unit, a relationship establishment unit, a target object setting unit, and an image display unit. Here, the names of these units do not constitute a limitation of the unit itself in some cases, and for example, the image display unit may also be described as "a unit for displaying a target image".
The functions described herein above may be performed, at least in part, by one or more hardware logic components. For example, without limitation, exemplary types of hardware logic components that may be used include: field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), systems on a chip (SOCs), Complex Programmable Logic Devices (CPLDs), and the like.
According to one or more embodiments of the present disclosure, there is provided a method of displaying an image, including: extracting audio information and an image to be processed from a video to be processed, wherein the audio information comprises at least one piece of pronunciation information, and the image to be processed comprises an object image corresponding to the pronunciation information in the at least one piece of pronunciation information; establishing a corresponding relation between target pronunciation information contained in the audio information and a target object image contained in the image to be processed; setting a target image corresponding to each target object image contained in the image to be processed, wherein the target image is an image obtained by performing image processing on the target object image; and in response to the triggering of the target pronunciation information in the audio information, displaying a target image corresponding to the target pronunciation information in the video to be processed.
According to one or more embodiments of the present disclosure, the extracting audio information and an image to be processed from a video to be processed includes: identifying at least one object content in the image to be processed, wherein the object content comprises at least one of the following items: face image, mouth image, ear image; and setting a contour line for the object content in the at least one object content to obtain an object image.
According to one or more embodiments of the present disclosure, the creating a correspondence between the target pronunciation information included in the audio information and the target object image included in the image to be processed includes: determining a trigger time of the pronunciation information based on the time stamp for each pronunciation information included in the audio information; in response to the object image which is deformed within a set time period after the trigger time, marking the pronunciation information as target pronunciation information, and marking the object image as a target object image; and establishing a corresponding relation between the target pronunciation information and the target object image.
According to one or more embodiments of the present disclosure, the setting a target image corresponding to each target object image included in the image to be processed includes: and amplifying the target object image according to each amplification factor in a preset amplification factor sequence to obtain a target image sequence, wherein the amplification factors in the amplification factor sequence are increased progressively according to the sequence.
According to one or more embodiments of the present disclosure, each magnification in the magnification sequence corresponds to a volume amplitude of pronunciation information, and, in response to target pronunciation information in the audio information being triggered, displaying a target image corresponding to the target pronunciation information in the video to be processed includes: and determining a target volume amplitude of the target pronunciation information, and selecting a corresponding target image from the target image sequence to display based on the target volume amplitude.
According to one or more embodiments of the present disclosure, the selecting, based on the target volume amplitude, a corresponding target image from the target image sequence for display includes: setting an image axis and a rotation angle for the target image in response to the target volume amplitude being greater than a set volume threshold; generating a first image and a second image based on the target image, the first image and the second image being identical to the target image, a first image axis included in the first image and a second image axis included in the second image corresponding to an image axis of the target image, respectively; setting a first image axis of the first image to coincide with one side of the rotation angle and a second image axis of the second image to coincide with the other side of the rotation angle; and alternately displaying the first image or the second image according to a set frequency.
According to one or more embodiments of the present disclosure, the displaying a target image corresponding to target pronunciation information in the to-be-processed video in response to the target pronunciation information being triggered in the audio information includes: acquiring a target object real-time image corresponding to the target pronunciation information, wherein the target object real-time image is an object image in a set time period after the moment when the target pronunciation information is triggered; and dynamically generating a real-time target image of the real-time target object image according to the target volume amplitude of the target pronunciation information.
According to one or more embodiments of the present disclosure, the displaying, in response to the target pronunciation information in the audio information being triggered, of a target image corresponding to the target pronunciation information in the video to be processed includes: determining a connection area between the target object image corresponding to the target image and the image to be processed; and performing transition processing on the connection area based on the image to be processed and the target image, wherein the transition processing includes at least one of the following: color transition, line transition.
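The transition processing on the connection area can be sketched as a simple colour transition: a narrow band around the pasted target image is alpha-blended between the enlarged image and the underlying frame so the seam is not abrupt. The band width and the linear ramp are assumptions; the sketch also assumes the pasted patch lies fully inside the frame.

```python
# Hypothetical sketch of a colour transition on the connection area: blend a
# border band between the frame (image to be processed) and the pasted,
# enlarged target image.
import numpy as np

def blend_connection_area(frame, target_image, top_left, band=8):
    out = frame.astype(np.float32).copy()
    y0, x0 = top_left
    h, w = target_image.shape[:2]                # assumed to fit inside the frame
    patch = target_image.astype(np.float32)
    for dy in range(h):
        for dx in range(w):
            # Distance to the nearest patch edge controls the blend weight:
            # 0 at the seam (pure background), 1 deep inside (pure target image).
            edge_dist = min(dy, dx, h - 1 - dy, w - 1 - dx)
            alpha = min(edge_dist / band, 1.0)
            out[y0 + dy, x0 + dx] = (alpha * patch[dy, dx]
                                     + (1.0 - alpha) * out[y0 + dy, x0 + dx])
    return out.astype(np.uint8)
```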
According to one or more embodiments of the present disclosure, there is provided an apparatus for displaying an image, including: an information extraction unit configured to extract audio information and an image to be processed from a video to be processed, the audio information including at least one piece of pronunciation information, and the image to be processed including an object image corresponding to the pronunciation information in the at least one piece of pronunciation information; a relation establishing unit configured to establish a correspondence between target pronunciation information contained in the audio information and a target object image contained in the image to be processed; a target object setting unit configured to set a target image corresponding to each target object image contained in the image to be processed, the target image being an image obtained by performing image processing on the target object image; and an image display unit configured to display, in response to the target pronunciation information in the audio information being triggered, a target image corresponding to the target pronunciation information in the video to be processed.
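Reading the four units together, a toy wiring of the apparatus could look like the following; every class name and call here is illustrative only and not the disclosed implementation.

```python
# Hypothetical sketch of how the four units could be composed into one pipeline.
class DisplayImageApparatus:
    def __init__(self, extractor, matcher, target_builder, display):
        self.extractor = extractor              # information extraction unit
        self.matcher = matcher                  # relation establishing unit
        self.target_builder = target_builder    # target object setting unit
        self.display = display                  # image display unit

    def run(self, video):
        audio, frames = self.extractor(video)
        correspondence = self.matcher(audio, frames)   # {pron_id: object_id}
        targets = {obj_id: self.target_builder(obj_id, frames)
                   for obj_id in correspondence.values()}
        for pron_id, obj_id in correspondence.items():
            self.display(pron_id, targets[obj_id])
```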
According to one or more embodiments of the present disclosure, the information extraction unit includes: an object identification subunit configured to identify at least one piece of object content in the image to be processed, the object content including at least one of the following: a face image, a mouth image, and an ear image; and an object image acquisition subunit configured to, for each piece of object content in the at least one piece of object content, set a contour line for the object content to obtain an object image.
According to one or more embodiments of the present disclosure, the video to be processed includes a timestamp, and the relation establishing unit includes: a trigger time determining subunit configured to determine, for each piece of pronunciation information contained in the audio information, a trigger time of the pronunciation information based on the timestamp; a marking subunit configured to mark the pronunciation information as target pronunciation information and the object image as a target object image in response to there being an object image that deforms within a set time period after the trigger time; and a relation establishing subunit configured to establish a correspondence between the target pronunciation information and the target object image.
According to one or more embodiments of the present disclosure, the target object setting unit includes: a target object setting subunit configured to amplify the target object image according to each amplification factor in a preset amplification factor sequence to obtain a target image sequence, wherein the amplification factors in the amplification factor sequence increase progressively in order.
According to one or more embodiments of the present disclosure, each amplification factor in the amplification factor sequence corresponds to a volume amplitude of the pronunciation information, and the image display unit includes: a first image display subunit configured to determine a target volume amplitude of the target pronunciation information and select a corresponding target image from the target image sequence for display based on the target volume amplitude.
According to one or more embodiments of the present disclosure, the first image display subunit includes: a setting module configured to set an image axis and a rotation angle for the target image in response to the target volume amplitude being greater than a set volume threshold; an image generation module configured to generate a first image and a second image based on the target image, the first image and the second image being identical to the target image, and a first image axis contained in the first image and a second image axis contained in the second image each corresponding to the image axis of the target image; a position setting module configured to set the first image axis of the first image to coincide with one side of the rotation angle and the second image axis of the second image to coincide with the other side of the rotation angle; and a display module configured to alternately display the first image and the second image at a set frequency.
According to one or more embodiments of the present disclosure, the image display unit includes: an image acquisition subunit configured to acquire a target object real-time image corresponding to the target pronunciation information, wherein the target object real-time image is the object image within a set time period after the time at which the target pronunciation information is triggered; and a second image display subunit configured to dynamically generate a real-time target image from the target object real-time image according to the target volume amplitude of the target pronunciation information.
According to one or more embodiments of the present disclosure, the image display unit includes: a connection area determining subunit configured to determine a connection area between the target object image corresponding to the target image and the image to be processed; and a transition processing subunit configured to perform transition processing on the connection area based on the image to be processed and the target image, wherein the transition processing includes at least one of the following: color transition, line transition.
The foregoing description is merely a description of the preferred embodiments of the present disclosure and of the technical principles employed. It will be appreciated by those skilled in the art that the scope of the invention involved in the embodiments of the present disclosure is not limited to technical solutions formed by the specific combination of the above technical features, and also covers other technical solutions formed by any combination of the above technical features or their equivalents without departing from the above inventive concept, for example, technical solutions formed by replacing the above features with (but not limited to) technical features having similar functions disclosed in the embodiments of the present disclosure.

Claims (18)

1. A method of displaying an image, comprising:
extracting audio information and an image to be processed from a video to be processed, wherein the audio information comprises at least one piece of pronunciation information, and the image to be processed comprises an object image corresponding to the pronunciation information in the at least one piece of pronunciation information;
establishing a corresponding relation between target pronunciation information contained in the audio information and a target object image contained in the image to be processed;
setting a target image corresponding to each target object image contained in the image to be processed, wherein the target image is an image obtained after image processing is carried out on the target object image;
and in response to the triggering of the target pronunciation information in the audio information, displaying a target image corresponding to the target pronunciation information in the video to be processed.
2. The method of claim 1, wherein said extracting audio information and images to be processed from video to be processed comprises:
identifying at least one object content in the image to be processed, the object content comprising at least one of: face image, mouth image, ear image;
and setting a contour line for the object content in the at least one object content to obtain an object image.
3. The method of claim 1, wherein the video to be processed includes a timestamp, and
the establishing of the corresponding relation between the target pronunciation information contained in the audio information and the target object image contained in the image to be processed includes:
for each pronunciation information contained in the audio information, determining a trigger time of the pronunciation information based on the time stamp;
in response to the existence of an object image which deforms within a set time period after the trigger time, marking the pronunciation information as target pronunciation information, and marking the object image as a target object image;
and establishing a corresponding relation between the target pronunciation information and the target object image.
4. The method according to claim 1, wherein the setting of the target image corresponding to each target object image included in the image to be processed comprises:
and amplifying the target object image according to each amplification factor in a preset amplification factor sequence to obtain a target image sequence, wherein the amplification factors in the amplification factor sequence are increased progressively according to the sequence.
5. The method of claim 4, wherein each magnification in the sequence of magnifications corresponds to a volume magnitude of the pronunciation information, an
The displaying a target image corresponding to the target pronunciation information in the video to be processed in response to the target pronunciation information in the audio information being triggered comprises:
and determining a target volume amplitude of the target pronunciation information, and selecting a corresponding target image from the target image sequence to display based on the target volume amplitude.
6. The method of claim 5, wherein the selecting for display a corresponding target image from the sequence of target images based on the target volume magnitude comprises:
setting an image axis and a rotation angle for the target image in response to the target volume amplitude being greater than a set volume threshold;
generating a first image and a second image based on the target image, wherein the first image and the second image are the same as the target image, and a first image axis contained in the first image and a second image axis contained in the second image respectively correspond to an image axis of the target image; setting a first image axis of the first image to coincide with one side of the rotation angle, and setting a second image axis of the second image to coincide with the other side of the rotation angle;
and alternately displaying the first image or the second image according to a set frequency.
7. The method of claim 5, wherein the displaying a target image corresponding to target pronunciation information in the video to be processed in response to the target pronunciation information being triggered comprises:
acquiring a target object real-time image corresponding to the target pronunciation information, wherein the target object real-time image is an object image in a set time period after the moment when the target pronunciation information is triggered;
and dynamically generating a real-time target image of the real-time target object image according to the target volume amplitude of the target pronunciation information.
8. The method according to any one of claims 1 to 7, wherein the displaying a target image corresponding to target pronunciation information in the video to be processed in response to the target pronunciation information in the audio information being triggered comprises:
determining a connection area between a target object image corresponding to the target image and the image to be processed;
performing transition processing on the connection region based on the image to be processed and the target image, wherein the transition processing comprises at least one of the following: color transition, line transition.
9. An apparatus for displaying an image, comprising:
an information extraction unit configured to extract audio information and an image to be processed from a video to be processed, the audio information including at least one pronunciation information, the image to be processed including an object image corresponding to the pronunciation information in the at least one pronunciation information;
the relation establishing unit is configured to establish a corresponding relation between target pronunciation information contained in the audio information and a target object image contained in the image to be processed;
a target object setting unit configured to set a target image corresponding to each target object image included in the image to be processed, the target image being an image obtained by image processing of the target object image;
and an image display unit configured to display, in response to the target pronunciation information in the audio information being triggered, a target image corresponding to the target pronunciation information in the video to be processed.
10. The apparatus of claim 9, wherein the information extraction unit comprises:
an object identification subunit configured to identify at least one object content in the image to be processed, the object content comprising at least one of: face image, mouth image, ear image;
and an object image acquisition subunit configured to, for each piece of object content in the at least one piece of object content, set a contour line for the object content to obtain an object image.
11. The apparatus of claim 9, wherein the video to be processed comprises a timestamp, and
the relationship establishing unit includes:
a trigger time determining subunit configured to determine, for each piece of pronunciation information contained in the audio information, a trigger time of the pronunciation information based on the timestamp;
a marking subunit, configured to mark the pronunciation information as target pronunciation information and the object image as a target object image in response to the existence of the object image deformed within a set time period after the trigger time;
a relationship establishing subunit configured to establish a correspondence relationship between the target pronunciation information and a target object image.
12. The apparatus of claim 9, wherein the target object setting unit comprises:
and the target object setting subunit is configured to amplify the target object image according to each amplification factor in a preset amplification factor sequence to obtain a target image sequence, wherein the amplification factors in the amplification factor sequence are increased in sequence.
13. The apparatus of claim 12, wherein each magnification in the sequence of magnifications corresponds to a volume magnitude of the pronunciation information, and
the image display unit includes:
and the first image display subunit is configured to determine a target volume amplitude of the target pronunciation information, and select a corresponding target image from the target image sequence for display based on the target volume amplitude.
14. The apparatus of claim 13, wherein the first image display subunit comprises:
a setting module configured to set an image axis and a rotation angle for the target image in response to the target volume amplitude being greater than a set volume threshold;
an image generation module configured to generate a first image and a second image based on the target image, the first image and the second image being identical to the target image, a first image axis included in the first image and a second image axis included in the second image respectively corresponding to an image axis of the target image; a position setting module configured to set a first image axis of the first image to coincide with one side of the rotation angle, and a second image axis of the second image to coincide with the other side of the rotation angle;
a display module configured to alternately display the first image or the second image according to a set frequency.
15. The apparatus of claim 13, wherein the image display unit comprises:
the image acquisition subunit is configured to acquire a target object real-time image corresponding to the target pronunciation information, wherein the target object real-time image is an object image within a set time period after the time when the target pronunciation information is triggered;
a second image display subunit configured to dynamically generate a real-time target image from the target object real-time image according to a target volume amplitude of the target pronunciation information.
16. The apparatus according to any one of claims 9 to 15, wherein the image display unit comprises:
a connection region determination subunit configured to determine a connection region between a target object image corresponding to the target image and the image to be processed;
a transition processing subunit configured to perform transition processing on the connection region based on the image to be processed and the target image, the transition processing comprising at least one of the following: color transition, line transition.
17. An electronic device, comprising:
one or more processors;
a storage device having one or more programs stored thereon,
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of any one of claims 1 to 8.
18. A computer-readable medium, on which a computer program is stored, wherein the program, when executed by a processor, implements the method of any one of claims 1 to 8.
CN202010929638.9A 2020-09-07 2020-09-07 Method, apparatus, electronic device and computer readable medium for displaying image Pending CN112052358A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010929638.9A CN112052358A (en) 2020-09-07 2020-09-07 Method, apparatus, electronic device and computer readable medium for displaying image

Publications (1)

Publication Number Publication Date
CN112052358A (en) 2020-12-08

Family

ID=73610367

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010929638.9A Pending CN112052358A (en) 2020-09-07 2020-09-07 Method, apparatus, electronic device and computer readable medium for displaying image

Country Status (1)

Country Link
CN (1) CN112052358A (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107155138A (en) * 2017-06-06 2017-09-12 Shenzhen TCL Digital Technology Co., Ltd. Video playback jump method, equipment and computer-readable recording medium
CN107895006A (en) * 2017-11-07 2018-04-10 Guangdong OPPO Mobile Telecommunications Corp., Ltd. Audio frequency playing method, device, storage medium and electronic equipment
CN109121022A (en) * 2018-09-28 2019-01-01 Baidu Online Network Technology (Beijing) Co., Ltd. Method and device for marking video segment
CN110070896A (en) * 2018-10-19 2019-07-30 Beijing Microlive Vision Technology Co., Ltd. Image processing method, device, hardware device
CN110704683A (en) * 2019-09-27 2020-01-17 Shenzhen SenseTime Technology Co., Ltd. Audio and video information processing method and device, electronic equipment and storage medium

Similar Documents

Publication Publication Date Title
US11158102B2 (en) Method and apparatus for processing information
KR101193668B1 (en) Foreign language acquisition and learning service providing method based on context-aware using smart device
US20170243582A1 (en) Hearing assistance with automated speech transcription
EP4099709A1 (en) Data processing method and apparatus, device, and readable storage medium
CN113946211A (en) Method for interacting multiple objects based on metauniverse and related equipment
EP4117313A1 (en) Audio processing method and apparatus, readable medium, and electronic device
CN111916039A (en) Music file processing method, device, terminal and storage medium
CN111897976A (en) Virtual image synthesis method and device, electronic equipment and storage medium
CN112286610A (en) Interactive processing method and device, electronic equipment and storage medium
US10891959B1 (en) Voice message capturing system
CN112364144B (en) Interaction method, device, equipment and computer readable medium
CN111862705A (en) Method, device, medium and electronic equipment for prompting live broadcast teaching target
CN111260975A (en) Method, device, medium and electronic equipment for multimedia blackboard teaching interaction
CN112383721B (en) Method, apparatus, device and medium for generating video
CN115220682A (en) Method and device for driving virtual portrait by audio and electronic equipment
CN112672207A (en) Audio data processing method and device, computer equipment and storage medium
CN112381926A (en) Method and apparatus for generating video
CN112052358A (en) Method, apparatus, electronic device and computer readable medium for displaying image
JP6065768B2 (en) Information processing apparatus, information processing method, and program
US20230056862A1 (en) Hearing device, and method for adjusting hearing device
RU2660600C2 (en) Method of communication between deaf (hard-of-hearing) and hearing
CN111445925A (en) Method and apparatus for generating difference information
CN111785104B (en) Information processing method and device and electronic equipment
CN114299950B (en) Subtitle generation method, device and equipment
CN112383722B (en) Method and apparatus for generating video

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: 100041 B-0035, 2 floor, 3 building, 30 Shixing street, Shijingshan District, Beijing.

Applicant after: Douyin Vision Co.,Ltd.

Address before: 100041 B-0035, 2 floor, 3 building, 30 Shixing street, Shijingshan District, Beijing.

Applicant before: Tiktok vision (Beijing) Co.,Ltd.

Address after: 100041 B-0035, 2 floor, 3 building, 30 Shixing street, Shijingshan District, Beijing.

Applicant after: Tiktok vision (Beijing) Co.,Ltd.

Address before: 100041 B-0035, 2 floor, 3 building, 30 Shixing street, Shijingshan District, Beijing.

Applicant before: BEIJING BYTEDANCE NETWORK TECHNOLOGY Co.,Ltd.