CN113392272A - Method and device for voice marking of pictures and videos - Google Patents

Method and device for voice marking of pictures and videos

Info

Publication number
CN113392272A
Authority
CN
China
Prior art keywords
voice
interface
picture
text
mark
Prior art date
Legal status
Pending
Application number
CN202010167913.8A
Other languages
Chinese (zh)
Inventor
王中 (Wang Zhong)
Current Assignee
Alibaba Group Holding Ltd
Original Assignee
Alibaba Group Holding Ltd
Priority date
Filing date
Publication date
Application filed by Alibaba Group Holding Ltd
Priority to CN202010167913.8A
Priority to PCT/CN2021/080145 (WO2021180155A1)
Publication of CN113392272A
Legal status: Pending

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/70 - Information retrieval of video data
    • G06F 16/78 - Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F 16/7867 - Retrieval using information manually generated, e.g. tags, keywords, comments, title and artist information, manually generated time, location and usage information, user ratings
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/22 - Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G11 - INFORMATION STORAGE
    • G11B - INFORMATION STORAGE BASED ON RELATIVE MOVEMENT BETWEEN RECORD CARRIER AND TRANSDUCER
    • G11B 27/00 - Editing; Indexing; Addressing; Timing or synchronising; Monitoring; Measuring tape travel
    • G11B 27/02 - Editing, e.g. varying the order of information signals recorded on, or reproduced from, record carriers
    • G11B 27/031 - Electronic editing of digitised analogue information signals, e.g. audio or video signals
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04N - PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 21/00 - Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N 21/80 - Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
    • H04N 21/83 - Generation or processing of protective or descriptive data associated with content; Content structuring
    • H04N 21/84 - Generation or processing of descriptive data, e.g. content descriptors

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Library & Information Science (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • User Interface Of Digital Computer (AREA)
  • Television Signal Processing For Recording (AREA)

Abstract

An embodiment of the present specification provides a method for voice tagging a picture. The method includes: first, displaying a recording interface including a target picture; next, in response to a recording start instruction issued based on the recording interface, continuously acquiring a voice signal; then, in response to a recording end instruction issued based on the recording interface, storing the acquired voice signal as an audio file; and finally, adding a voice mark associated with the audio file on the target picture.

Description

Method and device for voice marking of pictures and videos
Technical Field
One or more embodiments of the present disclosure relate to the field of computer technologies, and in particular to methods and apparatuses for voice tagging a picture, voice tagging a video, text tagging a picture, text tagging a video, viewing a picture, and viewing a video.
Background
In many scenarios, specific content in a picture or video needs to be explained. For example, to introduce how a website is used, an operation icon in a screenshot of a web page needs to be described. As another example, to introduce a product, the components of the product need to be described with reference to a picture of the product.
However, the ways currently available to users for explaining specific content in a picture or video are limited and carry a relatively high operation cost; for example, a user may use picture editing software to add explanatory text in a text box.
Therefore, a solution is needed that makes it more convenient to add explanations based on the content of a picture or video, so as to improve the user experience.
Disclosure of Invention
One or more embodiments of the present disclosure describe a method for voice tagging a picture: a voice tagging function that allows a recording to be made at any position on the picture, so that the picture can be interpreted and explained conveniently and quickly.
According to a first aspect, there is provided a method of voice tagging a picture, the method comprising: displaying a recording interface comprising a target picture; continuously acquiring voice signals in response to a recording start instruction sent based on the recording interface; responding to a recording ending instruction sent out based on the recording interface, and storing the collected voice signals as audio files; and adding a voice mark associated with the audio file on the target picture.
According to a second aspect, there is provided a picture viewing method, the method comprising: displaying a picture with a voice tag, wherein the voice tag is added to the picture by the method provided by the first aspect; and responding to a triggering instruction of the voice mark, and playing an audio file associated with the voice mark.
According to a third aspect, there is provided a method of voice tagging video, the method comprising: displaying a recording interface comprising a first video, the first video comprising a first video frame; responding to a selection instruction of the first video frame, and determining the first video frame as a target video frame; continuously acquiring voice signals in response to a recording start instruction sent based on the recording interface; responding to a recording ending instruction sent out based on the recording interface, and storing the collected voice signals as audio files; and adding a voice mark associated with the audio file on the target video frame.
According to a fourth aspect, there is provided a video viewing method, the method comprising: displaying a video with a voice tag added to the video by the method provided by the third aspect, wherein the video comprises a first video frame with a first voice tag; and responding to a triggering instruction of the first voice mark, and playing an audio file associated with the first voice mark.
According to a fifth aspect, there is provided a method of text tagging a picture, the method comprising: displaying an editing interface for a target picture; responding to a recording start instruction issued based on the editing interface, and continuously acquiring voice signals; responding to a recording end instruction issued based on the editing interface, and performing voice recognition on the collected voice signal to obtain a recognition text; and adding a text mark associated with the recognition text on the target picture.
According to a sixth aspect, there is provided a picture viewing method, the method comprising: displaying a picture with text marks, wherein the text marks are added to the picture by the method provided by the fifth aspect; and displaying the recognition text associated with the text mark in response to a triggering instruction of the text mark.
According to a seventh aspect, there is provided a method of text tagging a video, the method comprising: displaying an editing interface comprising a first video, the first video comprising a first video frame; responding to a selection instruction of the first video frame, and determining the first video frame as a target video frame; responding to a recording start instruction issued based on the editing interface, and continuously acquiring voice signals; responding to a recording end instruction issued based on the editing interface, and performing voice recognition on the collected voice signal to obtain a recognition text; and adding a text mark associated with the recognition text on the target video frame.
According to an eighth aspect, there is provided a video viewing method, the method comprising: displaying a video with a text mark, wherein the text mark is added to the video by the method provided by the seventh aspect, and the video comprises a first video frame with a first text mark; and in response to a triggering instruction of the first text mark, displaying the recognition text associated with the first text mark.
According to a ninth aspect, there is provided an apparatus for voice tagging a picture, the apparatus comprising: the display unit is configured to display a recording interface comprising a target picture; the acquisition unit is configured to respond to a recording start instruction sent out based on the recording interface and continuously acquire voice signals; the storage unit is configured to respond to a recording ending instruction sent out based on the recording interface and store the collected voice signals as audio files; an adding unit configured to add a voice tag associated with the audio file on the target picture.
According to a tenth aspect, there is provided a picture viewing apparatus comprising: a display unit configured to display a picture with a voice tag added to the picture by the apparatus provided in the ninth aspect; and the playing unit is configured to respond to a triggering instruction of the voice mark and play the audio file associated with the voice mark.
According to an eleventh aspect, there is provided an apparatus for voice tagging video, the apparatus comprising: a display unit configured to display a recording interface including a first video, the first video including a first video frame; the determining unit is configured to respond to a selection instruction of the first video frame, and determine the first video frame as a target video frame; the acquisition unit is configured to respond to a recording start instruction sent out based on the recording interface and continuously acquire voice signals; the storage unit is configured to respond to a recording ending instruction sent out based on the recording interface and store the collected voice signals as audio files; an adding unit configured to add a voice tag associated with the audio file on the target video frame.
According to a twelfth aspect, there is provided a video viewing device, the device comprising: a display unit configured to display a video with a voice tag added to the video by the apparatus provided in the eleventh aspect, wherein the video includes a first video frame with a first voice tag; and the playing unit is configured to respond to a triggering instruction of the first voice mark and play the audio file associated with the first voice mark.
According to a thirteenth aspect, there is provided an apparatus for text marking of a picture, the apparatus comprising: a display unit configured to display an editing interface for a target picture; the acquisition unit is configured to respond to a recording starting instruction sent out based on the editing interface and continuously acquire voice signals; the recognition unit is configured to respond to a recording ending instruction sent out based on the editing interface, perform voice recognition on the collected voice signal and obtain a recognition text; an adding unit configured to add a text label associated with the recognition text on the target picture.
According to a fourteenth aspect, there is provided a picture viewing apparatus, the apparatus comprising: a display unit configured to display a picture with text marks added to the picture by the apparatus provided in the thirteenth aspect; the display unit is further configured to, in response to a triggering instruction of a text mark, display the recognition text associated with the text mark.
According to a fifteenth aspect, there is provided an apparatus for text marking a video, the apparatus comprising: a display unit configured to display an editing interface including a first video, the first video including a first video frame; the determining unit is configured to respond to a selection instruction of the first video frame, and determine the first video frame as a target video frame; the acquisition unit is configured to respond to a recording starting instruction sent out based on the editing interface and continuously acquire voice signals; the recognition unit is configured to respond to a recording ending instruction sent out based on the editing interface, perform voice recognition on the collected voice signal and obtain a recognition text; an adding unit configured to add a text label associated with the recognition text on the target video frame.
According to a sixteenth aspect, there is provided a video viewing device, the device comprising: a display unit configured to display a video with a text mark, wherein the text mark is added to the video through the apparatus provided in the fifteenth aspect, and the video comprises a first video frame with a first text mark; the display unit is further configured to respond to a trigger instruction of the first text mark and display the recognition text associated with the first text mark.
According to a seventeenth aspect, there is provided a picture processing method comprising: displaying a chat interface, and receiving a target picture to be sent selected based on the chat interface; responding to an editing instruction aiming at the target picture, entering a picture editing interface, wherein a voice mark icon is displayed; responding to a triggering instruction of the voice mark icon, and entering a recording interface; and storing the voice signals collected based on the recording interface as an audio file, and adding a voice mark associated with the audio file on the target picture.
According to an eighteenth aspect, there is provided a picture processing method, including: displaying a chat interface, wherein a chat window of the chat interface comprises a target picture; responding to a trigger instruction aiming at the target picture, and displaying a menu bar, wherein the menu bar comprises a voice mark icon; responding to a triggering instruction of the voice mark icon, and entering a recording interface; and storing the voice signals collected based on the recording interface as an audio file, and adding a voice mark associated with the audio file on the target picture.
According to a nineteenth aspect, there is provided a picture processing method comprising: displaying a chat interface, wherein a chat window of the chat interface comprises a target picture with a first voice mark, and the first voice mark is added by a current contact corresponding to the chat window; responding to a trigger instruction of the first voice mark, and displaying a menu bar, wherein the menu bar comprises a voice reply icon; responding to a triggering instruction of the voice reply icon, and entering a recording interface; storing the voice signals collected based on the recording interface as an audio file, and adding a second voice mark related to the audio file in an area adjacent to the first voice mark in the target picture; or adding the voice signal acquired based on the recording interface into the audio file corresponding to the first voice mark.
According to a twentieth aspect, there is provided a picture processing method comprising: displaying a picture editing interface containing a target picture, wherein a function menu of the picture editing interface comprises a voice mark icon; responding to a triggering instruction of the voice mark icon, and entering a voice mark interface; and converting the input text received based on the voice mark interface into an audio file, and adding a voice mark related to the audio file on the target picture.
According to a twenty-first aspect, there is provided a picture processing method, comprising: displaying a chat interface containing a target picture; responding to a trigger instruction of the target picture, and displaying a menu bar, wherein the menu bar comprises a voice emoticon icon; responding to a triggering instruction of the voice emoticon icon, and entering a voice emoticon interface; converting the voice signals collected based on the voice emoticon interface into text, and generating an animated emoticon based on the text; and adding the animated emoticon to the target picture.
According to a twenty-second aspect, there is provided a picture processing method, the execution subject of the method being an e-commerce platform, the method comprising: displaying a commodity information editing interface, wherein the commodity information editing interface comprises a target picture for a target commodity; responding to a trigger instruction of the target picture, and displaying a menu bar, wherein the menu bar comprises a voice mark icon; responding to a triggering instruction of the voice mark icon, and entering a recording interface; and storing the voice signals collected based on the recording interface as an audio file, and adding a voice mark associated with the audio file on the target picture.
According to a twenty-third aspect, there is provided a picture processing method comprising: displaying an order evaluation interface for the first order, wherein the order evaluation interface comprises an add-picture icon; responding to a trigger instruction for the add-picture icon, and receiving a selected target picture; responding to a voice marking instruction issued for the target picture, and entering a recording interface; and storing the voice signal acquired based on the recording interface as an audio file, and adding a voice mark associated with the audio file to the target picture.
According to a twenty-fourth aspect, there is provided a picture processing method comprising: displaying a commodity evaluation interface aiming at a target commodity, wherein the commodity evaluation interface comprises first user evaluation, and the first user evaluation comprises a target picture with a first voice mark; responding to a trigger instruction aiming at the target picture, and displaying a menu bar, wherein the menu bar comprises a voice reply icon; responding to a triggering instruction of the voice reply icon, and entering a recording interface; storing the voice signals collected based on the recording interface as an audio file, and adding a second voice mark related to the audio file in an area adjacent to the first voice mark in the target picture; or adding the voice signal acquired based on the recording interface into the audio file corresponding to the first voice mark.
According to a twenty-fifth aspect, there is provided a picture processing method, an execution subject of the method being a customer service platform, the method comprising: receiving a conversation message sent by a user, wherein the conversation message comprises a target picture with a first voice mark; acquiring an audio file associated with the first voice mark, and performing voice recognition on the audio file to obtain a recognition text; inputting the recognition text into a pre-trained user question prediction model, and outputting a corresponding user standard question; and feeding back the question answers corresponding to the user standard questions to the user.
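For illustration only, the pipeline of this twenty-fifth aspect is sketched below in TypeScript. The "/asr" and "/predict-question" endpoints are hypothetical and not part of this specification, which only fixes the order of the steps: voice recognition of the mark's audio, a pre-trained user question prediction model, and feedback of the matched answer.

```typescript
// Sketch of the customer-service pipeline of the twenty-fifth aspect.
// NOTE: the "/asr" and "/predict-question" endpoints are hypothetical;
// the specification only fixes the order: voice recognition of the
// voice mark's audio, question prediction, then the answer is fed back.
async function handleSessionMessage(markAudio: Blob): Promise<string> {
  const asrResp = await fetch("/asr", { method: "POST", body: markAudio });
  const { text } = await asrResp.json(); // the recognition text

  const predictResp = await fetch("/predict-question", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ text }),
  });
  const { answer } = await predictResp.json(); // answer for the standard question
  return answer; // fed back to the user
}
```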
According to a twenty-sixth aspect, there is provided a method of processing an electronic file, comprising: displaying a file processing interface aiming at a target electronic file, wherein a function menu bar of the file processing interface comprises a voice mark icon; responding to a triggering instruction of the voice mark icon, and entering a recording interface; and storing the voice signal acquired based on the recording interface as an audio file, and adding a voice mark associated with the audio file to the target electronic file.
According to a twenty-seventh aspect, there is provided a picture processing method, an execution subject of the method being a live platform, the method including: displaying a commodity information editing interface, wherein the commodity information editing interface comprises a target picture of a target commodity to be placed on a shelf; responding to a trigger instruction of the target picture, and displaying a menu bar, wherein the menu bar comprises a voice mark icon; responding to a triggering instruction of the voice mark icon, and entering a recording interface; and storing the voice signals collected based on the recording interface as an audio file, and adding a voice mark associated with the audio file on the target picture.
According to a twenty-eighth aspect, there is provided a picture processing apparatus comprising: a display unit configured to display a chat interface; the receiving unit is configured to receive a target picture to be sent, which is selected based on the chat interface; the first interface switching unit is configured to respond to an editing instruction aiming at the target picture, enter a picture editing interface, and display a voice mark icon; the second interface switching unit is configured to respond to a triggering instruction of the voice mark icon and enter a recording interface; and the marking unit is configured to store the voice signals collected based on the recording interface as audio files and add voice marks related to the audio files on the target picture.
According to a twenty-ninth aspect, there is provided a picture processing apparatus comprising: the interface display unit is configured to display a chat interface, and a chat window of the chat interface comprises a target picture; the menu bar display unit is configured to respond to a trigger instruction aiming at the target picture and display a menu bar, and the menu bar comprises a voice mark icon; the interface switching unit is configured to respond to a triggering instruction of the voice mark icon and enter a recording interface; and the marking unit is configured to store the voice signals collected based on the recording interface as audio files and add voice marks related to the audio files on the target picture.
According to a thirtieth aspect, there is provided a picture processing apparatus comprising: the interface display unit is configured to display a chat interface, a chat window of the chat interface comprises a target picture with a first voice mark, and the first voice mark is added by a current contact corresponding to the chat window; the menu bar display unit is configured to respond to a trigger instruction of the first voice mark and display a menu bar, and the menu bar comprises a voice reply icon; the interface switching unit is configured to respond to a trigger instruction of the voice reply icon and enter a recording interface; the marking unit is configured to store the voice signals collected based on the recording interface as audio files, and add second voice marks related to the audio files in the area, adjacent to the first voice marks, of the target picture; or adding the voice signal acquired based on the recording interface into the audio file corresponding to the first voice mark.
According to a thirty-first aspect, there is provided a picture processing apparatus, the apparatus comprising: the display unit is configured to display a picture editing interface containing a target picture, and a function menu of the picture editing interface comprises a voice mark icon; the interface switching unit is configured to respond to a triggering instruction of the voice mark icon and enter a voice mark interface; and the marking unit is configured to convert the input text received based on the voice marking interface into an audio file and add a voice mark related to the audio file on the target picture.
According to a thirty-second aspect, there is provided a picture processing apparatus comprising: the interface display unit is configured to display a chat interface containing the target picture; the menu bar display unit is configured to respond to a trigger instruction of the target picture and display a menu bar, and the menu bar comprises a voice emoticon icon; the interface switching unit is configured to respond to a triggering instruction of the voice emoticon icon and enter a voice emoticon interface; the emoticon generating unit is configured to convert the voice signals collected based on the voice emoticon interface into text and generate an animated emoticon based on the text; and the emoticon adding unit is configured to add the animated emoticon to the target picture.
According to a thirty-third aspect, there is provided a picture processing apparatus, the apparatus being integrated with an e-commerce platform, the apparatus comprising: the interface display unit is configured to display a commodity information editing interface, wherein the commodity information editing interface comprises a target picture aiming at a target commodity; the menu bar display unit is configured to respond to a trigger instruction of the target picture and display a menu bar, and the menu bar comprises a voice mark icon; the interface switching unit is configured to respond to a triggering instruction of the voice mark icon and enter a recording interface; and the marking unit is configured to store the voice signals collected based on the recording interface as audio files and add voice marks related to the audio files on the target picture.
According to a thirty-fourth aspect, there is provided a picture processing apparatus comprising: the display unit is configured to display an order evaluation interface for the first order, wherein the order evaluation interface comprises an add-picture icon; the receiving unit is configured to respond to the trigger instruction for the add-picture icon and receive the selected target picture; the interface switching unit is configured to respond to a voice marking instruction issued for the target picture and enter a recording interface; and the marking unit is configured to store the voice signals collected based on the recording interface as an audio file and add a voice mark associated with the audio file to the target picture.
According to a thirty-fifth aspect, there is provided a picture processing apparatus comprising: the interface display unit is configured to display a commodity evaluation interface for a target commodity, wherein the commodity evaluation interface comprises a first user evaluation, and the first user evaluation comprises a target picture with a first voice mark; the menu bar display unit is configured to respond to a trigger instruction aiming at the target picture and display a menu bar, and the menu bar comprises a voice reply icon; the interface switching unit is configured to respond to a trigger instruction of the voice reply icon and enter a recording interface; the marking unit is configured to store the voice signals collected based on the recording interface as audio files, and add second voice marks related to the audio files in the area, adjacent to the first voice marks, of the target picture; or adding the voice signal acquired based on the recording interface into the audio file corresponding to the first voice mark.
According to a thirty-sixth aspect, there is provided a picture processing apparatus, the apparatus being integrated in a customer service platform, the apparatus comprising: the receiving unit is configured to receive a conversation message sent by a user, wherein the conversation message comprises a target picture with a first voice mark; the acquisition unit is configured to acquire an audio file associated with the first voice tag, and perform voice recognition on the audio file to obtain a recognition text; the prediction unit is configured to input the recognition text into a pre-trained user question prediction model and output a corresponding user standard question; a feedback unit configured to feed back a question answer corresponding to the user standard question to the user.
According to a thirty-seventh aspect, there is provided an electronic file processing apparatus comprising: the display unit is configured to display a file processing interface aiming at a target electronic file, and a function menu bar of the file processing interface comprises a voice mark icon; the interface switching unit is configured to respond to a triggering instruction of the voice mark icon and enter a recording interface; and the marking unit is configured to store the voice signals collected based on the recording interface as audio files and add voice marks related to the audio files to the target electronic files.
According to a thirty-eighth aspect, there is provided a picture processing device, the device being integrated in a live platform, the device comprising: the interface display unit is configured to display a commodity information editing interface, wherein the commodity information editing interface comprises a target picture of a target commodity to be placed on a shelf; the menu bar display unit is configured to respond to a trigger instruction of the target picture and display a menu bar, and the menu bar comprises a voice mark icon; the interface switching unit is configured to respond to a triggering instruction of the voice mark icon and enter a recording interface; and the marking unit is configured to store the voice signals collected based on the recording interface as audio files and add voice marks related to the audio files on the target picture.
According to a thirty-ninth aspect, there is provided a computer-readable storage medium having stored thereon a computer program which, when executed in a computer, causes the computer to perform the method of any one of the first to eighth and seventeenth to twenty-seventh aspects.
According to a fortieth aspect, there is provided a computing device comprising a memory having stored therein executable code, and a processor which, when executing the executable code, implements the method of any one of the first to eighth and seventeenth to twenty-seventh aspects.
In summary, in the methods and apparatuses for voice tagging pictures or videos disclosed in the embodiments of the present disclosure, a recording can be made at any position on a picture or video (a voice tagging function), so that the picture or video can be interpreted and explained conveniently and quickly.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention; those skilled in the art can obtain other drawings based on these drawings without creative effort.
FIG. 1 illustrates an application scenario diagram of a voice tagging function, according to one embodiment;
FIG. 2 illustrates a flow diagram of a method of voice tagging a picture, according to one embodiment;
FIG. 3 illustrates a switching diagram for switching to a recording interface, according to one embodiment;
FIG. 4 illustrates a switching diagram of switching to a recording interface according to another embodiment;
FIG. 5 illustrates an interface switch diagram when using voice markup functionality, according to one embodiment;
FIG. 6 illustrates an interface switch diagram when using voice markup functionality according to another embodiment;
FIG. 7 illustrates an interface switch diagram for sequence number modification in a voice markup icon, according to one embodiment;
FIG. 8 illustrates an interface diagram including subject text in a voice markup icon, according to one embodiment;
FIG. 9 illustrates an interface switch diagram when playing an audio file, according to one embodiment;
FIG. 10 illustrates a speech recognition related interface switch diagram according to one embodiment;
FIG. 11 illustrates a flow diagram of a method of viewing pictures, according to one embodiment;
FIG. 12 illustrates an item detail interface including a picture of an item with a voice icon, according to one embodiment;
FIG. 13 illustrates a flow diagram of a method of voice tagging video, according to one embodiment;
FIG. 14 illustrates a voice markup interface diagram for a video, according to one embodiment;
FIG. 15 shows a voice markup interface diagram for a video according to another embodiment;
FIG. 16 shows a flow diagram of a video viewing method according to one embodiment;
FIG. 17 illustrates an interface switch diagram for viewing voice tagged video, according to one embodiment;
FIG. 18 illustrates a flow diagram of a method of text tagging a picture, according to one embodiment;
FIG. 19 illustrates an interface switch diagram when text marking a picture, according to one embodiment;
FIG. 20 illustrates a flowchart of a picture viewing method according to one embodiment;
FIG. 21 illustrates an interface switch diagram for viewing text labels in a picture, according to one embodiment;
FIG. 22 illustrates a flow diagram of a method of text tagging a video, according to one embodiment;
FIG. 23 illustrates a schematic diagram of interface switching when text marking is performed on a video, according to one embodiment;
FIG. 24 shows a flow diagram of a video viewing method according to one embodiment;
FIG. 25 illustrates an interface switch diagram to view text labels in a video, according to one embodiment;
FIG. 26 illustrates a block diagram of an apparatus for voice tagging a picture, according to one embodiment;
FIG. 27 shows a diagram of a picture viewing device structure, according to one embodiment;
FIG. 28 illustrates a block diagram of an apparatus for voice tagging video, according to one embodiment;
FIG. 29 shows a video viewing device structure diagram according to one embodiment;
FIG. 30 illustrates a block diagram of an apparatus for text tagging of pictures, according to one embodiment;
FIG. 31 illustrates a diagram of a picture viewing device structure, according to one embodiment;
FIG. 32 is a diagram illustrating the structure of an apparatus for text marking a video according to one embodiment;
FIG. 33 shows a video viewing device structure diagram according to one embodiment;
FIG. 34 shows a flow diagram of a picture processing method according to one embodiment;
FIG. 35 illustrates a chat interface diagram according to an embodiment;
FIG. 36 shows a flow diagram of a method of picture processing according to another embodiment;
FIG. 37 is a diagram illustrating a chat interface, according to another embodiment;
FIG. 38 is a diagram illustrating a chat interface, according to yet another embodiment;
FIG. 39 shows a flow diagram of a picture processing method according to yet another embodiment;
FIG. 40 illustrates a chat interface diagram according to yet another embodiment;
FIG. 41 is a diagram illustrating a chat interface, according to yet another embodiment;
FIG. 42 shows a flow diagram of a picture processing method according to yet another embodiment;
FIG. 43 shows a flow diagram of a picture processing method according to one embodiment;
FIG. 44 illustrates a chat interface diagram according to an embodiment;
FIG. 45 shows a flow diagram of a method of picture processing according to another embodiment;
FIG. 46 illustrates a merchandise information editing interface diagram, according to an embodiment;
FIG. 47 shows a flowchart of a picture processing method according to yet another embodiment;
FIG. 48 shows a flow diagram of a picture processing method according to yet another embodiment;
FIG. 49 shows a flow diagram of a picture processing method according to yet another embodiment;
FIG. 50 illustrates a flowchart of a method of processing an electronic file, according to one embodiment;
FIG. 51 illustrates an office software interface diagram according to one embodiment;
FIG. 52 shows a flow diagram of a picture processing method according to one embodiment;
FIG. 53 illustrates a live interface diagram according to one embodiment;
FIG. 54 shows a diagram of a picture processing device architecture according to one embodiment;
FIG. 55 is a diagram showing a configuration of a picture processing apparatus according to another embodiment;
FIG. 56 is a diagram showing a construction of a picture processing apparatus according to still another embodiment;
FIG. 57 is a diagram showing a configuration of a picture processing apparatus according to still another embodiment;
FIG. 58 is a diagram showing a configuration of a picture processing apparatus according to still another embodiment;
FIG. 59 shows a diagram of a picture processing device architecture according to one embodiment;
FIG. 60 is a diagram showing a construction of a picture processing apparatus according to still another embodiment;
FIG. 61 is a view showing the construction of a picture processing apparatus according to another embodiment;
FIG. 62 is a diagram showing a configuration of a picture processing apparatus according to still another embodiment;
FIG. 63 illustrates a block diagram of a processing device for electronic files, according to one embodiment;
FIG. 64 shows a diagram of a picture processing apparatus according to still another embodiment.
Detailed Description
The scheme provided by the specification is described below with reference to the accompanying drawings.
As mentioned above, in many scenarios, specific content in a picture or video needs to be explained. For example, when an explanation must be given based on a picture (for instance, teaching a parent how to operate a page of a shopping app), current chat tools only support the form of a picture plus text: editing text onto the picture carries a high operation cost, and large blocks of text easily block the original content of the picture. As another example, on a product detail page, if a product shown in a picture needs to be explained, the picture is usually placed first and the product is then explained with text below it.
On this basis, the inventor proposes a voice tagging function that combines a voice recording function with a picture marking function to provide targeted explanations of content on electronic media such as pictures and videos. Fig. 1 is a schematic diagram of an application scenario of the voice tagging function according to an embodiment. As shown in fig. 1, while using client A (e.g., an instant messaging client), user A may first add voice marks at several positions of a picture through the voice tagging function: for example, a long-press operation in the picture (e.g., pressing with a finger in fig. 1) triggers a recording, releasing the press ends the recording, and a voice mark is added at the position of the long press; in this way, a voice mark 10 and a voice mark 11 are added. Then, in response to a click instruction on the send icon 12, the picture with the voice marks is sent to client B. While viewing the picture through client B, user B may listen to the corresponding recording by clicking a voice mark on the picture; for example, clicking the voice mark 10 plays the corresponding recording, and at the same time the display state of the voice mark 10 switches to the voice mark 13 to indicate that the recording is currently being played.
Thus, a user who adds voice marks to a picture can conveniently and quickly give targeted explanations of the content at any position in the picture, while a user who views a picture with voice marks can intuitively and conveniently obtain the information they need about the picture. The voice tagging function proposed by the inventor can therefore substantially improve the experience of both kinds of users.
The following describes an implementation of the voice tagging function with reference to a specific embodiment.
Specifically, fig. 2 shows a flowchart of a method for voice tagging a picture according to an embodiment. The execution subject of the method may be any device, platform, or device cluster with computing and processing capabilities, for example, a client (e.g., picture processing software), system software, or a client plug-in (e.g., a plug-in of an instant messaging client). As shown in fig. 2, the method comprises the following steps:
step S210, displaying a recording interface comprising a target picture; step S220, responding to a recording start instruction sent out based on the recording interface, and continuously acquiring voice signals; step S230, responding to a recording ending instruction sent out based on the recording interface, and storing the collected voice signal as an audio file; step S240, adding a voice tag associated with the audio file on the target picture.
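Before each step is described in detail, a minimal client-side sketch of steps S210 to S240 follows, under the assumption of a browser environment with the Web MediaRecorder API; the names (VoiceMark, startRecording, stopRecording) are illustrative and not part of this specification.

```typescript
// Minimal sketch of steps S210-S240 in a browser client.
// Assumption: Web MediaRecorder API; all names here are illustrative.
interface VoiceMark {
  x: number;        // mark position on the target picture, relative (0..1)
  y: number;
  audioUrl: string; // object URL of the stored audio file
}

let recorder: MediaRecorder | null = null;
let chunks: Blob[] = [];

// S220: on the recording start instruction, continuously acquire the voice signal.
async function startRecording(): Promise<void> {
  const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
  chunks = [];
  recorder = new MediaRecorder(stream);
  recorder.ondataavailable = (e) => chunks.push(e.data);
  recorder.start();
}

// S230 + S240: on the recording end instruction, store the acquired signal
// as an audio file and add a voice mark associated with it at (x, y).
function stopRecording(marks: VoiceMark[], x: number, y: number): Promise<VoiceMark> {
  return new Promise((resolve) => {
    recorder!.onstop = () => {
      const file = new Blob(chunks, { type: "audio/webm" }); // the audio file
      const mark: VoiceMark = { x, y, audioUrl: URL.createObjectURL(file) };
      marks.push(mark); // associate the voice mark with the target picture
      resolve(mark);
    };
    recorder!.stop();
  });
}
```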
The steps are as follows:
first, in step S210, a recording interface including a target picture is displayed.
In one embodiment, this step may include: in response to an import instruction for the target picture issued based on the recording interface, displaying the target picture in the recording interface. In a specific embodiment, the recording interface may be an editing interface in picture processing software, a goods information editing interface in an e-commerce platform, or an editing interface for a picture to be sent in an instant messaging client. In a specific embodiment, the import instruction may be a voice control instruction or a click instruction, such as a click on an import icon. In a specific example, the recording interface is entered in response to the user opening the picture processing APP; the pictures in the gallery are displayed in response to a click on the picture import icon in the recording interface; and a picture is then displayed in the recording interface in response to the user's selection of that picture and an import confirmation instruction.
In one embodiment, prior to this step, the method may further comprise: responding to a trigger instruction aiming at the target picture, and displaying a voice mark icon; accordingly, this step may include: and responding to a triggering instruction of the voice mark icon, and jumping to the recording interface.
In a specific embodiment, the triggering instruction for the target picture may include a viewing instruction, an editing instruction, or a sending instruction. In another specific embodiment, the triggering instruction for the target picture may be a screen capture instruction for capturing the target picture; that is, the trigger instruction is a screen capture instruction, and the captured picture is the target picture. According to a specific example, a chat interface 31 is shown in fig. 3; in response to an operation instruction on a picture 32 therein (which may correspond to a long-press operation), an operation menu bar 33 including a voice mark icon 34 is displayed. Accordingly, in response to a click instruction on the voice mark icon 34, the interface jumps to the recording interface 35.
In another specific embodiment, before displaying the voice mark icon in response to the triggering instruction for the target picture, the method may further include: displaying a playing interface aiming at the first video; further, in a more specific embodiment, in response to a triggering instruction for the target picture, displaying the voice mark icon may include: and responding to a jump instruction sent aiming at the first video, determining a video frame subjected to jump display as the target picture, and displaying the voice mark icon on the playing interface. In another more specific embodiment, in response to a triggering instruction for a target picture, displaying a voice markup icon may include: and responding to a pause instruction sent for the first video, determining a video frame which is paused to be displayed as the target picture, and displaying the voice mark icon on the playing interface.
According to a specific example, as shown in fig. 4, a playing interface 41 for a first video is displayed; a voice mark icon 42 is displayed in the playing interface in response to a pause instruction issued for the first video; then, in response to a click instruction on the voice mark icon 42, the interface jumps to a recording interface 43 (or voice mark interface 43), which includes a target picture 44, namely the video frame displayed when playback was paused. In this way, jumping to the recording interface in response to a triggering instruction for the voice mark icon can be realized.
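One possible way to realize "determining the paused video frame as the target picture" is sketched below, assuming an HTML video element in a browser client; the specification itself does not prescribe this mechanism.

```typescript
// Sketch: when the first video is paused, snapshot the currently displayed
// frame to serve as the target picture (assumes an HTML <video> element).
function captureTargetPicture(video: HTMLVideoElement): string {
  const canvas = document.createElement("canvas");
  canvas.width = video.videoWidth;
  canvas.height = video.videoHeight;
  const ctx = canvas.getContext("2d")!;
  ctx.drawImage(video, 0, 0);           // copy the paused frame
  return canvas.toDataURL("image/png"); // target picture as a data URL
}

const video = document.querySelector("video")!;
video.addEventListener("pause", () => {
  const targetPicture = captureTargetPicture(video);
  // ...display the voice mark icon on the playing interface here
  console.log("captured target picture, data URL length:", targetPicture.length);
});
```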
On the other hand, in one embodiment, this step may further include: displaying prompt information in the recording interface, where the prompt information is used to prompt the user about the operation mode for adding a voice mark to the target picture. In one example, as shown in fig. 3, the prompt message 36 displayed in the recording interface 35 reads: press and hold a position on the picture to start recording; a voice mark will be generated at that position.
In the above, a recording interface including the target picture may be displayed. Next, in step S220, in response to a recording start instruction issued based on the recording interface, a voice signal is continuously acquired. And, in step S230, in response to a recording end instruction issued based on the recording interface, storing the acquired voice signal as an audio file, and in step S240, adding a voice tag associated with the audio file on the target picture.
In one embodiment, the recording start instruction and the recording end instruction may each correspond to several operation modes. In a specific embodiment, the recording start instruction may correspond to a long-press operation on the target picture, and the recording end instruction to releasing the press on the target picture. In another specific embodiment, the recording interface includes a recording icon; the recording start instruction may correspond to a click on the recording icon in a first state, in response to which the recording icon switches to a second state, and the recording end instruction may correspond to a click on the recording icon in the second state, in response to which the recording icon switches back to the first state. In a further specific embodiment, the recording start instruction corresponds to a click of the right mouse button, and the recording end instruction to clicking the right mouse button again. In a further specific embodiment, the recording start instruction corresponds to a triggering instruction of a recording start icon in the recording interface, and the recording end instruction to a triggering instruction of a recording end icon in the recording interface.
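As an illustration of the long-press operation mode, the following sketch maps pointer events to the recording start and end instructions. It assumes the startRecording/stopRecording helpers from the earlier sketch; a real implementation would also add a long-press time threshold before starting.

```typescript
// Sketch: long-press on the target picture issues the recording start
// instruction; releasing the press issues the recording end instruction.
// startRecording/stopRecording are assumed from the earlier sketch.
declare function startRecording(): Promise<void>;
declare function stopRecording(
  marks: { x: number; y: number; audioUrl: string }[],
  x: number,
  y: number,
): Promise<unknown>;

const picture = document.getElementById("target-picture")!;
const marks: { x: number; y: number; audioUrl: string }[] = [];

picture.addEventListener("pointerdown", () => {
  void startRecording();           // press: start recording
});

picture.addEventListener("pointerup", (e: PointerEvent) => {
  const rect = picture.getBoundingClientRect();
  // the "first position": where the press happened, in relative coordinates
  const x = (e.clientX - rect.left) / rect.width;
  const y = (e.clientY - rect.top) / rect.height;
  void stopRecording(marks, x, y); // release: end recording, add the mark
});
```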
Further, in one embodiment, step S220 may include: and responding to a recording starting instruction sent out based on the first position of the target picture in the recording interface, and continuously acquiring voice signals. It should be noted that the first position may be any position in the target picture. Accordingly, step S240 may include: and adding the voice mark at a first position of the target picture.
According to an example, as shown in fig. 5, in the recording interface, in response to a long press operation made to a first position 52 in a target picture 51, recording is started, and then, in response to a cancel of the long press operation, recording is ended, and a voice mark 53 related to recorded audio is added to the first position 52. In this manner, the addition of a voice tag at a specified first location may be achieved.
In another embodiment, step S240 may include: and adding the voice mark at any position in the target picture. In a specific embodiment, the arbitrary position may be a random position determined according to a random algorithm, or may be a default fixed position of the system, such as a central region of the target picture. Further, in one embodiment, after step S240, the method may further include: and responding to a moving instruction of the voice mark, and moving the voice mark to a specified position in the target picture.
According to an example, as shown in fig. 6, a target picture 61 is included in a recording interface, recording is started in response to a click instruction on a recording icon 62 in a first state, and then recording is ended in response to a click instruction on a recording icon 63 in a second state, and a voice mark is added to a center area 64 of the picture. Further, in response to the moving instruction for the voice mark 64, the voice mark 64 is moved to the target area 65. In this way, displaying the voice mark at the designated position can be realized.
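The default center placement and the subsequent move instruction might be realized as sketched below, under the assumption that the voice mark is an absolutely positioned element overlaid on the picture; the function names are illustrative.

```typescript
// Sketch: place a voice mark at the picture's center by default, then let
// a move instruction (drag) relocate it to a specified position.
// Assumes the mark is an absolutely positioned element over the picture.
function placeMark(mark: HTMLElement, x = 0.5, y = 0.5): void {
  mark.style.left = `${x * 100}%`; // default: center of the target picture
  mark.style.top = `${y * 100}%`;
}

function makeMovable(mark: HTMLElement, picture: HTMLElement): void {
  mark.addEventListener("pointerdown", () => {
    const onMove = (e: PointerEvent) => {
      const rect = picture.getBoundingClientRect();
      placeMark(
        mark,
        (e.clientX - rect.left) / rect.width,
        (e.clientY - rect.top) / rect.height,
      );
    };
    const onUp = () => {
      document.removeEventListener("pointermove", onMove);
      document.removeEventListener("pointerup", onUp);
    };
    document.addEventListener("pointermove", onMove);
    document.addEventListener("pointerup", onUp);
  });
}
```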
It should be noted that, by repeatedly performing the above steps S220 to S240, a plurality of voice tags can be added to the target picture, and the voice tags are located at different positions in the target picture.
On the other hand, in one embodiment, step S240 may further include: displaying a sequence number in the voice tag. In a specific embodiment, the sequence number may be generated automatically, in particular based on the number of voice tags previously added to the target picture. In a more specific embodiment, assuming that the target picture already includes n voice tags (n a natural number), the sequence number displayed on the newly added voice tag is n + 1. In another embodiment, the sequence number may be input by the user. According to an example, as shown in fig. 7, the user modifies the sequence number displayed on the voice mark 71 from 1 to 2. In this way, the user can listen to the recordings corresponding to the voice marks in order, guided by the displayed sequence numbers.
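A small sketch of this numbering rule (n existing marks, so the new mark is numbered n + 1) together with the user override follows; the types and function names are illustrative.

```typescript
// Sketch: with n voice marks already on the picture, the newly added mark
// is automatically numbered n + 1; the user may also enter a number manually.
interface NumberedMark {
  seq: number;
  audioUrl: string;
}

function nextSeq(marks: NumberedMark[]): number {
  return marks.length + 1; // n existing marks -> new sequence number n + 1
}

function renumber(mark: NumberedMark, userInput: string): void {
  const n = Number.parseInt(userInput, 10);
  if (Number.isInteger(n) && n >= 1) mark.seq = n; // e.g. change 1 to 2
}
```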
In one embodiment, step S240 may further include: displaying a subject text in the voice mark. Regarding the determination of the subject text, in a specific embodiment, step S230 may further include: performing voice recognition on the collected voice signal to obtain a recognition text, and determining the corresponding subject text from the recognition text. It should be understood that the voice recognition can be implemented with existing technology, such as a speech recognition model, and is not described here. In a more specific embodiment, determining the corresponding subject text from the recognition text may include: inputting the recognition text into a pre-trained summary extraction model to obtain a corresponding summary text, which is used as the subject text. In another more specific embodiment, determining the corresponding subject text from the recognition text may include: inputting the recognition text into a pre-trained keyword extraction model to obtain a corresponding keyword, which is used as the subject text. According to an example, assume the recognition text reads "the collar of this piece of clothing is designed as a round collar, which can make the neck appear longer"; inputting it into the keyword extraction model yields the keyword "collar", which is used as the subject text, as shown in fig. 8, where the subject text "collar" is displayed in the voice mark 81.
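For illustration, a sketch of deriving the subject text from the recognition text follows. The "/extract-keyword" endpoint is hypothetical, since the specification only requires some pre-trained summary or keyword extraction model.

```typescript
// Sketch: derive the subject text of a voice mark from its recognition text.
// NOTE: the "/extract-keyword" endpoint is hypothetical; the specification
// only requires some pre-trained summary or keyword extraction model.
async function subjectText(recognitionText: string): Promise<string> {
  const resp = await fetch("/extract-keyword", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ text: recognitionText }),
  });
  const { keyword } = await resp.json();
  return keyword; // e.g. "collar" for the round-collar example above
}
```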
In another embodiment, after step S240, the method may further include: and receiving a custom text input by a user based on the voice mark, and displaying the custom text in the voice mark.
In yet another aspect, in an embodiment, after step S240, the method may further include: and responding to a triggering instruction of the voice mark, and playing the audio file. In a specific embodiment, the triggering command may be a voice control command or a click command. In one example, as shown in fig. 9, in response to a click instruction to the voice mark 91 in the first state, the audio file is played, and at the same time, the voice mark 91 switches to the voice mark 92 displayed in the second state for prompting the user that the recorded audio is being played.
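The play-on-trigger behavior with the two display states might look as follows; a sketch assuming the voice mark is a DOM element whose state is toggled via a CSS class.

```typescript
// Sketch: clicking a voice mark plays its audio file; while playback is in
// progress the mark is shown in a second state (here, a CSS class).
function wirePlayback(markEl: HTMLElement, audioUrl: string): void {
  const audio = new Audio(audioUrl);
  markEl.addEventListener("click", () => {
    markEl.classList.add("playing"); // second state: audio is being played
    audio.currentTime = 0;
    void audio.play();
  });
  audio.addEventListener("ended", () => {
    markEl.classList.remove("playing"); // back to the first state
  });
}
```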
In one embodiment, after step S240, that is, after adding a voice tag associated with the audio file on the target picture, the method may further include: displaying a menu bar in response to a triggering instruction of the voice mark. In a specific embodiment, the menu bar includes a play icon, and the audio file is played in response to a triggering instruction of the play icon. In another specific embodiment, the menu bar includes a voice recognition icon; in response to a triggering instruction of the voice recognition icon, voice recognition is performed on the audio file to obtain a recognition text, and the recognition text is displayed in the recording interface. In a more specific embodiment, after the recognition text is displayed in the recording interface, the method may further include: hiding the recognition text in the recording interface in response to a hiding instruction for the recognition text.
According to an example, as shown in fig. 10, a menu bar 102 including a voice recognition icon 103 is displayed in response to a click instruction on a voice mark 101, a recognition text 104 is displayed in response to a trigger instruction on the voice recognition icon 103, and further, the recognition text is hidden in response to a click instruction on a folding icon 105.
In still another aspect, in an embodiment, after step S240, the method may further include: deleting the voice mark from the target picture in response to a deletion instruction for the voice mark. In a specific embodiment, a delete button is displayed in response to a long-press instruction on the voice mark, and the voice mark is deleted in response to a click instruction on the delete button. In this way, deletion of the voice tag can be achieved.
In summary, with the method for voice marking a picture disclosed in the embodiments of the present specification, a user can conveniently and quickly add voice marks to any position in a picture that needs annotation, which greatly reduces the picture editing cost and improves the user experience.
According to an embodiment of another aspect, the embodiments of the present specification further disclose a picture viewing method. In particular, fig. 11 shows a flowchart of a picture viewing method according to an embodiment, and an execution subject of the method may be any device, apparatus, platform, or server cluster with computing and processing capabilities, for example, a client (e.g., picture processing software), system software, or a client plug-in (e.g., a plug-in in an instant messaging client). As shown in fig. 11, the method comprises the following steps:
step S1110, displaying a picture with a voice mark. It is to be understood that the voice mark therein is added by the method described in the foregoing embodiments. Step S1120, in response to a triggering instruction for the voice mark, playing an audio file associated with the voice mark.
The steps are as follows:
first, in step S1110, a picture with a voice tag is displayed.
In one embodiment, this step may include: displaying the picture in a chat window. In a specific embodiment, in an instant messaging scene (such as online social networking or online customer service consultation), the picture can be received from another instant messaging client and displayed in the chat window.
In another embodiment, this step may include: loading the picture in a web page. In a specific embodiment, in response to an opening instruction for a web page, the web page is entered, and the picture is loaded in the web page. In a specific embodiment, the web page may be a product detail page in an e-commerce platform, and accordingly, the subject object in the picture may be a product. In one example, as shown in fig. 12, the product detail page includes a voice-marked product picture 121. In another specific embodiment, the web page may be a teaching tutoring website, and accordingly, the subject object in the picture may be a test paper.
In one embodiment, the picture may carry one or more voice marks.
In the above, a picture with a voice mark can be displayed. Further, in step S1120, in response to the triggering instruction for the voice tag, the audio file associated with the voice tag is played.
In one embodiment, the picture may carry multiple voice marks; accordingly, this step may include: sequentially playing the audio files corresponding to the voice marks based on the order in which the voice marks were added to the picture.
In another embodiment, the picture may carry multiple voice marks, each displaying a corresponding sequence number; accordingly, this step may include: sequentially playing the audio files corresponding to the voice marks based on the order of the sequence numbers.
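A minimal sketch of such ordered playback is given below, assuming each mark carries its sequence number and an audio URL; the VoiceMark shape is illustrative, not part of the specification.

```typescript
// Sketch of sequential playback for a picture carrying several voice marks,
// ordered by each mark's displayed sequence number.
interface VoiceMark {
  sequenceNumber: number;
  audioUrl: string;
}

async function playMarksInOrder(marks: VoiceMark[]): Promise<void> {
  const ordered = [...marks].sort((a, b) => a.sequenceNumber - b.sequenceNumber);
  for (const mark of ordered) {
    const audio = new Audio(mark.audioUrl);
    // Wait for each clip to finish before starting the next one.
    await new Promise<void>((resolve, reject) => {
      audio.addEventListener("ended", () => resolve());
      audio.play().catch(reject);
    });
  }
}
```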
On the other hand, step S1120 may be replaced by: in response to a triggering instruction for a voice recognition icon, converting the audio file corresponding to the voice mark into a voice recognition text, and displaying the voice recognition text.
In summary, by using the picture viewing method disclosed in the embodiment of the present specification, the user can intuitively and conveniently obtain the content that needs to be known based on the picture, thereby improving the user experience.
According to an embodiment of yet another aspect, the present specification also discloses a method of voice marking a video. In particular, fig. 13 shows a flowchart of a method for voice marking a video according to an embodiment, and an execution subject of the method may be any device, apparatus, platform, or device cluster with computing and processing capabilities, for example, a client (e.g., video processing software), system software, or a client plug-in (e.g., a plug-in in an instant messaging client). As shown in fig. 13, the method includes the following steps:
step S1310, displaying a recording interface including a first video, where the first video includes a first video frame; step S1320, determining the first video frame as a target video frame in response to a selection instruction for the first video frame; step S1330, continuously collecting a voice signal in response to a recording start instruction issued based on the recording interface; step S1340, storing the collected voice signal as an audio file in response to a recording end instruction issued based on the recording interface; step S1350, adding a voice mark associated with the audio file on the target video frame.
The steps are as follows:
first, in step S1310, a recording interface including a first video is displayed, where the first video includes a first video frame.
In one embodiment, prior to this step, the method may further include: displaying a voice mark icon in response to a triggering instruction for the first video; accordingly, this step may include: jumping to the recording interface in response to a triggering instruction for the voice mark icon. In a specific embodiment, the triggering instruction for the first video may be a viewing instruction, an editing instruction, or a sending instruction.
In one embodiment, this step may include: displaying the first video in the recording interface in response to an import instruction for the first video issued based on the recording interface. In a specific embodiment, the recording interface may be an editing interface in video processing software. In a specific embodiment, the import instruction may be a voice control instruction or a click instruction, such as a click instruction on an import icon. In a specific example, the recording interface is entered in response to a user's opening instruction for the video processing APP; existing videos in the terminal are displayed in response to a click instruction on a video import icon in the recording interface; and a selected video is then displayed in the recording interface in response to the user's selection instruction and import confirmation instruction for that video.
In the above, a recording interface including the first video may be displayed. Next, in step S1320, in response to the selection instruction for the first video frame, the first video frame is determined as the target video frame.
In one embodiment, this step may include: in response to a jump instruction issued for the first video, determining the first video frame displayed after the jump as the target video frame. In a specific embodiment, the jump instruction may correspond to a drag operation on the video progress bar. In another specific embodiment, the jump instruction may correspond to inputting the playing time point of the first video frame.
In another embodiment, this step may include: in response to a pause instruction issued for the first video, determining the first video frame whose display is paused as the target video frame. In a specific embodiment, before receiving the pause instruction, the method may further include: and playing the first video.
According to one example, as shown in fig. 14, in response to a drag operation on the progress bar 141, the first video frame 142 displayed after the jump is determined as the target video frame. In this manner, a target video frame can be determined.
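The frame selection logic can be sketched with standard HTMLVideoElement events, as below; the frame rate used to convert a time point into a frame index is an assumption, since the specification does not fix one.

```typescript
// Sketch of determining the target video frame: seeking via the progress
// bar (a jump instruction) or pausing selects the frame at currentTime.
function targetFrameIndex(video: HTMLVideoElement, fps = 25): number {
  return Math.floor(video.currentTime * fps);
}

function watchFrameSelection(
  video: HTMLVideoElement,
  onSelect: (frameIndex: number, timeSec: number) => void,
): void {
  const report = () => onSelect(targetFrameIndex(video), video.currentTime);
  video.addEventListener("seeked", report); // progress-bar drag (fig. 14)
  video.addEventListener("pause", report);  // pause instruction
}
```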
Then, in step S1330, in response to a recording start instruction issued based on the recording interface, a voice signal is continuously acquired; step S1340, responding to a recording ending instruction sent out based on the recording interface, and storing the collected voice signal as an audio file; step S1350, adding a voice tag associated with the audio file to the target video frame.
In one embodiment, step S1330 may include: continuously acquiring voice signals in response to a recording start instruction issued based on a first position of a target video frame in the recording interface; accordingly, step S1350 may include: adding the voice tag at a first location of the target video frame.
It should be noted that, since the target video frame is also a picture in nature, the descriptions of step S1330 to step S1350 can be referred to the descriptions of step S220 to step S240 in the foregoing embodiment.
Further, in one embodiment, after step S1350, the method may further include: adding a time point mark to the progress bar of the first video, the time point mark corresponding to the playing time point of the target video frame. In a specific embodiment, the shape of the time point mark is not limited and may be a circle, a rectangle, a triangle, or the like. In this way, the user can quickly locate the video frames that carry voice marks. Further, in a specific embodiment, after the time point mark is added to the progress bar of the first video, the method may further include: in response to an input control indicator moving to the time point mark, presenting the number of voice marks and/or the subject texts included in the target video frame. In a more specific embodiment, the input control indicator may be a cursor, or an indicator generated by a user's touch operation on the screen. In a more specific embodiment, the number of voice marks and/or the subject texts may be displayed in a pop-up window or a bubble.
According to an example, as shown in fig. 15, in response to the cursor moving to the time point mark 151, the number of voice marks included in the corresponding video frame and the subject text corresponding to each voice mark are displayed in the bubble 152; for example, the number of voice marks is 3, and the subject texts are, respectively: city gate, city wall, and ancient clock. In this way, voice marking and review of the video frames in a video can be achieved.
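A sketch of this hover behavior follows; the TimePointMark shape and the bubble helpers are assumed for illustration, not part of the disclosure.

```typescript
// Sketch of the bubble from fig. 15: moving the cursor onto a time point
// mark reveals how many voice marks the frame carries and their subjects.
interface TimePointMark {
  element: HTMLElement;
  timeSec: number;
  subjectTexts: string[]; // e.g. ["city gate", "city wall", "ancient clock"]
}

function bindTimePointBubble(
  mark: TimePointMark,
  showBubble: (text: string) => void,
  hideBubble: () => void,
): void {
  mark.element.addEventListener("mouseenter", () => {
    showBubble(
      `voice marks: ${mark.subjectTexts.length}; subjects: ${mark.subjectTexts.join(", ")}`,
    );
  });
  mark.element.addEventListener("mouseleave", hideBubble);
}
```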
In summary, the method for voice marking a video disclosed in the embodiments of the present specification enables a user to conveniently and quickly add voice marks to any position in any video frame of a video, which enriches the ways of editing video and saves the time otherwise spent capturing and editing video frames, thereby improving the user experience.
According to an embodiment of yet another aspect, the present specification also discloses a video viewing method. In particular, fig. 16 shows a flowchart of a video viewing method according to an embodiment, and an execution subject of the method may be any device, apparatus, platform, or device cluster with computing and processing capabilities, for example, a client (e.g., video processing software), system software, or a client plug-in (e.g., a plug-in in an instant messaging client). As shown in fig. 16, the method includes the following steps:
step S1610, displaying a video with a voice tag, where the video includes a first video frame with a first voice tag. It is to be understood that the voice tags are added by the method of voice tagging video described in the foregoing embodiments. Step S1620, in response to the trigger instruction for the first voice tag, playing the audio file associated with the first voice tag.
The steps are as follows:
first, in step S1610, a video with a voice tag is displayed, including a first video frame with a first voice tag.
In one embodiment, this step may include: displaying the video in a chat window. In a particular embodiment, in an instant messaging scenario (e.g., online social or online customer service consultation), the video may be received from other instant messaging clients and displayed in a chat window.
In another embodiment, this step may include: and loading the video in a webpage. In a specific embodiment, in response to an open instruction to a web page, the web page is entered and the video is loaded in the web page. In a specific embodiment, the web page may be a product detail page in an e-commerce platform. In another embodiment, the web page may be an instructional tutoring web site. In another embodiment, the web page may be a playback page provided by the video playback platform.
In one embodiment, there may be one or more video frames with voice marks in the video. Further, any one of these video frames may carry one or more voice marks.
Next, in step S1620, in response to the trigger instruction for the first voice tag, playing an audio file associated with the first voice tag. In one example, as shown in FIG. 17, in response to an operation to move a cursor to a voice tag 171, its associated audio file is played.
It should be noted that, for the description of step S1620, reference may also be made to the foregoing description of step S1120 and the like.
In addition, in one embodiment, the first video frame includes a plurality of voice marks including the first voice mark, and the progress bar of the video displays a time point mark at the playing time point of the first video frame. Accordingly, after step S1620, the method may further include: in response to the input control indicator moving to the time point mark, presenting the number of the voice marks and/or the subject texts corresponding to the voice marks. For this, reference may be made to the related description in the foregoing embodiments, which is not repeated here.
In summary, by using the video viewing method disclosed in the embodiment of the present specification, a user can intuitively and conveniently acquire specific description contents made for a part of video frames, thereby improving user experience.
The above description mainly concerns methods of adding voice marks to multimedia (such as pictures and videos, and in fact also electronic documents and the like) and of viewing multimedia with voice marks. Further, the inventor proposes that the scheme can be extended to the category of text marks; specifically, text marks can be added by combining the recording function with voice recognition technology.
Fig. 18 shows a flowchart of a method for text marking a picture according to an embodiment, and an execution subject of the method may be any device, apparatus, platform, or device cluster with computing and processing capabilities, for example, a client (e.g., picture processing software), system software, or a client plug-in (e.g., a plug-in in an instant messaging client). As shown in fig. 18, the method includes the following steps:
step S1810, displaying an editing interface aiming at the target picture; step S1820, responding to a recording starting instruction sent out based on the editing interface, and continuously collecting voice signals; step S1830, in response to the recording end instruction sent out based on the editing interface, performing voice recognition on the collected voice signal to obtain a recognition text; step S1840, adding a text label associated with the identification text on the target picture.
The steps are as follows:
first, in step S1810, an editing interface for a target picture is displayed.
In one embodiment, the target picture may be imported into picture processing or picture editing software, and an editing interface for the target picture may be displayed. In another embodiment, the target picture may be selected from the gallery, and the edit icon may then be clicked, thereby entering the editing interface including the target picture. It should be understood that, besides the marking function described herein, the editing interface may provide other picture processing functions, such as doodling.
In the above, an editing interface for the target picture may be displayed. Next, in step S1820, in response to a recording start instruction issued based on the editing interface, a voice signal may be continuously collected; step S1830, in response to the recording end instruction sent out based on the editing interface, performing voice recognition on the collected voice signal to obtain a recognition text; step S1840, adding a text label associated with the identification text on the target picture.
In one embodiment, step S1820 may include: continuously collecting a voice signal in response to a recording start instruction issued based on a first position of the target picture in the editing interface; accordingly, step S1840 may further include: adding the text mark at the first position of the target picture.
In one embodiment, step S1840 may further include: displaying the recognition text in the text label.
In one embodiment, step S1840 may further include: displaying the recognition text in the text mark in a folded state. Accordingly, after step S1840, the method may further include: expanding and displaying the recognition text in the text mark in response to a triggering instruction for the text mark.
On the other hand, in one embodiment, step S1840 may further include: displaying a sequence number in the text mark, the sequence number being determined based on the number of text marks previously added to the target picture. In another embodiment, step S1830 may further include: determining a subject text corresponding to the recognition text; accordingly, step S1840 may further include: displaying the subject text in the text mark.
Further, after step S1840, the method may further include: presenting the recognition text in response to a triggering instruction for the text mark.
According to a specific example, as shown in fig. 19, the sequence number 3 is displayed in a newly added text mark 191, and the corresponding recognition text 192 is presented in response to a click instruction on the text mark 191.
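The folded/expanded behavior can be sketched as below; the element handles are assumptions used only for illustration.

```typescript
// Sketch of the folded/expanded text mark from fig. 19: the recognition
// text starts collapsed, and a click on the mark toggles it.
function bindTextMark(markEl: HTMLElement, textEl: HTMLElement): void {
  textEl.hidden = true; // folded display by default
  markEl.addEventListener("click", () => {
    textEl.hidden = !textEl.hidden; // expand on trigger, fold on re-trigger
  });
}
```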
In summary, with the method for text marking a picture disclosed in the embodiments of the present specification, a user can conveniently and quickly add text marks to any position in a picture that needs annotation, which greatly reduces the picture editing cost and improves the user experience.
According to an embodiment of another aspect, the present specification further discloses a picture viewing method. In particular, fig. 20 shows a flowchart of a picture viewing method according to an embodiment, and an execution subject of the method may be any device, apparatus, platform, or server cluster with computing and processing capabilities, for example, a client (e.g., picture processing software), system software, or a client plug-in (e.g., a plug-in in an instant messaging client). As shown in fig. 20, the method includes the following steps:
step S2010, displaying a picture with a text mark. It is to be understood that the text mark is added based on the method described in the foregoing embodiments. Step S2020, in response to a triggering instruction for the text mark, presenting the recognition text associated with the text mark.
The steps are as follows:
first, in step S2010, a picture with a text mark is displayed. In one embodiment, this step may further include: displaying a sequence number in the text mark. In another embodiment, this step may further include: displaying, in the text mark, the subject text corresponding to the recognition text. In yet another embodiment, this step may further include: displaying the recognition text in the text mark in a folded state.
Next, in step S2020, in response to a triggering instruction for the text mark, the recognition text associated with the text mark is presented. In one embodiment, this step may include: expanding and displaying the recognition text in the text mark.
It should be noted that, after step S2020, the method may further include: restoring the display of the text mark in response to a collapse instruction for the recognition text.
According to a specific example, as shown in fig. 21, in response to a click instruction on the text mark 211, the corresponding recognition text 212 is displayed in an expanded state; further, in response to a click instruction on the folding icon 213, the recognition text is displayed in a folded state.
In summary, by using the picture viewing method disclosed in the embodiment of the present specification, the user can intuitively and conveniently obtain the content that needs to be known based on the picture, thereby improving the user experience.
According to an embodiment of yet another aspect, the present specification also discloses a method of text marking a video. In particular, fig. 22 shows a flowchart of a method for text marking a video according to an embodiment, where an execution subject of the method may be any device, apparatus, platform, or device cluster with computing and processing capabilities, for example, a client (e.g., video processing software), system software, or a client plug-in (e.g., a plug-in in an instant messaging client). As shown in fig. 22, the method includes the following steps:
step S2210, displaying an editing interface including a first video, the first video including a first video frame; step S2220, determining the first video frame as a target video frame in response to a selection instruction for the first video frame; step S2230, continuously collecting a voice signal in response to a recording start instruction issued based on the editing interface; step S2240, performing voice recognition on the collected voice signal to obtain a recognition text in response to a recording end instruction issued based on the editing interface; step S2250, adding a text mark associated with the recognition text on the target video frame.
With respect to the above steps, in one embodiment, step S2230 may include: in response to a recording start instruction issued based on a first position of a target video frame in the editing interface, continuously acquiring a voice signal, and accordingly, in step S2250, may include: and adding the text mark at the first position of the target video frame.
According to a specific example, as shown in fig. 23, a text mark 231 is added to a certain video frame in a video, and the corresponding recognition text 232 is presented in response to a click instruction on the text mark 231.
It should be noted that, for the introduction of step S2210-step S2250, reference may also be made to the related description in the foregoing embodiments.
In summary, the method for text marking a video disclosed in the embodiments of the present specification enables a user to conveniently and quickly add text marks to any position in any video frame of a video, which enriches the ways of editing video and saves the time otherwise spent capturing and editing video frames, thereby improving the user experience.
According to an embodiment of yet another aspect, the present specification also discloses a video viewing method. In particular, fig. 24 shows a flowchart of a video viewing method according to an embodiment, and an execution subject of the method may be any device, apparatus, platform, or device cluster with computing and processing capabilities, for example, a client (e.g., video processing software), system software, or a client plug-in (e.g., a plug-in in an instant messaging client). As shown in fig. 24, the method includes the following steps:
step S2410, displaying a video with a text mark, where the video includes a first video frame with a first text mark. It should be noted that the text mark is added by the method in the foregoing embodiments. Step S2420, in response to a triggering instruction for the first text mark, presenting the recognition text associated with the text mark.
According to one example, as shown in fig. 25, in response to an instruction to move the cursor to a text mark 251 in a video frame, the corresponding recognition text 252 is displayed at a position outside the video frame in the video playing interface.
In summary, by using the video viewing method disclosed in the embodiment of the present specification, a user can intuitively and conveniently acquire specific description contents made for a part of video frames, thereby improving user experience.
Corresponding to the above methods, the embodiments of the present specification also disclose various apparatuses. Specifically:
FIG. 26 illustrates a block diagram of an apparatus for voice tagging a picture, according to one embodiment. As shown in fig. 26, the apparatus 2600 comprises:
a display unit 2610 configured to display a recording interface including a target picture; an acquisition unit 2620 configured to continuously collect a voice signal in response to a recording start instruction issued based on the recording interface; a storage unit 2630 configured to store the collected voice signal as an audio file in response to a recording end instruction issued based on the recording interface; and an adding unit 2640 configured to add a voice mark associated with the audio file on the target picture.
In one embodiment, the display unit 2610 is specifically configured to: displaying the target picture in the recording interface in response to an import instruction for the target picture issued based on the recording interface.
In one embodiment, the apparatus 2600 further comprises: the triggering unit is configured to respond to a triggering instruction aiming at the target picture and display a voice mark icon; the display unit 2610 is specifically configured to: and responding to a triggering instruction of the voice mark icon, and jumping to the recording interface.
In a specific embodiment, the triggering unit is specifically configured to: displaying the voice mark icon in response to a viewing instruction for the target picture; or displaying the voice mark icon in response to an editing instruction for the target picture; or displaying the voice mark icon in response to a screen capture instruction for capturing the target picture; or displaying the voice mark icon in response to a sending instruction for the target picture.
In another specific embodiment, the apparatus 2600 further comprises: a playing interface display unit configured to display a playing interface for the first video; where the triggering unit is specifically configured to: in response to a jump instruction issued for the first video, determining the video frame displayed after the jump as the target picture, and displaying the voice mark icon on the playing interface; or, in response to a pause instruction issued for the first video, determining the video frame whose display is paused as the target picture, and displaying the voice mark icon on the playing interface.
In one embodiment, the display unit 2610 is further configured to: and displaying prompt information in the recording interface, wherein the prompt information is used for prompting a user of an operation mode of adding a voice mark to the target picture.
In one embodiment, the acquisition unit 2620 is specifically configured to: continuously collecting a voice signal in response to a recording start instruction issued based on a first position of the target picture in the recording interface; and the adding unit 2640 is specifically configured to: adding the voice mark at the first position of the target picture.
In one embodiment, the recording start instruction corresponds to a long-press operation on the target picture, and the recording end instruction corresponds to releasing the press on the target picture.
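This gesture mapping can be sketched with the standard MediaRecorder API, as below; the element handle and callback are assumptions, and a real implementation might additionally require a minimum hold duration before treating the press as a recording start.

```typescript
// Sketch of the long-press gesture: pressing the target picture starts
// recording, and releasing stops it and yields the audio file.
function bindLongPressRecording(
  pictureEl: HTMLElement,
  onAudioReady: (audio: Blob, x: number, y: number) => void,
): void {
  let recorder: MediaRecorder | null = null;
  let pressed = false;

  pictureEl.addEventListener("pointerdown", async (ev) => {
    pressed = true;
    const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
    if (!pressed) {
      // The press was released before the microphone became available.
      stream.getTracks().forEach((t) => t.stop());
      return;
    }
    const chunks: Blob[] = [];
    recorder = new MediaRecorder(stream);
    recorder.ondataavailable = (e) => chunks.push(e.data);
    recorder.onstop = () => {
      stream.getTracks().forEach((t) => t.stop());
      // Report the audio together with the press position, i.e. the first
      // position at which the voice mark will be added.
      onAudioReady(new Blob(chunks, { type: "audio/webm" }), ev.offsetX, ev.offsetY);
    };
    recorder.start();
  });

  pictureEl.addEventListener("pointerup", () => {
    pressed = false;
    recorder?.stop(); // releasing the press is the recording end instruction
    recorder = null;
  });
}
```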
In one embodiment, the adding unit 2640 is further configured to: displaying a sequence number in the voice mark, where the sequence number is determined based on the number of voice marks previously added to the target picture, or is custom-input by the user.
In one embodiment, the storage unit 2630 is further configured to: performing voice recognition on the collected voice signal to obtain a recognition text; and determining a subject text corresponding to the recognition text; where the adding unit 2640 is further configured to: displaying the subject text in the voice mark.
In a specific embodiment, the storage unit 2630 is specifically configured to: inputting the recognition text into a pre-trained abstract extraction model to obtain a corresponding abstract text as the subject text; or inputting the recognition text into a pre-trained keyword extraction model to obtain a corresponding keyword as the subject text.
In one embodiment, the apparatus 2600 further comprises: the receiving unit is configured to receive a custom text input by a user based on the voice mark; a first text display unit configured to display the custom text in the voice tag.
In one embodiment, the apparatus 2600 further comprises: and the playing unit is configured to respond to a triggering instruction of the voice mark and play the audio file.
In one embodiment, the apparatus 2600 further comprises: a moving unit configured to move the voice tag to a specified position in the target picture in response to a movement instruction for the voice tag.
In one embodiment, the apparatus 2600 further comprises: a deleting unit configured to delete the voice mark from the target picture in response to a deletion instruction for the voice mark.
In one embodiment, the apparatus 2600 further comprises: a menu bar display unit configured to display a menu bar in response to a triggering instruction for the voice mark, the menu bar including a voice recognition icon; a recognition unit configured to perform voice recognition on the audio file to obtain a recognition text in response to a triggering instruction for the voice recognition icon; and a second text display unit configured to display the recognition text in the recording interface.
In a specific embodiment, the apparatus 2600 further comprises: a hiding unit configured to hide the recognition text in the recording interface in response to a hiding instruction for the recognition text.
FIG. 27 shows a diagram of a picture viewing device structure, according to one embodiment. As shown in fig. 27, the apparatus 2700 includes:
a display unit 2710 configured to display a picture with voice tags added thereto by the above apparatus 2600; the playing unit 2720 is configured to play the audio file associated with the voice tag in response to a triggering instruction for the voice tag.
In one embodiment, the display unit 2710 is specifically configured to: displaying the picture in a chat window; or, loading the picture in a webpage.
In a specific embodiment, the picture includes a target commodity, and the webpage is a commodity detail page.
In an embodiment, the picture carries a plurality of voice marks, and the playing unit 2720 is specifically configured to: sequentially playing the audio files corresponding to the voice marks based on the order in which the voice marks were added to the picture.
In another embodiment, the picture carries a plurality of voice marks, each displaying a corresponding sequence number; the playing unit 2720 is specifically configured to: sequentially playing the audio files corresponding to the voice marks based on the order of the sequence numbers.
FIG. 28 illustrates a block diagram of an apparatus for voice tagging video, according to one embodiment. As shown in fig. 28, the apparatus 2800 includes:
a display unit 2810 configured to display a recording interface including a first video, the first video including a first video frame; a determining unit 2820, configured to determine the first video frame as a target video frame in response to a selection instruction for the first video frame; an acquisition unit 2830 configured to continuously acquire a voice signal in response to a recording start instruction issued based on the recording interface; a storage unit 2840 configured to store the acquired voice signal as an audio file in response to a recording end instruction issued based on the recording interface; an adding unit 2850 configured to add a voice tag associated with the audio file on the target video frame.
In one embodiment, the acquisition unit 2830 is specifically configured to: continuously acquiring voice signals in response to a recording start instruction issued based on a first position of a target video frame in the recording interface; the adding unit 2850 is specifically configured to: adding the voice tag at a first location of the target video frame.
In one embodiment, the determining unit 2820 is specifically configured to: determining the first video frame displayed after the jump as the target video frame in response to a jump instruction issued for the first video; or, in response to a pause instruction issued for the first video, determining the first video frame whose display is paused as the target video frame.
In one embodiment, the apparatus 2800 further comprises: a progress bar marking unit configured to add a time point mark in the progress bar of the first video, the time point mark corresponding to a playing time point of the target video frame.
In a specific embodiment, the apparatus 2800 further comprises: a first presentation unit configured to present the number of voice tags included in the target video frame in response to movement of an input control indicator to the time point tag.
In another specific embodiment, the storage unit 2840 is further configured to: performing voice recognition on the collected voice signal to obtain a recognition text; and determining a subject text corresponding to the recognition text. The apparatus further comprises: a second presentation unit configured to present the subject text in response to the input control indicator moving to the time point mark.
FIG. 29 shows a video viewing device structure diagram according to one embodiment. As shown in fig. 29, the apparatus 2900 includes:
a display unit 2910 configured to display a video with a voice tag added to the video through the above apparatus 2800, the video including a first video frame with a first voice tag; a playing unit 2920 configured to play the audio file associated with the first voice tag in response to the triggering instruction for the first voice tag.
In one embodiment, the first video frame includes a plurality of voice marks including the first voice mark, and the progress bar of the video displays a time point mark at the playing time point of the first video frame; the apparatus further comprises: a presentation unit configured to present the number of the voice marks and/or the subject texts corresponding to the voice marks in response to the input control indicator moving to the time point mark.
FIG. 30 illustrates a block diagram of an apparatus for text marking a picture, according to one embodiment. As shown in fig. 30, the apparatus 3000 comprises:
a display unit 3010 configured to display an editing interface for the target picture; a collecting unit 3020 configured to continuously collect a voice signal in response to a recording start instruction issued based on the editing interface; the recognition unit 3030 is configured to respond to a recording ending instruction sent based on the editing interface, perform voice recognition on the collected voice signal, and obtain a recognition text; an adding unit 3040 configured to add a text label associated with the recognition text on the target picture.
In one embodiment, the acquisition unit 3020 is specifically configured to: continuously collecting a voice signal in response to a recording start instruction issued based on a first position of the target picture in the editing interface; and the adding unit 3040 is specifically configured to: adding the text mark at the first position of the target picture.
In one embodiment, the adding unit 3040 is further configured to: displaying a sequence number in the text label, the sequence number being determined based on the number of previous text labels added to the target picture.
In one embodiment, the adding unit 3040 is further configured to: displaying the recognition text in the text mark in a folded state. The apparatus further comprises an expanding unit configured to expand and display the recognition text in the text mark in response to a triggering instruction for the text mark.
In one embodiment, the apparatus 3000 further comprises: the determining unit is configured to determine a subject text corresponding to the recognition text; the adding unit 3040 is specifically configured to: displaying the subject text in the text label.
In one embodiment, the apparatus 3000 further includes a presentation unit configured to present the identification text in response to a trigger instruction for the text mark.
FIG. 31 illustrates a picture viewing device structure diagram according to one embodiment. As shown in fig. 31, the apparatus 3100 includes:
a display unit 3110 configured to display a picture with text labels added to the picture by the above apparatus 3000; the presentation unit 3120 is configured to present the identification text associated with the text label in response to a trigger instruction for the text label.
In one embodiment, the display unit 3110 is specifically configured to: displaying a sequence number in the text mark; or displaying, in the text mark, the subject text corresponding to the recognition text.
In one embodiment, the display unit 3110 is specifically configured to: displaying the recognition text in the text mark in a folded state; and the presentation unit 3120 is specifically configured to: expanding and displaying the recognition text in the text mark.
FIG. 32 illustrates a block diagram of an apparatus for text marking a video, according to one embodiment. As shown in fig. 32, the device 3200 includes:
a display unit 3210 configured to display an editing interface including a first video, the first video including a first video frame; a determining unit 3220 configured to determine the first video frame as a target video frame in response to a selection instruction for the first video frame; a collecting unit 3230 configured to continuously collect a voice signal in response to a recording start instruction issued based on the editing interface; the recognition unit 3240 is configured to perform voice recognition on the acquired voice signal in response to a recording end instruction sent based on the editing interface to obtain a recognition text; an adding unit 3250 configured to add a text label associated with the identification text on the target video frame.
In one embodiment, the collecting unit 3230 is configured to continuously collect a voice signal in response to a recording start instruction issued based on a first position of the target video frame in the editing interface; and the adding unit 3250 is specifically configured to: adding the text mark at the first position of the target video frame.
FIG. 33 shows a video viewing device structure diagram according to one embodiment. As shown in fig. 33, the apparatus 3300 includes:
a display unit 3310 configured to display a video with a text label added to the video by the above-described apparatus 3200, the video including a first video frame with a first text label; a presentation unit 3320 configured to present the identification text associated with the text label in response to a trigger instruction for the first text label.
It should be noted that the voice mark adding function described above can be applied in a variety of scenes. The method for adding voice marks is further described below with reference to specific application scenarios. Specifically:
fig. 34 is a flowchart of a picture processing method according to an embodiment, and an execution subject of the method may be any device, equipment, platform, or equipment cluster with computing and processing capabilities, for example, an instant messaging client. As shown in fig. 34, the method includes the steps of:
step S3410, displaying a chat interface, and receiving a target picture to be sent selected based on the chat interface; step S3420, responding to an editing instruction aiming at the target picture, entering a picture editing interface, wherein a voice mark icon is displayed; step S3430, responding to a trigger instruction for the voice mark icon, and entering a recording interface; step S3440, storing the voice signal collected based on the recording interface as an audio file, and adding a voice mark associated with the audio file on the target picture.
In one embodiment, step S3440 may include: displaying the chat nickname and/or the chat head portrait of the current editing user in the voice mark. In one example, in fig. 35, the target picture 3501 sent by the current user and displayed in the chat window carries a voice mark 3502 displaying the current user's head portrait.
In one embodiment, after performing step S3440, the method may further include: and sending the target picture with the voice tag in response to the sending instruction of the target picture.
In this way, adding a voice mark to a picture and sending the voice-marked target picture can be achieved in chat software.
Fig. 36 is a flowchart of a picture processing method according to another embodiment, and an execution subject of the method may be any device, equipment, platform, or equipment cluster with computing and processing capabilities, for example, an instant messaging client. As shown in fig. 36, the method includes the steps of:
step S3610, displaying a chat interface, wherein a chat window of the chat interface comprises a target picture; step S3620, responding to a trigger instruction aiming at the target picture, and displaying a menu bar, wherein the menu bar comprises a voice mark icon; step S3630, responding to a trigger instruction of the voice mark icon, and entering a recording interface; step S3640, storing the voice signal collected based on the recording interface as an audio file, and adding a voice tag associated with the audio file to the target picture.
Regarding the above steps, in one example, the chat window of the chat interface shown in fig. 3 includes a target picture 32; a menu bar 33 is displayed in response to a triggering instruction for the target picture; the menu bar 33 includes a voice mark icon 34; and a recording interface is entered in response to a triggering instruction for the voice mark icon. Further, a voice mark can be added to the target picture.
In one embodiment, before step S3610, the method may further include: receiving the target picture from the contact corresponding to the chat window, where the target picture carries an existing voice mark in which the chat nickname and/or chat head portrait of the contact is displayed. Accordingly, step S3640 may include: displaying the chat nickname and/or the chat head portrait of the current editing user in the voice mark.
In one example, fig. 37 shows a schematic diagram of a chat interface according to another embodiment. The chat window of the interface includes a picture 3701 received from the current contact, on which a voice mark 3702 displays the head portrait of the current contact; the chat window also includes a picture 3703 sent after the current user edited the picture 3701, which, compared with the picture 3701, additionally carries a voice mark 3704 displaying the head portrait of the current user.
In one embodiment, after performing step S3640, the method may further include: and responding to an exit instruction aiming at the recording interface, and updating and displaying the original target picture as the target picture with the voice mark in the chat window. Further, in a specific embodiment, a prompt message may be displayed in the chat window to inform each contact that the target picture has been modified.
In one example, fig. 38 illustrates a chat interface diagram according to yet another embodiment, in which the picture 3801 in the chat window is updated and displayed as a picture 3803 with a voice mark 3802, and a prompt 3804 is displayed indicating that a contact has added a voice mark in the picture "1.jpg".
Therefore, in the instant messaging scene, different users can add voice marks to the same picture respectively, and therefore communication efficiency among the users can be improved.
Fig. 39 is a flowchart of a picture processing method according to still another embodiment, and an execution subject of the method may be any device, equipment, platform, or equipment cluster with computing and processing capabilities, for example, an instant messaging client. As shown in fig. 39, the method includes the steps of:
step S3910, displaying a chat interface, wherein a chat window of the chat interface comprises a target picture with a first voice mark, and the first voice mark is added by a current contact corresponding to the chat window; step S3920, responding to a trigger instruction of the first voice mark, and displaying a menu bar, wherein the menu bar comprises a voice reply icon; step S3930, responding to a trigger command of the voice reply icon, and entering a recording interface; step S3940, storing the voice signal collected based on the recording interface as an audio file, and adding a second voice mark related to the audio file in an area adjacent to the first voice mark in the target picture; or adding the voice signal acquired based on the recording interface into the audio file corresponding to the first voice mark.
In an example, fig. 40 shows a schematic diagram of a chat interface according to still another embodiment. As shown in fig. 40, the chat window includes a picture 4002 with a voice mark 4001; a menu bar is displayed in response to a click instruction on the voice mark 4001; the menu bar includes a voice reply icon 4003; and a recording interface is entered in response to a triggering instruction on the voice reply icon 4003, after which a voice mark 4004 can be added near the voice mark 4001 for the user's input voice. Alternatively, as shown in fig. 41, the user's input voice can be added to the audio file corresponding to the voice mark 4001, and the voice mark 4001 displaying sequence number 1 is updated to the voice mark 4101 displaying sequence number 2, where sequence numbers 1 and 2 mean: the number of different users who edited, or the total number of edits.
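The two reply strategies can be sketched as follows, under the assumption that each voice mark keeps a list of audio clips and a displayed sequence number; the data shapes and the pixel offset are illustrative only.

```typescript
// Sketch of the reply strategies from figs. 40 and 41.
interface ReplyableVoiceMark {
  x: number;
  y: number;
  clips: Blob[];
  sequenceNumber: number; // number of editing users, or total edits
}

function replyAdjacent(marks: ReplyableVoiceMark[], target: ReplyableVoiceMark, clip: Blob): void {
  // Strategy 1: add a second mark next to the first one (fig. 40).
  marks.push({ x: target.x + 24, y: target.y, clips: [clip], sequenceNumber: 1 });
}

function replyAppend(target: ReplyableVoiceMark, clip: Blob): void {
  // Strategy 2: append the reply to the existing mark's audio (fig. 41).
  target.clips.push(clip);
  target.sequenceNumber += 1; // e.g. sequence number 1 -> 2
}
```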
In one embodiment, after step S3940 is performed, the method may further include: in response to an exit instruction for the recording interface, displaying prompt information in the chat window to inform all contacts that the voice mark in the target picture has been modified. In one example, fig. 41 shows a schematic diagram of a chat interface according to yet another embodiment; as shown in fig. 41, a prompt 4102 is displayed in the chat window, indicating that the voice mark content in the picture "1.jpg" has been updated and can be clicked to listen to.
In this way, direct replies to voice marks can be achieved in an instant messaging scene; such replies are clearly targeted, allowing users to communicate about the content at a designated position of a picture conveniently, intuitively, and quickly.
Fig. 42 shows a flowchart of a picture processing method according to still another embodiment, and an execution subject of the method may be any device, apparatus, platform, or device cluster with computing and processing capabilities, for example, picture editing software. As shown in fig. 42, the method includes the following steps:
step S4210, displaying a picture editing interface containing a target picture, wherein a function menu of the interface comprises a voice mark icon; step S4220, responding to a trigger instruction of the voice mark icon, and entering a voice mark interface; step S4230, converting the input text received based on the voice tag interface into an audio file, and adding a voice tag associated with the audio file to the target picture.
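Step S4230 can be sketched as below, assuming a hypothetical /tts endpoint that synthesizes speech from the input text; the endpoint is an illustrative stand-in, not part of the disclosure.

```typescript
// Sketch of step S4230: turning the received input text into the audio
// file that the new voice mark will be associated with.
async function textToVoiceMarkAudio(inputText: string): Promise<Blob> {
  const res = await fetch("/tts", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ text: inputText }),
  });
  // The synthesized speech comes back as an audio blob ready for storage.
  return await res.blob();
}
```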
Through the above steps, a voice mark can be added to a picture by means of text input, which enriches the interaction modes for adding voice marks and improves the user experience.
Fig. 43 shows a flowchart of a picture processing method according to an embodiment, and an execution subject of the method may be any device, apparatus, platform, or device cluster with computing and processing capabilities, for example, picture editing software. As shown in fig. 43, the method includes the following steps:
step S4310, displaying a chat interface containing the target picture; step S4320, responding to a trigger instruction of the target picture, displaying a menu bar, wherein the menu bar comprises a voice adding emoticon; step S4330, responding to a trigger instruction for adding an emoticon to the voice, and entering a voice adding emoticon interface; step S4340, converting the voice signal collected based on the voice adding expression interface into characters, and generating animation expressions based on the characters; step S4350, adding the animation expression in the target picture.
Regarding the above steps, in one embodiment, step S4340 may include: retrieving an original expression associated with the text from an expression library; and overlaying the text on the original expression to obtain the animated expression. According to one example, as shown in FIG. 44, the picture includes an animated expression 4401 added by a user through voice input (e.g., the input content is "Hahaha").
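A minimal sketch of step S4340 follows, assuming an /asr endpoint and a local expression library keyed by phrases; both are illustrative stand-ins rather than parts of the specification.

```typescript
// Sketch of generating an animated expression from a voice signal.
const expressionLibrary: Record<string, string> = {
  Hahaha: "/expressions/laugh.gif", // assumed asset path
};

async function buildAnimatedExpression(voice: Blob): Promise<{ gifUrl: string; caption: string }> {
  // Convert the collected voice signal into text (hypothetical endpoint).
  const res = await fetch("/asr", { method: "POST", body: voice });
  const { text } = (await res.json()) as { text: string };

  // Retrieve the original expression associated with the text, then pair
  // it with the text as an overlay caption to form the animated expression.
  const gifUrl = expressionLibrary[text] ?? "/expressions/default.gif";
  return { gifUrl, caption: text };
}
```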
With this method, in an instant messaging scene, a user can generate an animated expression through voice input to edit or reply to a target picture, which makes chatting between users more interesting and communication more convenient.
Fig. 45 shows a flowchart of a picture processing method according to another embodiment, and an execution subject of the method can be any device, equipment, platform, or equipment cluster with computing and processing capabilities, for example, an e-commerce platform. As shown in fig. 45, the method includes the steps of:
step S4510, displaying a commodity information editing interface, wherein the commodity information editing interface comprises a target picture for a target commodity; step S4520, responding to a trigger instruction of the target picture, and displaying a menu bar, wherein the menu bar comprises a voice mark icon; step S4530, responding to a trigger instruction of the voice mark icon, and entering a recording interface; and S4540, storing the voice signals collected based on the recording interface as an audio file, and adding a voice mark associated with the audio file on the target picture.
According to an example, fig. 46 shows a schematic diagram of a product information editing interface according to an embodiment. As shown in fig. 46, the product information editing interface 4601 includes a product picture 4602; a menu bar including a voice mark icon 4603 may be displayed in response to a triggering instruction for the product picture 4602; and further, by triggering the voice mark icon 4603 to enter the recording interface, the voice marks on the product picture 4602 can be edited, including being added, deleted, and modified.
In one embodiment, after step S4540, the method may further include: displaying the target picture in a product detail page for the target product. In one example, fig. 12 shows a product detail page including a voice-marked target picture 121.
In this way, the target product can be presented to users more intuitively through voice-marked product introduction pictures on the product detail page.
Fig. 47 shows a flowchart of a picture processing method according to another embodiment, and an execution subject of the method can be any device, equipment, platform, or equipment cluster with computing and processing capabilities, for example, an e-commerce platform. As shown in fig. 47, the method includes the steps of:
step S4710, displaying an order evaluation interface for a first order, where the order evaluation interface includes an add-picture icon; step S4720, receiving a selected target picture in response to a triggering instruction for the add-picture icon; step S4730, entering a recording interface in response to a voice marking instruction issued for the target picture; step S4740, storing the voice signal collected based on the recording interface as an audio file, and adding a voice mark associated with the audio file on the target picture.
In this way, users can evaluate products in a more targeted manner.
Fig. 48 shows a flowchart of a picture processing method according to still another embodiment, and an execution subject of the method may be any device, equipment, platform, or equipment cluster with computing and processing capabilities, for example, an e-commerce platform. As shown in fig. 48, the method includes the steps of:
step S4810, displaying a commodity evaluation interface for a target commodity, wherein the commodity evaluation interface comprises a first user evaluation, and the first user evaluation comprises a target picture with a first voice mark; step S4820, responding to a trigger instruction for the target picture, displaying a menu bar, wherein the menu bar comprises a voice reply icon; step S4830, responding to a trigger instruction of the voice reply icon, and entering a recording interface; step S4840, storing the voice signal collected based on the recording interface as an audio file, and adding a second voice tag related to the audio file in the area adjacent to the first voice tag in the target picture; or adding the voice signal acquired based on the recording interface into the audio file corresponding to the first voice mark.
Regarding the above steps, in one embodiment, after step S4840 is performed, the method may further include: in response to an exit instruction for the recording interface, displaying, in the product evaluation interface, a notification message for the first user evaluation, to notify that the first user evaluation has been replied to.
In this way, sellers and buyers, or buyers and other buyers, can communicate quickly and pertinently based on product pictures in an evaluation scene.
FIG. 49 is a flow diagram of a method of picture processing according to yet another embodiment, the subject of execution of which may be a customer service platform. As shown in fig. 49, the method includes the steps of:
step S4910, receiving a conversation message sent by a user, wherein the conversation message comprises a target picture with a first voice mark; step S4920, acquiring an audio file associated with the first voice tag, and performing voice recognition on the audio file to obtain a recognition text; step S4930, inputting the recognition text into a pre-trained user question prediction model, and outputting a corresponding user standard question; step S4940, feeding back the answer to the question corresponding to the user standard question to the user.
Regarding the above steps, in one embodiment, step S4940 may include: converting the question answer into a reply audio, and adding a second voice mark associated with the reply audio in the target picture; and sending the target picture with the newly added second voice mark to the user.
In another embodiment, step S4940 may include: converting the question answer into a reply audio; adding a second voice mark associated with the reply audio in an area adjacent to the first voice mark in the target picture, or adding the reply audio to the audio file corresponding to the first voice mark; and sending prompt information to the user to prompt that reply content has been added to the target picture.
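The overall pipeline of steps S4920-S4940 can be sketched as follows, assuming hypothetical services for voice recognition, standard-question prediction, and answer lookup; all endpoint names and response shapes are illustrative.

```typescript
// Sketch of the customer-service flow in fig. 49.
async function answerVoiceMarkedQuestion(markAudio: Blob): Promise<string> {
  // Step S4920: recognize the audio attached to the first voice mark.
  const asr = await fetch("/asr", { method: "POST", body: markAudio });
  const { text } = (await asr.json()) as { text: string };

  // Step S4930: map the recognition text to a standard user question via
  // a pre-trained prediction model behind an assumed endpoint.
  const pred = await fetch("/predict-question", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ text }),
  });
  const { standardQuestion } = (await pred.json()) as { standardQuestion: string };

  // Step S4940: fetch the prepared answer for that standard question.
  const ans = await fetch(`/answers?q=${encodeURIComponent(standardQuestion)}`);
  const { answer } = (await ans.json()) as { answer: string };
  return answer;
}
```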
In this way, convenient communication between users and customer service can be achieved in a customer service scene.
Fig. 50 is a flowchart of a method for processing an electronic file according to still another embodiment, and an execution subject of the method may be any device, apparatus, platform, or apparatus cluster with computing and processing capabilities, such as office software or an auditing platform. As shown in fig. 50, the method includes the following steps:
step S5010, displaying a file processing interface aiming at a target electronic file, wherein a function menu bar of the file processing interface comprises a voice mark icon; step S5020, responding to a trigger instruction of the voice mark icon, and entering a recording interface; step S5030, storing the voice signal acquired based on the recording interface as an audio file, and adding a voice tag associated with the audio file to the target electronic file.
With respect to the above steps, in one embodiment, the target electronic file is an electronic contract, and the file processing interface is a contract approval interface. In one embodiment, the file format of the target electronic file is a Word document, a PDF document, or an Excel spreadsheet. In one example, FIG. 51 illustrates an office software interface according to one embodiment, in which the interface menu bar includes a voice mark icon 5101 and a voice mark 5102 is displayed in the Word document.
In this way, voice marks can be added to electronic files.
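One possible representation of such a file-level voice mark is a sidecar record that links the stored audio file to a location in the document. The sketch below uses a JSON sidecar next to the document; the patent does not prescribe any particular storage format, so this is purely illustrative.

```python
import json
import time

def add_voice_mark_to_file(doc_path: str, voice_signal: bytes,
                           page: int, note: str = "") -> None:
    # Store the collected voice signal as an audio file.
    audio_path = f"{doc_path}.{int(time.time())}.pcm"
    with open(audio_path, "wb") as f:
        f.write(voice_signal)
    # Register the voice mark in a sidecar JSON file alongside the document.
    sidecar = f"{doc_path}.voicemarks.json"
    try:
        with open(sidecar, encoding="utf-8") as f:
            marks = json.load(f)
    except FileNotFoundError:
        marks = []
    marks.append({"page": page, "audio": audio_path, "note": note})
    with open(sidecar, "w", encoding="utf-8") as f:
        json.dump(marks, f, ensure_ascii=False, indent=2)
```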
Fig. 52 is a flowchart of a picture processing method according to an embodiment; the execution subject of the method is a live broadcast platform. As shown in fig. 52, the method includes:
step S5210, displaying a commodity information editing interface, wherein the commodity information editing interface includes a target picture of a target commodity to be shelved; step S5220, displaying a menu bar in response to a trigger instruction for the target picture, wherein the menu bar includes a voice mark icon; step S5230, entering a recording interface in response to a trigger instruction for the voice mark icon; step S5240, storing the voice signal collected based on the recording interface as an audio file, and adding a voice mark associated with the audio file on the target picture.
In one embodiment, after step S5240, the method further comprises: displaying a live broadcast interface, wherein the live broadcast interface includes a commodity shelving icon; displaying pictures of commodities to be shelved in response to a trigger instruction for the commodity shelving icon, wherein the pictures include the target picture; and displaying the target picture in a commodity display window of the live broadcast interface in response to a selection instruction for the target picture. In one example, fig. 53 illustrates a live broadcast interface according to an embodiment, in which a target picture 5302 with a voice mark 5301 is shown in the commodity display window.
In this way, commodity pictures with voice marks can be displayed in the live broadcast interface, allowing viewers of the live stream to quickly learn more about the target commodity.
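The live-broadcast flow reduces to a small amount of interface state: a pool of marked pictures awaiting shelving, and a display window that shows the selected one. The class below is a hypothetical sketch of that flow, not the platform's actual API.

```python
class LiveBroadcastInterface:
    def __init__(self):
        self.pictures_to_shelve = []   # marked commodity pictures awaiting listing
        self.display_window = None     # commodity display window shown to viewers

    def on_shelving_icon_triggered(self):
        # Show the pictures of commodities to be shelved, including the target picture.
        return self.pictures_to_shelve

    def on_picture_selected(self, picture):
        # Display the selected target picture (with its voice marks) to viewers.
        self.display_window = picture
```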
Corresponding to the above methods, the embodiments of the present specification further disclose various processing devices, described below:
FIG. 54 shows a diagram of a picture processing device architecture according to one embodiment. As shown in fig. 54, the apparatus 5400 includes:
a display unit 5410 configured to display a chat interface; a receiving unit 5420 configured to receive a to-be-sent target picture selected based on the chat interface; a first interface switching unit 5430 configured to enter a picture editing interface in which a voice mark icon is displayed, in response to an editing instruction for the target picture; a second interface switching unit 5440 configured to enter a recording interface in response to a trigger instruction for the voice mark icon; a marking unit 5450 configured to store the voice signal collected based on the recording interface as an audio file, and add a voice mark associated with the audio file on the target picture.
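The unit decomposition above maps naturally onto a single class whose methods correspond to the numbered units. The sketch below (reusing the hypothetical VoiceMark and save_audio helpers from the earlier sketch) is one possible realization of device 5400; all method names are illustrative.

```python
class PictureProcessingDevice:
    """One possible realization of device 5400's units; names are hypothetical."""

    def display_chat_interface(self):            # display unit 5410
        ...

    def receive_target_picture(self, picture):   # receiving unit 5420
        self.picture = picture

    def enter_editing_interface(self):           # first interface switching unit 5430
        ...

    def enter_recording_interface(self):         # second interface switching unit 5440
        ...

    def mark(self, voice_signal: bytes):         # marking unit 5450
        audio_path = save_audio(voice_signal)
        self.picture.marks.append(VoiceMark(x=0, y=0, audio_path=audio_path))
```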
Fig. 55 shows a configuration diagram of a picture processing apparatus according to another embodiment. As shown in fig. 55, the apparatus 5500 includes:
an interface display unit 5510 configured to display a chat interface, where a chat window of the chat interface includes a target picture; a menu bar display unit 5520 configured to display a menu bar including a voice tag icon in response to a trigger instruction for the target picture; the interface switching unit 5530 is configured to respond to a trigger instruction of the voice mark icon and enter a recording interface; the marking unit 5540 is configured to store the voice signal collected based on the recording interface as an audio file, and add a voice mark associated with the audio file to the target picture.
Fig. 56 shows a diagram of a picture processing apparatus according to still another embodiment. As shown in fig. 56, the apparatus 5600 includes:
an interface display unit 5610 configured to display a chat interface, where a chat window of the chat interface includes a target picture with a first voice mark, and the first voice mark is added by a current contact corresponding to the chat window; a menu bar display unit 5620 configured to display a menu bar including a voice reply icon in response to a trigger instruction for the first voice mark; an interface switching unit 5630 configured to enter a recording interface in response to a trigger instruction for the voice reply icon; a marking unit 5640 configured to store the voice signal collected based on the recording interface as an audio file, and add a second voice mark associated with the audio file in an area adjacent to the first voice mark in the target picture, or to add the voice signal collected based on the recording interface to the audio file corresponding to the first voice mark.
Fig. 57 shows a diagram of a picture processing apparatus according to still another embodiment. As shown in fig. 57, the apparatus 5700 includes:
a display unit 5710 configured to display a picture editing interface including a target picture, where a function menu of the picture editing interface includes a voice mark icon; an interface switching unit 5720 configured to enter a voice mark interface in response to a trigger instruction for the voice mark icon; a marking unit 5730 configured to convert the input text received based on the voice mark interface into an audio file, and add a voice mark associated with the audio file on the target picture.
Fig. 58 shows a diagram of a picture processing apparatus according to still another embodiment. As shown in fig. 58, the apparatus 5800 includes:
an interface display unit 5810 configured to display a chat interface including the target picture; a menu bar display unit 5820 configured to display a menu bar including a voice adding emoticon in response to a trigger instruction for the target picture; an interface switching unit 5830 configured to enter a voice adding emoticon interface in response to a trigger instruction for the voice adding emoticon; an expression generating unit 5840 configured to convert the voice signal collected based on the voice adding emoticon interface into text and generate an animation expression based on the text; an expression adding unit 5850 configured to add the animation expression in the target picture.
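A minimal sketch of the expression generating unit 5840's behavior follows: the collected voice signal is transcribed, a base expression is retrieved from the library by keyword, and the text is superimposed as a caption. The recognizer and library lookup are hypothetical placeholders.

```python
def recognize_signal(voice_signal: bytes) -> str:
    # Placeholder for speech-to-text; a real unit would call an ASR engine.
    return "haha"

def keyword_of(text: str) -> str:
    words = text.split()
    return words[0] if words else ""   # naive keyword pick, purely illustrative

def generate_animation_expression(voice_signal: bytes, expression_library: dict) -> dict:
    text = recognize_signal(voice_signal)
    base = expression_library.get(keyword_of(text), "default.gif")
    # The converted text is added onto the base expression as a caption.
    return {"base_animation": base, "caption": text}
```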
Fig. 59 shows a diagram of a picture processing device structure according to an embodiment, the device being integrated in an e-commerce platform, the device 5900 comprising:
an interface display unit 5910 configured to display a commodity information editing interface including a target picture for a target commodity; a menu bar display unit 5920 configured to display a menu bar in response to a trigger instruction for the target picture, where the menu bar includes a voice mark icon; an interface switching unit 5930 configured to enter a recording interface in response to a trigger instruction for the voice mark icon; a marking unit 5940 configured to store the voice signal collected based on the recording interface as an audio file, and add a voice mark associated with the audio file on the target picture.
Fig. 60 shows a diagram of a picture processing apparatus according to still another embodiment. As shown in fig. 60, the apparatus 6000 includes:
a display unit 6010 configured to display an order evaluation interface for a first order, where the order evaluation interface includes a picture adding icon; a receiving unit 6020 configured to receive the selected target picture in response to a trigger instruction for the picture adding icon; an interface switching unit 6030 configured to enter a recording interface in response to a voice marking instruction issued for the target picture; a marking unit 6040 configured to store the voice signal collected based on the recording interface as an audio file, and add a voice mark associated with the audio file to the target picture.
Fig. 61 shows a configuration diagram of a picture processing apparatus according to another embodiment. As shown in fig. 61, the apparatus 6100 includes:
an interface display unit 6110 configured to display a commodity evaluation interface for a target commodity, where the commodity evaluation interface includes a first user evaluation, and the first user evaluation includes a target picture with a first voice mark; a menu bar display unit 6120 configured to display a menu bar in response to a trigger instruction for the target picture, where the menu bar includes a voice reply icon; an interface switching unit 6130 configured to enter a recording interface in response to a trigger instruction for the voice reply icon; a marking unit 6140 configured to store the voice signal collected based on the recording interface as an audio file, and add a second voice mark associated with the audio file in an area adjacent to the first voice mark in the target picture, or to add the voice signal collected based on the recording interface to the audio file corresponding to the first voice mark.
Fig. 62 is a diagram illustrating a structure of a picture processing apparatus according to still another embodiment, the apparatus being integrated with a customer service platform, the apparatus 6200 including:
a receiving unit 6210 configured to receive a session message sent by a user, where the session message includes a target picture with a first voice mark; an obtaining unit 6220 configured to obtain an audio file associated with the first voice mark, and perform voice recognition on the audio file to obtain a recognition text; a prediction unit 6230 configured to input the recognition text into a pre-trained user question prediction model and output a corresponding user standard question; a feedback unit 6240 configured to feed back the question answer corresponding to the user standard question to the user.
Fig. 63 shows a configuration diagram of a processing apparatus of an electronic document according to an embodiment, the apparatus 6300 includes:
a display unit 6310 configured to display a file processing interface for a target electronic file, where a function menu bar of the file processing interface includes a voice mark icon; an interface switching unit 6320 configured to enter a recording interface in response to a trigger instruction for the voice mark icon; a marking unit 6330 configured to store the voice signal collected based on the recording interface as an audio file, and add a voice mark associated with the audio file to the target electronic file.
Fig. 64 shows a diagram of a picture processing device structure according to yet another embodiment, the device being integrated in a live platform, the device 6400 comprising:
an interface display unit 6410 configured to display a commodity information editing interface including a target picture for a target commodity to be shelved; a menu bar display unit 6420 configured to display a menu bar in response to a trigger instruction for the target picture, the menu bar including a voice mark icon; an interface switching unit 6430 configured to enter a recording interface in response to a trigger instruction for the voice mark icon; a marking unit 6440 configured to store the voice signal collected based on the recording interface as an audio file, and add a voice mark associated with the audio file on the target picture.
According to an embodiment of a further aspect, there is also provided a computer-readable storage medium having stored thereon a computer program which, when executed in a computer, causes the computer to perform the method described in connection with any of fig. 2, 11, 13, 16, 18, 20, 22, 24, 34, 36, 39, 42, 45, 47, 48, 49, 50, or 52.
According to an embodiment of yet another aspect, there is also provided a computing device comprising a memory and a processor, the memory having stored therein executable code which, when executed by the processor, implements the method described in connection with any of fig. 2, 11, 13, 16, 18, 20, 22, 24, 34, 36, 39, 42, 45, 47, 48, 49, 50, or 52.
Those skilled in the art will recognize that, in one or more of the examples described above, the functions described in this invention may be implemented in hardware, software, firmware, or any combination thereof. When implemented in software, the functions may be stored on or transmitted over as one or more instructions or code on a computer-readable medium.
The above embodiments further describe the objects, technical solutions, and advantages of the present invention in detail. It should be understood that the above are only exemplary embodiments of the present invention and are not intended to limit the scope of the present invention; any modifications, equivalent substitutions, improvements, and the like made on the basis of the technical solutions of the present invention shall be included in the scope of the present invention.

Claims (84)

1. A method of voice marking a picture, comprising:
displaying a recording interface comprising a target picture;
continuously acquiring voice signals in response to a recording start instruction sent based on the recording interface;
responding to a recording ending instruction sent out based on the recording interface, and storing the collected voice signal as an audio file;
and adding a voice mark associated with the audio file on the target picture.
2. The method of claim 1, wherein prior to displaying the recording interface including the target picture, the method further comprises:
displaying a voice mark icon in response to a trigger instruction for the target picture;
wherein displaying the recording interface including the target picture comprises:
jumping to the recording interface in response to a trigger instruction for the voice mark icon.
3. The method of claim 2, wherein displaying the voice mark icon in response to the trigger instruction for the target picture comprises:
displaying the voice mark icon in response to a viewing instruction for the target picture; or,
displaying the voice mark icon in response to an editing instruction for the target picture; or,
displaying the voice mark icon in response to a screen capture instruction for capturing the target picture; or,
displaying the voice mark icon in response to a sending instruction for the target picture.
4. The method of claim 2, wherein prior to displaying the voice mark icon in response to the trigger instruction for the target picture, the method further comprises:
displaying a playing interface for a first video;
wherein displaying the voice mark icon in response to the trigger instruction for the target picture comprises:
determining the video frame displayed after a jump as the target picture in response to a jump instruction issued for the first video, and displaying the voice mark icon on the playing interface; or,
determining the video frame whose display is paused as the target picture in response to a pause instruction issued for the first video, and displaying the voice mark icon on the playing interface.
5. The method of claim 1, wherein continuously acquiring voice signals in response to a recording start instruction issued based on the recording interface comprises:
continuously acquiring voice signals in response to a recording start instruction issued based on a first position of the target picture in the recording interface;
wherein adding a voice mark associated with the audio file on the target picture comprises:
adding the voice mark at the first position of the target picture.
6. The method of claim 1, wherein the recording start instruction corresponds to: a long-press operation on the target picture, and the recording end instruction corresponds to: releasing the press on the target picture; or,
the recording start instruction corresponds to: a click of the right mouse button, and the recording end instruction corresponds to: another click of the right mouse button; or,
the recording start instruction corresponds to: a trigger instruction for a recording start icon in the recording interface, and the recording end instruction corresponds to: a trigger instruction for a recording end icon in the recording interface.
7. The method of claim 1, wherein adding a voice mark associated with the audio file on the target picture further comprises:
displaying a sequence number in the voice mark, wherein the sequence number is determined based on the number of voice marks previously added to the target picture, or is custom-input by the user.
8. The method of claim 1, wherein storing the captured voice signal as an audio file in response to a recording end instruction issued based on the recording interface further comprises:
carrying out voice recognition on the collected voice signal to obtain a recognition text;
determining a subject text corresponding to the recognition text;
wherein adding a voice mark associated with the audio file on the target picture further comprises:
displaying the subject text in the voice mark.
9. The method of claim 8, wherein determining the subject text corresponding to the recognition text comprises:
inputting the recognition text into a pre-trained abstract extraction model to obtain a corresponding abstract text as the subject text; or,
inputting the recognition text into a pre-trained keyword extraction model to obtain corresponding keywords as the subject text.
10. The method of claim 1, wherein after adding a voice mark associated with the audio file on the target picture, the method further comprises:
receiving custom text input by the user for the voice mark;
displaying the custom text in the voice mark.
11. The method of claim 1, wherein after adding a voice mark associated with the audio file on the target picture, the method further comprises:
and responding to a triggering instruction of the voice mark, and playing the audio file.
12. The method of claim 1, wherein after adding a voice mark associated with the audio file on the target picture, the method further comprises:
and responding to a moving instruction of the voice mark, and moving the voice mark to a specified position in the target picture.
13. The method of claim 1, wherein after adding a voice mark associated with the audio file on the target picture, the method further comprises:
and in response to a deletion instruction for the voice mark, deleting the voice mark from the target picture.
14. The method of claim 1, wherein after adding a voice mark associated with the audio file on the target picture, the method further comprises:
responding to a triggering instruction of the voice mark, and displaying a menu bar, wherein the menu bar comprises a voice recognition icon;
responding to a triggering instruction of the voice recognition icon, and performing voice recognition on the audio file to obtain a recognition text;
and displaying the recognition text in the recording interface.
15. A picture viewing method, comprising:
displaying a picture with a voice mark, the voice mark being added to the picture by the method of claim 1;
and responding to a triggering instruction of the voice mark, and playing an audio file associated with the voice mark.
16. The method of claim 15, wherein displaying the picture with the voice mark comprises:
displaying the picture in a chat window; or,
loading the picture in a web page.
17. The method of claim 16, wherein the picture includes a target commodity, and the web page is a commodity detail page.
18. The method of claim 15, wherein the picture has a plurality of voice marks, and wherein playing the audio file associated with the voice mark comprises:
sequentially playing the audio files corresponding to the plurality of voice marks, based on the order in which the voice marks were added to the picture.
19. The method of claim 15, wherein the picture has a plurality of voice marks, each voice mark displaying a corresponding sequence number;
wherein playing the audio file associated with the voice mark comprises:
sequentially playing the audio files corresponding to the plurality of voice marks, based on the order of the sequence numbers.
20. A method of voice marking a video, comprising:
displaying a recording interface comprising a first video, the first video comprising a first video frame;
responding to a selection instruction of the first video frame, and determining the first video frame as a target video frame;
continuously acquiring voice signals in response to a recording start instruction sent based on the recording interface;
responding to a recording ending instruction sent out based on the recording interface, and storing the collected voice signal as an audio file;
and adding a voice mark associated with the audio file on the target video frame.
21. The method of claim 20, wherein continuously acquiring voice signals in response to a recording start instruction issued based on the recording interface comprises:
continuously acquiring voice signals in response to a recording start instruction issued based on a first position of the target video frame in the recording interface;
wherein adding a voice mark associated with the audio file on the target video frame comprises:
adding the voice mark at the first position of the target video frame.
22. The method of claim 20, wherein determining the first video frame as the target video frame in response to a selection instruction for the first video frame comprises:
determining the first video frame displayed after a jump as the target video frame, in response to a jump instruction issued for the first video; or,
determining the first video frame whose display is paused as the target video frame, in response to a pause instruction issued for the first video.
23. The method of claim 20, wherein after storing the captured voice signal as an audio file in response to a recording end instruction issued based on the recording interface, the method further comprises:
adding a time point mark in the progress bar of the first video, wherein the time point mark corresponds to the playing time point of the target video frame.
24. The method of claim 23, wherein after adding the time point mark in the progress bar of the first video, the method further comprises:
presenting the number of voice marks included in the target video frame, in response to movement of an input control indicator to the time point mark.
25. The method of claim 23, wherein storing the captured voice signal as an audio file in response to a recording end instruction issued based on the recording interface further comprises:
carrying out voice recognition on the collected voice signal to obtain a recognition text;
determining a subject text corresponding to the recognition text;
wherein after adding the time point mark in the progress bar of the first video, the method further comprises:
presenting the subject text in response to movement of an input control indicator to the time point mark.
26. A video viewing method, comprising:
displaying a video with a voice mark, the voice mark being added to the video by the method of claim 20, the video including a first video frame with a first voice mark;
and responding to a triggering instruction of the first voice mark, and playing an audio file associated with the first voice mark.
27. The method of claim 26, wherein the first video frame includes a plurality of voice marks including the first voice mark, and a time point mark is displayed in the progress bar of the video at the playing time point of the first video frame;
wherein after displaying the video with the voice marks, the method further comprises:
in response to the input control indicator moving to the time point mark, showing the number of the voice marks and/or the subject texts corresponding to the voice marks.
28. A method of text marking a picture, comprising:
displaying an editing interface for a target picture;
responding to a recording start instruction issued based on the editing interface, and continuously acquiring voice signals;
responding to a recording end instruction issued based on the editing interface, and performing voice recognition on the collected voice signal to obtain a recognition text;
and adding a text mark associated with the recognition text on the target picture.
29. The method of claim 28, wherein continuously acquiring voice signals in response to a recording start instruction issued based on the editing interface comprises:
continuously acquiring voice signals in response to a recording start instruction issued based on a first position of the target picture in the editing interface;
wherein adding a text mark associated with the recognition text on the target picture comprises:
adding the text mark at the first position of the target picture.
30. The method of claim 28, wherein adding a text mark associated with the recognition text on the target picture further comprises:
displaying a sequence number in the text mark, the sequence number being determined based on the number of text marks previously added to the target picture.
31. The method of claim 28, wherein adding a text mark associated with the recognition text on the target picture further comprises:
displaying the recognition text in the text mark in a folded state;
wherein, after adding the text mark associated with the recognition text on the target picture, the method further comprises:
expanding and displaying the recognition text in the text mark in response to a trigger instruction for the text mark.
32. The method of claim 28, wherein after obtaining the recognized text, the method further comprises:
determining a subject text corresponding to the recognition text;
wherein adding the text mark associated with the recognition text on the target picture comprises:
displaying the subject text in the text mark.
33. The method of claim 28, wherein after adding the text mark associated with the recognition text on the target picture, the method further comprises:
and displaying the recognition text in response to a trigger instruction for the text mark.
34. A picture viewing method, comprising:
displaying a picture with a text mark, the text mark being added to the picture by the method of claim 31;
and displaying the recognition text associated with the text mark in response to a trigger instruction for the text mark.
35. The method of claim 34, wherein displaying the picture with the text mark further comprises:
displaying a sequence number in the text mark; or,
displaying the subject text corresponding to the recognition text in the text mark.
36. The method of claim 34, wherein displaying the picture with the text mark further comprises:
displaying the recognition text in the text mark in a folded state;
wherein displaying the recognition text associated with the text mark comprises:
expanding and displaying the recognition text in the text mark.
37. A method of text marking a video, comprising:
displaying an editing interface comprising a first video, the first video comprising a first video frame;
responding to a selection instruction of the first video frame, and determining the first video frame as a target video frame;
responding to a recording starting instruction sent out based on the editing interface, and continuously acquiring voice signals;
responding to a recording ending instruction sent out based on the editing interface, and performing voice recognition on the collected voice signal to obtain a recognition text;
and adding a text mark associated with the recognition text on the target video frame.
38. The method of claim 37, wherein continuously acquiring voice signals in response to a recording start instruction issued based on the editing interface comprises:
continuously acquiring voice signals in response to a recording start instruction issued based on a first position of the target video frame in the editing interface;
wherein adding the text mark associated with the recognition text on the target video frame comprises:
adding the text mark at the first position of the target video frame.
39. A video viewing method, comprising:
displaying a video with a text mark, the text mark being added to the video by the method of claim 37, the video including a first video frame with a first text mark;
and in response to a trigger instruction for the first text mark, displaying the recognition text associated with the first text mark.
40. An apparatus for voice marking a picture, comprising:
the display unit is configured to display a recording interface comprising a target picture;
the acquisition unit is configured to respond to a recording start instruction sent out based on the recording interface and continuously acquire voice signals;
the storage unit is configured to respond to a recording ending instruction sent out based on the recording interface and store the collected voice signal as an audio file;
an adding unit configured to add a voice tag associated with the audio file on the target picture.
41. A picture viewing device, comprising:
a display unit configured to display a picture with a voice mark, the voice mark being added to the picture by the apparatus of claim 40;
and the playing unit is configured to respond to a triggering instruction of the voice mark and play the audio file associated with the voice mark.
42. An apparatus for voice marking a video, comprising:
a display unit configured to display a recording interface including a first video, the first video including a first video frame;
the determining unit is configured to respond to a selection instruction of the first video frame, and determine the first video frame as a target video frame;
the acquisition unit is configured to respond to a recording start instruction sent out based on the recording interface and continuously acquire voice signals;
the storage unit is configured to respond to a recording ending instruction sent out based on the recording interface and store the collected voice signal as an audio file;
an adding unit configured to add a voice tag associated with the audio file on the target video frame.
43. A video viewing device, comprising:
a display unit configured to display a video with a voice mark, the voice mark being added to the video by the apparatus of claim 42, the video comprising a first video frame with a first voice mark;
and the playing unit is configured to respond to a triggering instruction of the first voice mark and play the audio file associated with the first voice mark.
44. An apparatus for text marking a picture, comprising:
a display unit configured to display an editing interface for a target picture;
the acquisition unit is configured to respond to a recording starting instruction sent out based on the editing interface and continuously acquire voice signals;
the recognition unit is configured to respond to a recording ending instruction sent out based on the editing interface, perform voice recognition on the collected voice signal and obtain a recognition text;
an adding unit configured to add a text mark associated with the recognition text on the target picture.
45. A picture viewing device, comprising:
a display unit configured to display a picture with a text mark, the text mark being added to the picture by the apparatus of claim 44;
and a presentation unit configured to display the recognition text associated with the text mark in response to a trigger instruction for the text mark.
46. An apparatus for text marking a video, comprising:
a display unit configured to display an editing interface including a first video, the first video including a first video frame;
the determining unit is configured to respond to a selection instruction of the first video frame, and determine the first video frame as a target video frame;
the acquisition unit is configured to respond to a recording starting instruction sent out based on the editing interface and continuously acquire voice signals;
the recognition unit is configured to respond to a recording ending instruction sent out based on the editing interface, perform voice recognition on the collected voice signal and obtain a recognition text;
an adding unit configured to add a text mark associated with the recognition text on the target video frame.
47. A video viewing device, comprising:
a display unit configured to display a video with a text mark, the text mark being added to the video by the apparatus of claim 46, the video comprising a first video frame with a first text mark;
and a presentation unit configured to display the recognition text associated with the first text mark in response to a trigger instruction for the first text mark.
48. A picture processing method comprises the following steps:
displaying a chat interface, and receiving a target picture to be sent selected based on the chat interface;
responding to an editing instruction aiming at the target picture, entering a picture editing interface, wherein a voice mark icon is displayed;
responding to a triggering instruction of the voice mark icon, and entering a recording interface;
and storing the voice signals collected based on the recording interface as an audio file, and adding a voice mark associated with the audio file on the target picture.
49. The method of claim 48, wherein adding a voice mark associated with the audio file on the target picture comprises:
and displaying the chat nickname and/or chat avatar of the current editing user in the voice mark.
50. A picture processing method comprises the following steps:
displaying a chat interface, wherein a chat window of the chat interface comprises a target picture;
responding to a trigger instruction aiming at the target picture, and displaying a menu bar, wherein the menu bar comprises a voice mark icon;
responding to a triggering instruction of the voice mark icon, and entering a recording interface;
and storing the voice signals collected based on the recording interface as an audio file, and adding a voice mark associated with the audio file on the target picture.
51. The method of claim 50, wherein prior to displaying the chat interface, the method further comprises:
receiving the target picture from a contact corresponding to the chat window, wherein the target picture includes an existing voice mark, in which a chat nickname and/or chat avatar of the contact is displayed;
wherein adding a voice mark associated with the audio file to the target picture comprises:
displaying the chat nickname and/or chat avatar of the current editing user in the voice mark.
52. The method of claim 50, wherein after adding a voice mark associated with the audio file on the target picture, the method further comprises:
in response to an exit instruction for the recording interface, updating the original target picture displayed in the chat window to the target picture with the voice mark.
53. The method of claim 50, wherein after adding a voice mark associated with the audio file on the target picture, the method further comprises:
displaying prompt information in the chat window to inform all contacts that the target picture has been modified.
54. A picture processing method comprises the following steps:
displaying a chat interface, wherein a chat window of the chat interface comprises a target picture with a first voice mark, and the first voice mark is added by a current contact corresponding to the chat window;
responding to a trigger instruction of the first voice mark, and displaying a menu bar, wherein the menu bar comprises a voice reply icon;
responding to a triggering instruction of the voice reply icon, and entering a recording interface;
storing the voice signals collected based on the recording interface as an audio file, and adding a second voice mark related to the audio file in an area adjacent to the first voice mark in the target picture; or adding the voice signal acquired based on the recording interface into the audio file corresponding to the first voice mark.
55. The method of claim 54, wherein the method further comprises:
and in response to an exit instruction for the recording interface, displaying prompt information in the chat window to inform all contacts that the voice mark in the target picture has been modified.
56. A method of picture processing, the method comprising:
displaying a picture editing interface containing a target picture, wherein a function menu of the picture editing interface comprises a voice mark icon;
responding to a triggering instruction of the voice mark icon, and entering a voice mark interface;
and converting the input text received based on the voice mark interface into an audio file, and adding a voice mark related to the audio file on the target picture.
57. A picture processing method comprises the following steps:
displaying a chat interface containing a target picture;
responding to a trigger instruction of the target picture, and displaying a menu bar, wherein the menu bar comprises a voice adding emoticon;
responding to a triggering instruction of the voice adding emoticon, and entering a voice adding emoticon interface;
converting the voice signal collected based on the voice adding emoticon interface into text, and generating an animation expression based on the text;
and adding the animation expression in the target picture.
58. The method of claim 57, wherein generating the animation expression based on the text comprises:
retrieving an original expression associated with the text from an expression library;
superimposing the text on the original expression to obtain the animation expression.
59. A picture processing method, wherein an execution subject of the method is an e-commerce platform, the method comprising:
displaying a commodity information editing interface, wherein the commodity information editing interface comprises a target picture for a target commodity;
responding to a trigger instruction of the target picture, and displaying a menu bar, wherein the menu bar comprises a voice mark icon;
responding to a triggering instruction of the voice mark icon, and entering a recording interface;
and storing the voice signals collected based on the recording interface as an audio file, and adding a voice mark associated with the audio file on the target picture.
60. The method of claim 59, wherein after adding a voice mark associated with the audio file on the target picture, the method further comprises:
displaying the target picture in a commodity detail page for the target commodity.
61. A picture processing method comprises the following steps:
displaying an order evaluation interface for a first order, wherein the order evaluation interface includes a picture adding icon;
responding to the trigger instruction aiming at the picture adding icon, and receiving a selected target picture;
responding to a voice marking instruction sent to the target picture, and entering a recording interface;
and storing the voice signal collected based on the recording interface as an audio file, and adding a voice mark associated with the audio file to the target picture.
62. A picture processing method comprises the following steps:
displaying a commodity evaluation interface aiming at a target commodity, wherein the commodity evaluation interface comprises first user evaluation, and the first user evaluation comprises a target picture with a first voice mark;
responding to a trigger instruction aiming at the target picture, and displaying a menu bar, wherein the menu bar comprises a voice reply icon;
responding to a triggering instruction of the voice reply icon, and entering a recording interface;
storing the voice signals collected based on the recording interface as an audio file, and adding a second voice mark related to the audio file in an area adjacent to the first voice mark in the target picture; or adding the voice signal acquired based on the recording interface into the audio file corresponding to the first voice mark.
63. The method of claim 62, wherein the method further comprises:
in response to the quit instruction of the recording interface, displaying a notification message of the first user evaluation in the commodity evaluation interface for notifying that the first user evaluation is replied.
64. A picture processing method, wherein an execution subject of the method is a customer service platform, the method comprising:
receiving a session message sent by a user, wherein the session message includes a target picture with a first voice mark;
acquiring an audio file associated with the first voice mark, and performing voice recognition on the audio file to obtain a recognition text;
inputting the recognition text into a pre-trained user question prediction model, and outputting a corresponding user standard question;
and feeding back the question answers corresponding to the user standard questions to the user.
65. The method of claim 64, wherein feeding back the question answer corresponding to the user standard question to the user comprises:
converting the question answer into reply audio, and adding a second voice mark associated with the reply audio in the target picture;
sending the target picture with the newly added second voice mark to the user.
66. The method of claim 64, wherein feeding back the question answer corresponding to the user standard question to the user comprises:
converting the question answer into reply audio;
adding a second voice tag associated with the reply audio in an area adjacent to the first voice tag in the target picture; or, adding the reply audio to the audio file corresponding to the first voice mark;
and sending prompt information to the user to prompt the user that the reply content is added to the target picture.
67. A method of processing an electronic document, comprising:
displaying a file processing interface aiming at a target electronic file, wherein a function menu bar of the file processing interface comprises a voice mark icon;
responding to a triggering instruction of the voice mark icon, and entering a recording interface;
and storing the voice signal acquired based on the recording interface as an audio file, and adding a voice mark associated with the audio file to the target electronic file.
68. The method of claim 67, wherein the target electronic document is an electronic contract and the document processing interface is a contract approval interface.
69. The method of claim 67, wherein the file format of the target electronic file is a Word document, a PDF document, or an Excel spreadsheet.
70. A picture processing method, wherein an execution subject of the method is a live broadcast platform, the method comprising:
displaying a commodity information editing interface, wherein the commodity information editing interface comprises a target picture of a target commodity to be placed on a shelf;
responding to a trigger instruction of the target picture, and displaying a menu bar, wherein the menu bar comprises a voice mark icon;
responding to a triggering instruction of the voice mark icon, and entering a recording interface;
and storing the voice signals collected based on the recording interface as an audio file, and adding a voice mark associated with the audio file on the target picture.
71. The method of claim 70, wherein after adding a voice mark associated with the audio file on the target picture, the method further comprises:
displaying a live broadcast interface, wherein the live broadcast interface comprises a commodity shelf icon;
responding to a triggering instruction of the commodity shelving icon, and displaying pictures of commodities to be shelved, wherein the pictures include the target picture;
and responding to a selection instruction of the target picture, and displaying the target picture in a commodity display window of the live broadcast interface.
72. A picture processing apparatus comprising:
a display unit configured to display a chat interface;
the receiving unit is configured to receive a target picture to be sent, which is selected based on the chat interface;
the first interface switching unit is configured to respond to an editing instruction aiming at the target picture, enter a picture editing interface, and display a voice mark icon;
the second interface switching unit is configured to respond to a triggering instruction of the voice mark icon and enter a recording interface;
and the marking unit is configured to store the voice signals collected based on the recording interface as audio files and add voice marks related to the audio files on the target picture.
73. A picture processing apparatus comprising:
the interface display unit is configured to display a chat interface, and a chat window of the chat interface comprises a target picture;
the menu bar display unit is configured to respond to a trigger instruction aiming at the target picture and display a menu bar, and the menu bar comprises a voice mark icon;
the interface switching unit is configured to respond to a triggering instruction of the voice mark icon and enter a recording interface;
and the marking unit is configured to store the voice signals collected based on the recording interface as audio files and add voice marks related to the audio files on the target picture.
74. A picture processing apparatus comprising:
the interface display unit is configured to display a chat interface, a chat window of the chat interface comprises a target picture with a first voice mark, and the first voice mark is added by a current contact corresponding to the chat window;
the menu bar display unit is configured to respond to a trigger instruction of the first voice mark and display a menu bar, and the menu bar comprises a voice reply icon;
the interface switching unit is configured to respond to a trigger instruction of the voice reply icon and enter a recording interface;
the marking unit is configured to store the voice signals collected based on the recording interface as audio files, and add second voice marks related to the audio files in the area, adjacent to the first voice marks, of the target picture; or adding the voice signal acquired based on the recording interface into the audio file corresponding to the first voice mark.
75. A picture processing apparatus, the apparatus comprising:
the display unit is configured to display a picture editing interface containing a target picture, and a function menu of the picture editing interface comprises a voice mark icon;
the interface switching unit is configured to respond to a triggering instruction of the voice mark icon and enter a voice mark interface;
and the marking unit is configured to convert the input text received based on the voice marking interface into an audio file and add a voice mark related to the audio file on the target picture.
76. A picture processing apparatus comprising:
the interface display unit is configured to display a chat interface containing the target picture;
the menu bar display unit is configured to respond to a trigger instruction of the target picture and display a menu bar, and the menu bar comprises a voice adding emoticon;
the interface switching unit is configured to respond to a triggering instruction of the voice adding emoticon and enter a voice adding emoticon interface;
the expression generating unit is configured to convert the voice signal collected based on the voice adding emoticon interface into text and generate an animation expression based on the text;
and the expression adding unit is configured to add the animation expression in the target picture.
77. A picture processing device, the device being integrated with an e-commerce platform, the device comprising:
the interface display unit is configured to display a commodity information editing interface, wherein the commodity information editing interface comprises a target picture aiming at a target commodity;
the menu bar display unit is configured to respond to a trigger instruction of the target picture and display a menu bar, and the menu bar comprises a voice mark icon;
the interface switching unit is configured to respond to a triggering instruction of the voice mark icon and enter a recording interface;
and the marking unit is configured to store the voice signals collected based on the recording interface as audio files and add voice marks related to the audio files on the target picture.
78. A picture processing apparatus comprising:
the display unit is configured to display an order evaluation interface for the first order, wherein the order evaluation interface comprises an added picture icon;
the receiving unit is configured to respond to the trigger instruction for the picture adding icon and receive the selected target picture;
the interface switching unit is configured to respond to a voice marking instruction sent to the target picture and enter a recording interface;
and the marking unit is configured to store the voice signal collected based on the recording interface as an audio file and add a voice mark associated with the audio file to the target picture.
79. A picture processing apparatus comprising:
the interface display unit is configured to display a commodity evaluation interface for a target commodity, wherein the commodity evaluation interface comprises a first user evaluation, and the first user evaluation comprises a target picture with a first voice mark;
the menu bar display unit is configured to respond to a trigger instruction aiming at the target picture and display a menu bar, and the menu bar comprises a voice reply icon;
the interface switching unit is configured to respond to a trigger instruction of the voice reply icon and enter a recording interface;
the marking unit is configured to store the voice signals collected based on the recording interface as audio files, and add second voice marks related to the audio files in the area, adjacent to the first voice marks, of the target picture; or adding the voice signal acquired based on the recording interface into the audio file corresponding to the first voice mark.
80. A picture processing device, the device being integrated into a customer service platform, the device comprising:
the receiving unit is configured to receive a conversation message sent by a user, wherein the conversation message comprises a target picture with a first voice mark;
the acquisition unit is configured to acquire an audio file associated with the first voice tag, and perform voice recognition on the audio file to obtain a recognition text;
the prediction unit is configured to input the recognition text into a pre-trained user question prediction model and output a corresponding user standard question;
a feedback unit configured to feed back a question answer corresponding to the user standard question to the user.
81. An apparatus for processing an electronic document, comprising:
the display unit is configured to display a file processing interface aiming at a target electronic file, and a function menu bar of the file processing interface comprises a voice mark icon;
the interface switching unit is configured to respond to a triggering instruction of the voice mark icon and enter a recording interface;
and the marking unit is configured to store the voice signals collected based on the recording interface as audio files and add voice marks related to the audio files to the target electronic files.
82. A picture processing device, the device being integrated with a live platform, the device comprising:
the interface display unit is configured to display a commodity information editing interface, wherein the commodity information editing interface comprises a target picture of a target commodity to be placed on a shelf;
the menu bar display unit is configured to respond to a trigger instruction of the target picture and display a menu bar, and the menu bar comprises a voice mark icon;
the interface switching unit is configured to respond to a triggering instruction of the voice mark icon and enter a recording interface;
and the marking unit is configured to store the voice signals collected based on the recording interface as audio files and add voice marks related to the audio files on the target picture.
83. A computer-readable storage medium, having stored thereon a computer program, wherein the computer program, when executed in a computer, causes the computer to perform the method of any of claims 1-39, 48-71.
84. A computing device comprising a memory and a processor, wherein the memory has stored therein executable code that when executed by the processor implements the method of any of claims 1-39, 48-71.
CN202010167913.8A 2020-03-11 2020-03-11 Method and device for voice marking of pictures and videos Pending CN113392272A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202010167913.8A CN113392272A (en) 2020-03-11 2020-03-11 Method and device for voice marking of pictures and videos
PCT/CN2021/080145 WO2021180155A1 (en) 2020-03-11 2021-03-11 Method and apparatus for voice marking image and video

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010167913.8A CN113392272A (en) 2020-03-11 2020-03-11 Method and device for voice marking of pictures and videos

Publications (1)

Publication Number Publication Date
CN113392272A (en) 2021-09-14

Family

ID=77615418

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010167913.8A Pending CN113392272A (en) 2020-03-11 2020-03-11 Method and device for voice marking of pictures and videos

Country Status (2)

Country Link
CN (1) CN113392272A (en)
WO (1) WO2021180155A1 (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114979054A (en) * 2022-05-13 2022-08-30 维沃移动通信有限公司 Video generation method and device, electronic equipment and readable storage medium
CN115102917A (en) * 2022-06-28 2022-09-23 维沃移动通信有限公司 Message sending method, message processing method and device
TWI820677B (en) * 2022-04-18 2023-11-01 開曼群島商粉迷科技股份有限公司 Method, system for providing location-based content linking icon and computer-readable recording medium

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117714766A (en) * 2022-09-09 2024-03-15 抖音视界有限公司 Video content preview interaction method and device, electronic equipment and storage medium

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140344854A1 (en) * 2013-05-17 2014-11-20 Aereo, Inc. Method and System for Displaying Speech to Text Converted Audio with Streaming Video Content Data
CN104469544A (en) * 2014-11-07 2015-03-25 重庆晋才富熙科技有限公司 Video marking method based on voice technology
CN104333817A (en) * 2014-11-07 2015-02-04 重庆晋才富熙科技有限公司 Method for quickly marking video
CN108092873B (en) * 2017-10-27 2021-01-15 颜厥护 Instant messaging method and system
CN110215707B (en) * 2019-07-12 2023-05-05 网易(杭州)网络有限公司 Method and device for voice interaction in game, electronic equipment and storage medium
CN110381382B (en) * 2019-07-23 2021-02-09 腾讯科技(深圳)有限公司 Video note generation method and device, storage medium and computer equipment


Also Published As

Publication number Publication date
WO2021180155A1 (en) 2021-09-16

Similar Documents

Publication Publication Date Title
WO2021180155A1 (en) Method and apparatus for voice marking image and video
US7873911B2 (en) Methods for providing information services related to visual imagery
CN105635764B (en) Method and device for playing push information in live video
US8635293B2 (en) Asynchronous video threads
CN104956357A (en) Creating and sharing inline media commentary within a network
CN113536172B (en) Encyclopedia information display method and device and computer storage medium
CN113573129B (en) Commodity object display video processing method and device
CN111629253A (en) Video processing method and device, computer readable storage medium and electronic equipment
CN103530320A (en) Multimedia file processing method and device and terminal
CN113395605B (en) Video note generation method and device
CN112866783A (en) Comment interaction method and device and electronic equipment
CN112329403A (en) Live broadcast document processing method and device
CN112004031B (en) Video generation method, device and equipment
CN112907703A (en) Expression package generation method and system
CN112073740A (en) Information display method, device, server and storage medium
CN110855557A (en) Video sharing method and device and storage medium
JP5475259B2 (en) Text information sharing method, server device, and client device
CN113886610A (en) Information display method, information processing method and device
CN114139525A (en) Data processing method and device, electronic equipment and computer storage medium
CN112052315A (en) Information processing method and device
Jokela et al. Mobile video editor: design and evaluation
CN112073738B (en) Information processing method and device
CN113709565B (en) Method and device for recording facial expression of watching video
CN110209870B (en) Music log generation method, device, medium and computing equipment
CN115174536A (en) Audio playing method and device and nonvolatile computer readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination