WO2018224032A1 - Multimedia management method and device - Google Patents

Multimedia management method and device

Info

Publication number
WO2018224032A1
Authority
WO
WIPO (PCT)
Prior art keywords
information
user
instruction
voice
tag
Prior art date
Application number
PCT/CN2018/090400
Other languages
English (en)
Chinese (zh)
Inventor
马靖博
Original Assignee
中兴通讯股份有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 中兴通讯股份有限公司
Publication of WO2018224032A1

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/08 Speech classification or search
    • G10L 15/18 Speech classification or search using natural language modelling
    • G10L 15/1822 Parsing for meaning understanding
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 17/00 Speaker identification or verification techniques
    • G10L 17/26 Recognition of special voice characteristics, e.g. for use in lie detectors; Recognition of animal voices
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L 2015/223 Execution procedure of a spoken command

Definitions

  • The present disclosure relates to the field of multimedia technologies, and in particular to a multimedia management method and apparatus.
  • Existing multimedia playback controls include, in addition to basic operations such as play and pause, a means of labeling playback positions within a video: the video is segmented at time nodes, so that the user can navigate the video by its segment labels and reach the content of interest faster.
  • However, the label management method in the related art is rigid and fixed; the user cannot manage labels according to his or her own preferences and actual needs, and it is difficult to meet the diverse needs of different users.
  • Embodiments of the present disclosure provide a multimedia management method and apparatus, aiming to provide more flexible and unconstrained multimedia management.
  • An embodiment of the present disclosure provides a multimedia management method, including: receiving voice information of a user; extracting feature information and instruction information of the user from the voice information; and managing a corresponding multimedia file according to the feature information and the instruction information.
  • An embodiment of the present disclosure further provides a multimedia management apparatus, including: a voice input module configured to receive voice information of a user; a voice recognition module configured to extract feature information and instruction information of the user from the voice information; and an instruction processing module configured to manage the corresponding multimedia file according to the feature information and the instruction information.
  • Embodiments of the present disclosure further provide a recording medium having program code stored thereon; when the program code is executed by a processor, the processor performs a multimedia management method according to the present disclosure.
  • FIG. 1 is a flowchart of a multimedia management method according to an embodiment of the present disclosure.
  • FIG. 2 is a flowchart of a multimedia management method according to another embodiment of the present disclosure.
  • FIG. 3 is a flowchart of a method for playing a multimedia file according to an embodiment of the present disclosure.
  • FIG. 4 is a flowchart of a method for adding a tag to a multimedia file according to an embodiment of the present disclosure.
  • FIG. 5 is a flowchart of a method for jumping within a multimedia file according to an embodiment of the present disclosure.
  • FIG. 6 is a schematic structural diagram of a multimedia management apparatus according to an embodiment of the present disclosure.
  • The idea of the present disclosure is to add the user's audio information to the traditional multimedia management mode to enable customizable management, thereby improving the flexibility and freedom of management while providing high reliability and a good user experience.
  • FIG. 1 is a flowchart of a multimedia management method according to an embodiment of the present disclosure.
  • the multimedia management method of the embodiment of the present disclosure includes steps S101 to S103.
  • In step S101, voice information of the user is received.
  • In step S102, feature information and instruction information of the user are extracted from the voice information.
  • In step S103, the corresponding multimedia file is managed based on the feature information and the instruction information.
  • Multimedia files are timeline-based files, i.e., multimedia files have time attributes. For a given audio or video file, a given point in time always corresponds to determined content. An audio file contains only audio content, while a video file may contain both audio and video content. Furthermore, because audio is a waveform, a given point in time also corresponds to a determined portion of the audio waveform. In other words, the time attribute can be used to locate a specified position in a multimedia file directly.
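  • As an illustration of this time attribute, the following minimal Python sketch (all names are invented here, not taken from the disclosure) models a timeline-based file whose content is addressed purely by time:

        from dataclasses import dataclass

        @dataclass
        class MultimediaFile:
            # Hypothetical timeline-based file: every position is addressable by time.
            name: str
            duration_s: float          # total length in seconds
            position_s: float = 0.0    # current playback progress

            def seek(self, time_point_s: float) -> None:
                # The time attribute locates content directly; clamp into [0, duration].
                self.position_s = min(max(time_point_s, 0.0), self.duration_s)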
  • In step S101, the voice information of the user is received.
  • The user's voice information may be voice information uttered by the user directly (e.g., produced by the user speaking) or voice information delivered by the user through another device (e.g., an electronic translator, or the user's voice recorded by another recording device).
  • The user's voice information reflects the control operations that the user wants to perform on the multimedia file, including control logic such as ordinary play commands, pause commands, and the like.
  • In step S102, the feature information of the user and the instruction information for managing the multimedia file are extracted from the voice information.
  • From the voice information, the feature information of the user can be extracted.
  • The extracted feature information can be used as identification information of the user.
  • By the feature information, it can be determined that different pieces of voice information belong to the same user; correspondingly, the voice information of different users can be distinguished by the feature information.
  • The feature information may include voiceprint information of the user.
  • The voiceprint information may include a sound-wave spectrum carrying the speech information. Sound waves are not only individual-specific but also relatively stable; in particular, after adulthood the human voice can remain relatively stable for a long time. Experiments have shown that whether a speaker deliberately imitates another person's voice and tone or speaks in a soft whisper, the voiceprint remains distinct even when the imitation is vivid. Based on this property of voiceprints, voice information belonging to the same user can be accurately grouped together and distinguished from that of other users.
  • The instruction information for managing the multimedia file refers to the specific operation on the multimedia file that is expressed by the voice information uttered by the user.
  • This operational logic is communicated to the system in the form of voice messages. Owing to the particular nature of voice information, differences between individuals in voiceprint, regional background, and language can cause the instruction information contained in voice messages to take different forms. For example, when the user is Chinese, the voice information may be in Mandarin, in a local dialect, or even in a sentence mixed with some English; when the user is French, the voice message may be in French.
  • Extracting the instruction information for managing the multimedia file may therefore be performed by various means for different languages, and a different analysis method may be selected according to the feature information in the voice information.
  • Extracting the feature information and the instruction information of the user from the voice information may include: constructing voiceprint information of the user according to the voice information, the feature information including the voiceprint information; and performing speech recognition and natural semantic analysis on the voice information, and determining the instruction information based on the analysis result.
  • Natural semantic analysis refers to analyzing the meaning expressed by the voice information; the result of the analysis differs according to the language being interpreted.
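  • The following is a minimal, runnable Python sketch of this extraction step, with toy stand-ins for the voiceprint builder, the speech recognizer, and the semantic parser; every function name here is invented for illustration and none comes from the disclosure:

        import hashlib

        def build_voiceprint(voice_info: bytes) -> str:
            # Toy stand-in for real voiceprint (spectral) analysis: any stable,
            # speaker-specific feature would do in its place.
            return hashlib.sha256(voice_info).hexdigest()[:16]

        def speech_to_text(voice_info: bytes) -> str:
            # Toy stand-in for a real speech-recognition engine.
            return voice_info.decode("utf-8", errors="ignore")

        def parse_semantics(text: str) -> dict:
            # Toy natural-semantic analysis: first word is the action, rest the target.
            action, _, target = text.strip().partition(" ")
            return {"action": action.lower(), "target": target}

        def extract(voice_info: bytes) -> tuple[str, dict]:
            # Step S102 as described above: one utterance yields WHO spoke
            # (feature information) and WHAT was requested (instruction information).
            return build_voiceprint(voice_info), parse_semantics(speech_to_text(voice_info))

        print(extract(b"play holiday_video.mp4"))
        # -> ('<voiceprint>', {'action': 'play', 'target': 'holiday_video.mp4'})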
  • The instruction information may include, but is not limited to, at least one of a play instruction, a pause instruction, a stop instruction, a jump instruction, a create-directory instruction, an open-directory instruction, a tag-addition instruction, and the like.
  • A play command, a pause command, a stop command, and the like are conventional control commands for a multimedia file.
  • The corresponding control can be implemented by parsing the above instructions from the voice information. For example, in Chinese the voice information including a play command may be the Chinese equivalent of "play xxx", and in English it may be "play xxx" or the like.
  • The voice information may also include content identifying the playback object, such as the file name of the multimedia file or part of that name, according to which the corresponding multimedia file can be opened and played directly.
  • The step of managing the corresponding multimedia file according to the feature information and the instruction information may include: determining whether the user is an existing user according to the voiceprint information of the user; if the user is an existing user, managing the multimedia file according to the instruction information; and if the user is not an existing user, saving the voiceprint information, creating a directory corresponding to the user based on the voiceprint information of the user, and managing the multimedia file according to the instruction information.
  • The voiceprint information extracted from the voice information is compared with the voiceprint information of existing users, and based on the result of the comparison it can be determined whether the voiceprint belongs to an existing user.
  • If it belongs to an existing user, the voice information can be parsed according to that user's analysis mode to obtain the instruction information, and corresponding processing is performed according to the instruction information. If it does not belong to an existing user, no user in the existing user information corresponds to the current voice information; that is, the current user is a new user. If the new user's information is to be saved in the system, the voiceprint information is saved first, a directory corresponding to the new user is then created based on the voiceprint information, and corresponding processing is performed according to the instruction information. If it is not necessary to save the new user's information in the system, the instruction information can be parsed directly and the corresponding operation performed according to it.
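  • A minimal Python sketch of this identification-and-directory logic, assuming an invented UserRegistry store and exact-match voiceprint lookup (a real system would compare voiceprints by similarity, not equality):

        class UserRegistry:
            # Hypothetical store mapping voiceprints to per-user voice-tag directories.
            def __init__(self) -> None:
                self.directories: dict[str, list] = {}   # voiceprint -> voice tags

            def is_existing_user(self, voiceprint: str) -> bool:
                # "Determine whether the user is an existing user" via voiceprint lookup.
                return voiceprint in self.directories

            def handle(self, voiceprint: str, instruction: dict,
                       save_new_users: bool = True) -> None:
                if not self.is_existing_user(voiceprint) and save_new_users:
                    # New user: save the voiceprint and create the corresponding directory.
                    self.directories[voiceprint] = []
                # Existing or newly added user alike: manage the file per the instruction.
                self.dispatch(voiceprint, instruction)

            def dispatch(self, voiceprint: str, instruction: dict) -> None:
                pass  # play / pause / jump / add-tag handling would go here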
  • When the instruction information includes a tag-addition instruction, the step of managing the corresponding multimedia file according to the feature information and the instruction information may include: determining the time point at which the voice information is entered; subtracting a preset compensation time from that time point to obtain the tag time point; and creating a voice tag at the tag time point according to the specific content of the tag-addition instruction in the instruction information.
  • A voice tag is a means of marking a multimedia file: at the time point of the voice tag, i.e., at a time point corresponding to the time attribute of the multimedia file, preset or user-defined tag content is set, so that the user can conveniently and quickly locate that position while viewing or listening.
  • The process of adding a voice tag is roughly as follows.
  • The user views or listens to the multimedia file; for example, while watching a video, upon reaching a position where the user thinks a tag should be added, the user issues voice information including a tag-addition instruction.
  • The recorded time point is the time point at which the voice information is entered; however, this point has actually passed the point at which the user wants the tag to be, because the user must first see the video content and only then issue the tagging instruction.
  • Therefore, the time point of the actual voice tag should be the time point of entry minus the preset compensation time.
  • The compensation time can be determined according to the user's viewing habits and can differ between users; in addition, the same user can have different compensation times when viewing different video files.
  • The compensation time is intended to help the user mark the desired position in the multimedia file, and the accuracy requirement is loose: it is sufficient that the tag time point falls within a certain range of the truly desired position.
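  • The arithmetic is simply the entry time minus the compensation time, as in this Python sketch (the 3-second default is an invented example, not a value from the disclosure; per the text above it may vary by user and by file):

        def tag_time_point(entry_time_s: float, compensation_s: float = 3.0) -> float:
            # Tag time point = time the voice was entered minus the preset
            # compensation time, clamped so it never falls before the file start.
            return max(entry_time_s - compensation_s, 0.0)

        # The user speaks at 125.0 s, so the tag lands at 122.0 s, near the
        # scene that actually prompted the command.
        assert tag_time_point(125.0) == 122.0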
  • If finer adjustment is needed, the user can move the playback of the multimedia file forward or backward by voice information, or adjust it manually.
  • The voice tag may be modified according to the needs of the user; the modification may include modifying the tag time point of the voice tag and modifying the tag content of the voice tag.
  • When the instruction information includes a jump instruction, the step of managing the corresponding multimedia file according to the feature information and the instruction information may include: matching the specific content of the jump instruction in the instruction information against the existing voice tags, and when the matching degree reaches a preset threshold, jumping the playback progress of the multimedia file to the tag time point corresponding to the matched voice tag. Jumping means moving the multimedia playback progress directly to the desired tag time point.
  • The jump operation may involve two situations: if the multimedia file is being played, the playback progress jumps directly to the tag time point corresponding to the voice tag; if the multimedia file is not open, the multimedia file is opened first, and the playback progress then jumps to the tag time point corresponding to the voice tag.
  • In other words, the jump operation is performed based on a previously set voice tag: when the multimedia file is open, playback jumps directly to the tag time point of the corresponding voice tag; when it is not open, the file is first opened and the playback progress is then jumped, according to the voice information, to the tag time point of the corresponding voice tag for playback.
  • Because the voice tag was created from the same user's voice information, the voice information of the jump instruction may be matched directly against the voice of the tag, and the jump instruction is triggered when the matching degree is greater than the set threshold.
  • The content of the voice information should at least include the jump instruction and the tag content of the voice tag, and may further include the file name of the multimedia file in order to control a multimedia file that is not yet open.
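  • A minimal Python sketch of the matching step, using text similarity as a stand-in for the direct voice-to-voice matching described above; the threshold value and all names are invented for illustration:

        from difflib import SequenceMatcher

        MATCH_THRESHOLD = 0.8   # invented value; the text only says "preset threshold"

        def best_matching_tag(spoken_content: str,
                              tags: list[tuple[str, float]]) -> float | None:
            # `tags` holds (tag content, tag time point in seconds) pairs. Returns
            # the tag time point of the best match, or None when nothing reaches
            # the threshold (the user misremembered the tag, or the input was wrong).
            scored = [
                (SequenceMatcher(None, spoken_content, content).ratio(), time_point)
                for content, time_point in tags
            ]
            score, time_point = max(scored, default=(0.0, None))
            return time_point if score >= MATCH_THRESHOLD else None

        tags = [("goal scene", 122.0), ("interview", 600.0)]
        print(best_matching_tag("the goal scene", tags))   # -> 122.0
        print(best_matching_tag("credits", tags))          # -> None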
  • The instruction information may also be a create-directory instruction or an open-directory instruction.
  • The create-directory instruction is usually the instruction information used when a new user is added.
  • For voice information containing new voiceprint information, if the voice information includes a create-directory instruction, the voiceprint information is saved and a directory corresponding to the user is created.
  • The open-directory instruction opens the directory of the corresponding user's voice tags. The directory may be presented on the display panel as text, played back automatically as voice, or played under the user's control; for example, a player icon may be presented on the display panel and played when the user clicks it, or the user may trigger playback by voice information.
  • In summary, the embodiment of the present disclosure provides a multimedia management method.
  • The multimedia management method of the embodiment manages the multimedia file in combination with the user's voice information, thereby realizing flexible management and control of the multimedia file by the user, meeting users' diverse management needs, and enhancing the user experience.
  • FIG. 2 is a flow chart of a multimedia management method according to another embodiment of the present disclosure.
  • The multimedia management method according to another embodiment of the present disclosure may include steps S201 to S205.
  • In step S201, the voice information entered by the user is acquired.
  • In step S202, the feature information in the voice information is extracted.
  • In step S203, speech recognition is performed to extract the instruction information in the voice information.
  • In step S204, the corresponding multimedia file is managed according to the instruction information.
  • In step S205, the feature information in the entered voice information is saved.
  • Steps S201 to S204 correspond to steps S101 to S103 in the foregoing embodiment: step S201 corresponds to step S101, steps S202 and S203 correspond to step S102, and step S204 corresponds to step S103. A detailed description of these steps is therefore omitted here.
  • In step S205, the feature information in the voice information entered by the user is saved; for example, the voiceprint information of the user constructed from the voice information is saved, regardless of whether the user is an existing user.
  • FIG. 3 is a flowchart of a method for playing a multimedia file according to an embodiment of the present disclosure.
  • When the instruction information includes a play instruction, the step of performing corresponding processing on the multimedia file according to the instruction information (i.e., step S204 in FIG. 2) may include steps S301 to S304.
  • In step S301, it is determined whether the instruction information is a play instruction; if so, the process proceeds to step S302.
  • In step S302, it is determined whether the instruction information corresponds to an existing user; if so, the process proceeds to step S303; if not, the process proceeds to step S304.
  • In step S303, the corresponding multimedia file is opened and played.
  • In step S304, the user is prompted that there is no relevant voice tag, and the corresponding multimedia file is played from the beginning.
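  • As an illustrative Python sketch of this play flow, reusing the UserRegistry idea from the sketch above (the helper objects and strings are invented, not part of the disclosure):

        def handle_play(registry: "UserRegistry", voiceprint: str, target: str) -> str:
            # Steps S301 to S304 in FIG. 3 as a toy decision: a recognized user's
            # file is opened and played; an unrecognized user gets a "no voice tag"
            # prompt and playback from the beginning.
            if registry.is_existing_user(voiceprint):                     # S302
                return f"playing {target}"                                # S303
            return f"no relevant voice tag; playing {target} from 0:00"   # S304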
  • FIG. 4 is a flowchart of a method for adding a label to a multimedia file according to an embodiment of the present disclosure.
  • When the instruction information includes a tag-addition instruction, the step of performing corresponding processing on the multimedia file according to the instruction information (i.e., step S204 in FIG. 2) may include steps S401 to S406.
  • In step S401, it is determined whether the instruction information is a tag-addition instruction; if so, the process proceeds to step S402.
  • In step S402, the time point at which the voice information is entered is acquired.
  • In step S403, the tag time point corresponding to the voice tag is calculated.
  • In step S404, a voice tag is generated.
  • In step S405, the voice tag is added to the corresponding user directory.
  • In step S406, prompt information indicating that the voice tag was added successfully is fed back.
  • FIG. 5 is a flowchart of a method for jumping a multimedia file according to an embodiment of the present disclosure.
  • When the instruction information includes a jump instruction, the step of performing corresponding processing on the multimedia file according to the instruction information (i.e., step S204 in FIG. 2) may include steps S501 to S508.
  • In step S501, it is determined whether the instruction information is a jump instruction; if so, the process proceeds to step S502.
  • In step S502, it is determined whether the jump instruction is one for viewing the tag directory; if so, the process proceeds to step S503; if not, the process proceeds to step S505.
  • In step S503, it is determined whether a corresponding tag directory exists; if so, the process proceeds to step S504; if not, the process proceeds to step S507.
  • In step S504, the tag directory corresponding to the user is displayed.
  • In step S505, it is determined whether there is a user and tag content matching the jump instruction; if so, the process proceeds to step S506; if not, the process proceeds to step S508.
  • In step S506, the playback progress of the multimedia file is jumped to the tag time point corresponding to the voice tag, and playback proceeds from there.
  • In step S507, the user is prompted that there is no corresponding tag directory.
  • In step S508, the user is prompted that there is no corresponding voice tag.
  • Regarding step S502: if the jump instruction specifies the content of a voice tag, it is not necessary to display the tag directory; if the jump instruction does not specify specific voice-tag content, the tag directory needs to be displayed, i.e., the tag directory is viewed.
  • Step S505 may include matching the specific content of the jump instruction against the existing voice tags. If the matching degree reaches a preset threshold, a related voice tag exists; if no match reaches the preset threshold, the tag does not exist, which may be because the user misremembered the content of the voice tag or because the voice information entered by the user is incorrect.
  • FIG. 6 is a schematic structural diagram of a multimedia management apparatus according to an embodiment of the present disclosure.
  • The multimedia management device may include a voice entry module 601, a voice recognition module 602, and an instruction processing module 603.
  • The voice entry module 601 is configured to receive the voice information of the user.
  • The voice recognition module 602 is configured to extract the feature information and the instruction information of the user from the voice information.
  • The instruction processing module 603 is configured to manage the corresponding multimedia file based on the feature information and the instruction information. That is, the multimedia management apparatus shown in FIG. 6 is used to execute the multimedia management method according to the embodiment of the present disclosure shown in FIG. 1.
  • The respective modules shown in FIG. 6 are used to execute the respective method steps shown in FIG. 1.
  • The specific description of each step given with reference to FIG. 1 applies equally to the respective modules shown in FIG. 6, and details are not repeated here.
  • The multimedia management device may further include a voice storage module 604 and a display module 605.
  • The voice storage module 604 is configured to store the feature information (e.g., voiceprint information) of the user extracted by the voice recognition module 602.
  • The display module 605 is configured to present corresponding content to the user according to the instruction information extracted from the voice information, such as playing a video file, presenting a tag directory, and/or presenting the text of a voice tag.
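  • For illustration, the following Python sketch shows one possible wiring of modules 601 to 605 described above; the object interfaces and method names are invented here, not specified by the disclosure:

        class MultimediaManager:
            # Hypothetical wiring of the modules in FIG. 6.
            def __init__(self, voice_entry, recognizer, processor,
                         storage=None, display=None):
                self.voice_entry = voice_entry   # module 601: receives voice information
                self.recognizer = recognizer     # module 602: extracts features + instruction
                self.processor = processor       # module 603: manages the multimedia file
                self.storage = storage           # optional module 604: stores voiceprints
                self.display = display           # optional module 605: presents content

            def on_voice(self) -> None:
                voice = self.voice_entry.receive()
                voiceprint, instruction = self.recognizer.extract(voice)
                if self.storage is not None:
                    self.storage.save(voiceprint)              # cf. step S205
                result = self.processor.manage(voiceprint, instruction)
                if self.display is not None and result is not None:
                    self.display.present(result)               # e.g. tag directory, video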
  • The modules or steps described above can be implemented with a general-purpose computing device; they can be centralized on a single computing device or distributed across a network of multiple computing devices. According to an embodiment of the present disclosure, they may be implemented by program code executable by a computing device, so that the program code may be stored in a storage medium (such as ROM/RAM, a magnetic disk, or an optical disk) and executed by a computing device. In some cases, the steps shown or described may be performed in a different order than described herein, or they may be separately fabricated as individual integrated-circuit modules, or a plurality of the modules or steps may be implemented as a single integrated-circuit module. Therefore, the present disclosure is not limited to any specific combination of hardware and software.

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

The invention relates to a multimedia management method and device. The multimedia management method comprises: receiving voice information of a user (S101); extracting feature information and instruction information of the user from the voice information (S102); and managing a corresponding multimedia file according to the feature information and the instruction information (S103).
PCT/CN2018/090400 2017-06-08 2018-06-08 Multimedia management method and device WO2018224032A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201710428940.4 2017-06-08
CN201710428940.4A CN109033099A (zh) 2017-06-08 2017-06-08 一种多媒体管理方法和装置

Publications (1)

Publication Number Publication Date
WO2018224032A1 (fr) 2018-12-13

Family

ID=64566419

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/090400 WO2018224032A1 (fr) 2018-06-08 Multimedia management method and device

Country Status (2)

Country Link
CN (1) CN109033099A (fr)
WO (1) WO2018224032A1 (fr)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114168764B (zh) * 2021-11-04 2024-05-17 海南视联通信技术有限公司 一种多媒体数据处理方法和装置

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103226966A (zh) * 2013-04-26 2013-07-31 广东欧珀移动通信有限公司 一种可快速定位播放进度的方法及移动终端
CN103399737A (zh) * 2013-07-18 2013-11-20 百度在线网络技术(北京)有限公司 基于语音数据的多媒体处理方法及装置
CN105872619A (zh) * 2015-12-15 2016-08-17 乐视网信息技术(北京)股份有限公司 一种视频播放记录的匹配方法及匹配装置
CN106372246A (zh) * 2016-09-20 2017-02-01 深圳市同行者科技有限公司 音频播放方法及其装置

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8990358B2 (en) * 2013-03-15 2015-03-24 Michael Sharp Systems and methods for expedited delivery of media content

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103226966A (zh) * 2013-04-26 2013-07-31 广东欧珀移动通信有限公司 一种可快速定位播放进度的方法及移动终端
CN103399737A (zh) * 2013-07-18 2013-11-20 百度在线网络技术(北京)有限公司 基于语音数据的多媒体处理方法及装置
CN105872619A (zh) * 2015-12-15 2016-08-17 乐视网信息技术(北京)股份有限公司 一种视频播放记录的匹配方法及匹配装置
CN106372246A (zh) * 2016-09-20 2017-02-01 深圳市同行者科技有限公司 音频播放方法及其装置

Also Published As

Publication number Publication date
CN109033099A (zh) 2018-12-18

Similar Documents

Publication Publication Date Title
CN107659847B (zh) 语音互动方法和装置
CN108133707B (zh) 一种内容分享方法及系统
US10210769B2 (en) Method and system for reading fluency training
US8302010B2 (en) Transcript editor
EP3477958A1 (fr) Empêcher l'activation d'un dispositif mains libres
CN106796496B (zh) 显示设备及其操作方法
CN109754783B (zh) 用于确定音频语句的边界的方法和装置
US9066049B2 (en) Method and apparatus for processing scripts
US20200184948A1 (en) Speech playing method, an intelligent device, and computer readable storage medium
JP2018170015A (ja) 情報処理装置
CN111885416B (zh) 一种音视频的修正方法、装置、介质及计算设备
US11049490B2 (en) Audio playback device and audio playback method thereof for adjusting text to speech of a target character using spectral features
US20220093103A1 (en) Method, system, and computer-readable recording medium for managing text transcript and memo for audio file
US20190155843A1 (en) A secure searchable media object
CN104349173A (zh) 视频复读方法及装置
CN109376145B (zh) 影视对白数据库的建立方法、建立装置及存储介质
WO2018224032A1 (fr) 2018-12-13 Multimedia management method and device
CN110890095A (zh) 语音检测方法、推荐方法、装置、存储介质和电子设备
US20140207454A1 (en) Text reproduction device, text reproduction method and computer program product
CN113761865A (zh) 声文重对齐及信息呈现方法、装置、电子设备和存储介质
CN113221514A (zh) 文本处理方法、装置、电子设备和存储介质
US10657202B2 (en) Cognitive presentation system and method
CN112837668A (zh) 一种语音处理方法、装置和用于处理语音的装置
CN113241061B (zh) 语音识别结果的处理方法、装置、电子设备和存储介质
KR20190099676A (ko) 사용자의 발화를 기반으로 컨텐츠를 제공하는 장치 및 시스템

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18813827

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 18813827

Country of ref document: EP

Kind code of ref document: A1