CN111629267B - Audio labeling method, device, equipment and computer readable storage medium - Google Patents

Audio labeling method, device, equipment and computer readable storage medium

Info

Publication number
CN111629267B
CN111629267B (application CN202010371102.XA)
Authority
CN
China
Prior art keywords
audio
video
labeling
annotation
identity information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010371102.XA
Other languages
Chinese (zh)
Other versions
CN111629267A (en)
Inventor
蒋亚雄
刘洪�
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202010371102.XA priority Critical patent/CN111629267B/en
Publication of CN111629267A publication Critical patent/CN111629267A/en
Application granted granted Critical
Publication of CN111629267B publication Critical patent/CN111629267B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/47End-user applications
    • H04N21/472End-user interface for requesting content, additional data or services; End-user interface for interacting with content, e.g. for content reservation or setting reminders, for requesting event notification, for manipulating displayed content
    • H04N21/47217End-user interface for requesting content, additional data or services; End-user interface for interacting with content, e.g. for content reservation or setting reminders, for requesting event notification, for manipulating displayed content for controlling playback functions for recorded or on-demand content, e.g. using progress bars, mode or play-point indicators or bookmarks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/783Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G06F16/7837Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using objects detected or recognised in the video content
    • G06F16/784Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using objects detected or recognised in the video content the detected or recognised objects being people
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00Speaker identification or verification techniques
    • G10L17/04Training, enrolment or model building
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/80Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
    • H04N21/85Assembly of content; Generation of multimedia applications
    • H04N21/854Content authoring
    • H04N21/8543Content authoring using a description language, e.g. Multimedia and Hypermedia information coding Expert Group [MHEG], eXtensible Markup Language [XML]

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Theoretical Computer Science (AREA)
  • Library & Information Science (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • Signal Processing (AREA)
  • Human Computer Interaction (AREA)
  • Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Security & Cryptography (AREA)
  • User Interface Of Digital Computer (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The embodiments of the present application disclose an audio labeling method, apparatus, device, and computer-readable storage medium. The method comprises the following steps: displaying identity information of an annotation object, a video containing the annotation object, and an audio graphic corresponding to the audio playing progress of the video; synchronously playing the video and the audio graphic, and detecting, in the audio graphic, an audio period matching the identity information of the annotation object; and generating an audio annotation corpus according to the audio corresponding to the audio period and the identity information of the annotation object. This technical solution can greatly improve the efficiency of speech labeling.

Description

Audio labeling method, device, equipment and computer readable storage medium
Technical Field
The present disclosure relates to the field of data processing technologies, and in particular to an audio labeling method, apparatus, device, and computer-readable storage medium.
Background
Machine learning is the core of artificial intelligence and the fundamental approach to giving computers intelligence; it is applied throughout the various fields of artificial intelligence. For example, in the technical field of voiceprint recognition, the identity of a speaker is recognized automatically through machine learning, making the speaker identification process more intelligent.
In order to obtain a machine learning model that automatically identifies the identity of a speaker, a large amount of annotated corpus is required to train the model until it achieves a good speaker recognition effect. Therefore, how to obtain audio annotation corpus conveniently and rapidly is a technical problem to be solved in the prior art.
Disclosure of Invention
In order to solve the above technical problems, the embodiments of the present application provide an audio labeling method, apparatus, device, and computer-readable storage medium, which make it possible to obtain audio annotation corpus quickly.
The technical scheme adopted by the application is as follows:
An audio labeling method, comprising: displaying identity information of an annotation object, a video containing the annotation object, and an audio graphic corresponding to the audio playing progress of the video; synchronously playing the video and the audio graphic, and detecting, in the audio graphic, an audio period matching the identity information of the annotation object; and generating an audio annotation corpus according to the audio corresponding to the audio period and the identity information of the annotation object.
An audio annotation device, comprising: an information display module for displaying identity information of an annotation object, a video containing the annotation object, and an audio graphic corresponding to the audio playing progress of the video; a play detection module for synchronously playing the video and the audio graphic and detecting, in the audio graphic, an audio period matching the identity information of the annotation object; and an information generation module for generating an audio annotation corpus according to the audio corresponding to the audio period and the identity information of the annotation object.
An audio labeling apparatus comprising a processor and a memory having stored thereon computer readable instructions which, when executed by the processor, implement an audio labeling method as described above.
A computer readable storage medium having stored thereon computer readable instructions which, when executed by a processor of a computer, cause the computer to perform an audio tagging method as described above.
In the above technical solution, the identity information of the annotation object, the video containing the annotation object, and the audio graphic corresponding to the audio playing progress of the video are displayed, so that during voice labeling the audio of the annotation object can be labeled rapidly by combining the identity information of the annotation object, the video containing the annotation object, and the audio information of the video. By detecting, in the audio graphic, the audio period matching the identity information of the annotation object, and generating the audio annotation corpus from the audio corresponding to the detected audio period and the identity information of the annotation object, the audio annotation corpus can be generated intelligently, and the efficiency of obtaining audio annotation corpus is greatly improved.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the application.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the application and together with the description, serve to explain the principles of the application. It is apparent that the drawings in the following description are only some embodiments of the present application, and that other drawings may be obtained from these drawings without inventive effort for a person of ordinary skill in the art. In the drawings:
FIG. 1 is a schematic illustration of one implementation environment to which the present application relates;
FIG. 2 is a flowchart illustrating an audio annotation method according to an example embodiment;
FIG. 3 is a flow chart of step 110 in an exemplary embodiment of the embodiment shown in FIG. 2;
FIG. 4 is a schematic diagram of an audio annotation interface, according to an example embodiment;
FIG. 5 is a schematic diagram of an audio annotation interface, according to another exemplary embodiment;
FIG. 6 is a flow chart of step 111 in an exemplary embodiment of the embodiment shown in FIG. 3;
FIG. 7 is a flowchart illustrating an audio annotation interface for data display, according to an exemplary embodiment;
FIG. 8 is a flow chart of step 130 in an exemplary embodiment of the embodiment shown in FIG. 2;
FIG. 9 is a schematic diagram of an audio annotation interface, according to another exemplary embodiment;
FIG. 10 is a flowchart illustrating an audio annotation method according to another exemplary embodiment;
FIG. 11 is a schematic diagram illustrating an audio annotation process according to an example embodiment;
FIG. 12 is a block diagram of an audio tagging apparatus, according to an example embodiment;
fig. 13 is a schematic diagram illustrating a structure of an audio labeling apparatus according to an exemplary embodiment.
Detailed Description
Reference will now be made in detail to exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, the same numbers in different drawings refer to the same or similar elements unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all implementations consistent with the present application. Rather, they are merely examples of apparatus and methods consistent with some aspects of the present application as detailed in the appended claims.
Referring to fig. 1, fig. 1 is an exemplary diagram of an implementation environment to which the present application relates.
As shown in fig. 1, the implementation environment includes a terminal 100 and a server 200, where the terminal 100 and the server 200 pre-establish a communication connection for data interaction. The terminal 100 is provided with an audio annotation program, which provides an audio annotation interface for displaying information required for audio annotation. The server 200 is configured to provide a data service for an audio annotation program running in the terminal 100, for example, the terminal 100 needs to obtain information required by an audio annotation from the server 200 for displaying, and an audio annotation corpus generated by the terminal 100 is stored in the server 200.
It should be noted that the terminal 100 may be a terminal device such as a desktop computer, a notebook computer, or a tablet computer, and the server 200 may be an independent physical server, a server cluster or distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as cloud databases, cloud computing, cloud functions, cloud storage, big data, and artificial intelligence platforms.
Fig. 2 is a flow chart illustrating an audio annotation method that may be applied to the terminal 100 in the implementation environment shown in fig. 1, according to an exemplary embodiment. As shown in fig. 2, in an exemplary embodiment, the audio labeling method at least includes the following steps:
Step 110, displaying the identity information of the labeling object, the video containing the labeling object and the audio graphics corresponding to the audio playing progress of the video.
As described above, in order to obtain the machine learning model for automatically identifying the speaker, a large amount of audio annotation corpus is required to train the machine learning model until the machine learning model has a better speaker identification effect.
The audio annotation corpus is usually annotated manually by an annotator. For example, after obtaining an audio file to be annotated, the annotator identifies the identity of the speaker by playing the audio file and then annotates the audio file according to that identity, thereby obtaining an audio annotation corpus. However, it is very difficult to distinguish the identity of the speaker by the annotator's hearing alone, and the labeling accuracy is not high, so the efficiency of obtaining audio annotation corpus is very low.
In order to solve the above problems, the present embodiment provides an audio labeling method, which is capable of quickly performing audio labeling by combining multidimensional information, has high labeling accuracy, and is suitable for obtaining a large amount of audio labeling corpuses.
In this embodiment, the annotation object refers to the speaker to be identified, and the identity information of the annotation object is used to describe the identity of the speaker, for example including information such as the speaker's name and image. The video containing the annotation object refers to a video whose image frames contain the annotation object, and the audio graphic is used to describe the audio playing progress of the video.
It should be understood that video files typically contain video data and audio data, and that audio data is played in synchronization with the video data, so that the audio playing progress displayed in the audio graphics is synchronized with the video playing progress during the video playing process.
The audio graphics are displayed according to the audio data, for example, the audio graphics are displayed in the form of a progress bar, the total length of the progress bar corresponds to the total duration of the video, and the progress change in the progress bar is continuously adjusted in the playing process of the video. The audio graphics may also be displayed as an audio waveform, which may not only display the audio playing progress of the video, but also display the fluctuation of the audio sampling rate, which is not limited herein.
As described above, the audio annotation corpus related to the annotation object can be generated based on the identity information of the annotation object and the voice data corresponding to the annotation object. Since the identity information of the annotation object is known, in this embodiment the voice data of the annotation object needs to be obtained according to the displayed identity information, the video containing the annotation object, and the audio graphic.
In one embodiment, the identity information of the annotation object, the video containing the annotation object, and the audio graphic corresponding to the audio playing progress of the video are displayed in different areas of an audio annotation interface, which is a visual display interface for quickly performing audio annotation based on the identity information, the video, and the audio graphic displayed on it. For example, annotators can distinguish the voice of the annotation object from multiple dimensions according to the identity information, the video, and the audio graphic displayed on the audio annotation interface, thereby improving the speed and accuracy of audio annotation.
Step 130, synchronously playing the video containing the labeling object and the audio graphics corresponding to the audio playing progress of the video, and detecting the audio time period matched with the identity information of the labeling object in the audio graphics.
As described above, the audio playing progress of the audio graphic display is synchronized with the playing progress of the video containing the annotation object, that is, the audio playing progress of the audio graphic display is synchronized with the playing progress of the video at the same playing time. For any one of the play times of the video play, the audio play progress displayed in the audio graphic will correspond to the same play time.
Therefore, in the playing process of the video, the audio graphic continuously changes the real-time playing progress along with the playing of the video, so that the audio playing progress displayed by the audio graphic and the playing progress of the video correspond to the same playing time, and the audio graphic and the video are ensured to be synchronously played. Thus, the play progress change section displayed in the audio graphic is referred to as an audio period in the audio graphic.
And because the audio graphics are displayed according to the audio data corresponding to the video, the audio time period in the audio graphics corresponds to one section of data in the audio data, so that the audio time period matched with the identity information of the labeling object in the audio graphics corresponds to the voice data of the labeling object contained in the audio data corresponding to the video.
Therefore, in the playing process of the video, the audio time period corresponding to the voice data of the labeling object can be positioned in the audio graph by identifying the voice data corresponding to the labeling object contained in the synchronously played audio data.
Illustratively, during the playing of the video, by identifying the degree of similarity between the played audio and the voice features of the annotation object, the audio period during which audio similar to the voice features of the annotation object is continuously played is determined as the audio period that matches the identity information of the annotation object.
The voice characteristics of the annotation object may include information such as timbre and pitch, and this voice characteristic information can be obtained from the displayed identity information of the annotation object and the video containing the annotation object. For example, in the case where the identity information of the annotation object is known, a video picture in which the speaker is the annotation object is acquired in advance from the video, and the timbre and pitch of the annotation object are determined from the audio played in synchronization with that video picture.
In another embodiment, the matching degree between the playing picture and the image of the labeling object is identified in the playing process of the video according to the displayed image of the labeling object, and the audio period corresponding to the video picture continuously displayed and matched with the image of the labeling object is determined as the audio period matched with the identity information of the labeling object.
In yet other embodiments, the audio period matching the identity information of the annotation object is determined in conjunction with manual operation by the annotator. According to the displayed identity information of the annotation object and the video containing the annotation object, the annotator first plays the video to acquire voice characteristics such as the timbre and pitch of the annotation object, then replays the video and determines the audio period corresponding to the voice data of the annotation object according to the voice characteristics acquired in advance.
For example, when starting playing and ending playing the voice of the labeling object, the labeling personnel inputs a preset audio selection instruction through a device such as a mouse or a keyboard, and the audio graphics can display a corresponding starting playing progress point and ending playing progress point when the audio selection instruction is input, so that an audio period between the starting playing progress point and the ending playing progress point can be determined to be an audio period matched with the identity information of the labeling object.
It should be noted that, the above manner of detecting the audio period matching with the identity information of the labeling object is merely an example, and does not represent that the present embodiment limits the manner of detecting the audio period.
The voice characteristics of the annotation object are obtained in advance through the displayed video, the audio graphic, and the identity information of the annotation object, and the audio period matching the identity information of the annotation object is accurately located in the audio graphic as the video plays, so that the voice data of the annotation object can be accurately located in the audio data corresponding to the video and accurate annotation corpus is obtained.
And step 150, generating an audio annotation corpus according to the audio corresponding to the audio period and the identity information of the annotation object.
As described above, the audio time period in the audio graphics, which is matched with the identity information of the labeling object, corresponds to the voice data of the labeling object, so that the voice data of the labeling object can be labeled according to the identity information of the labeling object, and further the audio labeling corpus corresponding to the labeling object is generated.
Specifically, based on the audio period in the audio graphic that matches the identity information of the annotation object, a corresponding segment of audio data is located in the audio data corresponding to the video; this segment is the voice data of the annotation object. The located audio data and the identity information of the annotation object are then stored in association, that is, the voice data of the annotation object is labeled, and the voice annotation information is obtained.
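As a minimal illustration only (the object shape and field names below are assumptions, not the patent's data format), the association step could be sketched in JavaScript as follows:

```javascript
// Sketch: associate the located audio period with the annotation object's identity.
// buildCorpusEntry() and its field names are illustrative assumptions.
function buildCorpusEntry(audioPeriod, annotationObject) {
  return {
    startTime: audioPeriod.start,    // seconds from the beginning of the video
    endTime: audioPeriod.end,
    speakerId: annotationObject.id,  // identity information of the annotation object
    speakerName: annotationObject.name,
  };
}

const entry = buildCorpusEntry({ start: 64.97, end: 94.99 }, { id: 1, name: 'Zhang San' });
// The entry can then be stored in association with the sliced audio data on the server.
```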
In one embodiment, the audio data corresponding to the video contains voice data of a plurality of segments of annotation objects, so that in the process of playing the video, a plurality of audio time periods matched with the identity information of the annotation objects are detected in the audio graph, and each audio annotation corpus is generated according to each audio time period and the identity information of the annotation objects. That is, the embodiment can conveniently and rapidly obtain a plurality of audio annotation corpora about the annotation objects.
In another embodiment, the video contains a plurality of annotation objects, the identity information of each annotation object is correspondingly displayed, during the playing process of the video, an audio period matched with the identity information of each annotation object is detected in the audio graph, and an audio annotation corpus is generated according to the detected audio period and the identity information of the corresponding annotation object. Therefore, the embodiment can also conveniently and rapidly obtain a plurality of audio annotation corpuses, and the annotation objects corresponding to each audio annotation corpus may be different.
Therefore, in this embodiment, the identity information of the annotation object, the video containing the annotation object, and the audio graphic corresponding to the audio playing progress of the video are displayed; during the synchronous playing of the video and the audio graphic, the audio period matching the identity information of the annotation object is detected in the audio graphic; and the audio annotation corpus is generated from the detected audio period and the identity information of the annotation object. In the audio annotation process, multiple sources of information are combined to identify the audio data corresponding to different annotation objects, or multiple segments of audio data of one annotation object, and the audio annotation corpus is generated automatically, which greatly improves audio annotation efficiency and is suitable for obtaining a large amount of audio annotation corpus.
FIG. 3 is a flow chart of step 110 in an exemplary embodiment of the embodiment shown in FIG. 2. As shown in fig. 3, in an exemplary embodiment, step 110 includes at least the steps of:
Step 111, obtaining the identity information of the annotation object, the video data containing the annotation object, and the audio sampling rate data corresponding to the video.
As described above, the identity information of the labeling object may include information such as a name and an image, and the image of the labeling object contains the face feature of the labeling object, so that the identity of the labeling object displayed in the video can be accurately identified according to the image of the labeling object.
By way of example, the image of the annotation object may be in an image format such as JPEG (Joint Photographic Experts Group) or PNG (Portable Network Graphics), which is not limited herein.
The video data containing the annotation object may be in a video format such as MP4 (a compression coding standard for audio and video information) or MPEG (Moving Picture Experts Group format), which is not limited herein.
The audio sampling rate data corresponding to the video data may be extracted from the video data by FFmpeg, an open-source set of programs for recording, converting, and streaming digital audio and video.
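For illustration only, the extraction could be driven from Node.js by invoking FFmpeg as a child process; the file names and sampling parameters below are assumptions:

```javascript
// Sketch: extract the audio track from a video file with FFmpeg (Node.js child process).
// File names and sampling parameters are placeholders, not values prescribed by the patent.
const { execFile } = require('child_process');

execFile('ffmpeg', [
  '-i', 'input.mp4', // source video containing the annotation object
  '-vn',             // drop the video stream, keep audio only
  '-ar', '16000',    // resample the audio to 16 kHz
  '-ac', '1',        // single (mono) channel
  'audio.wav',
], (err) => {
  if (err) throw err;
  console.log('audio track extracted to audio.wav');
});
```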
In one embodiment, the identity information of the labeling object, the video data containing the labeling object, and the audio sampling rate data corresponding to the video are stored locally, for example, before audio labeling, the data are obtained in advance through copying, downloading, etc., and the data are stored locally, so that the data can be obtained from the local when audio labeling is performed.
In another embodiment, the identity information of the labeling object, the video data containing the labeling object, and the audio sampling rate data corresponding to the video are stored in the server, so that the data needs to be queried from the server.
It should be noted that, the data may be obtained by querying from a server side or locally at the same time, or may be obtained by querying separately, which is not limited herein.
Step 113, displaying the video according to the video data, displaying the identity information of the annotation object, and drawing an audio graphic corresponding to the playing progress of the video according to the audio sampling rate data.
For the display of the video data and the identity information of the annotation object, the identity information of the annotation object and the video data may be configured as HTML (HyperText Markup Language) tags matching their respective data types, and the identity information of the annotation object and the video containing the annotation object may then be displayed according to the configured HTML tags.
By way of example, the name of the annotation object may be configured as a <table> tag, the image of the annotation object as an <img> tag, and the video data as a <video> tag, and the corresponding data may be displayed in the annotation interface via these HTML tags.
Drawing of the audio graphic may be accomplished by calling the Canvas API, a technology for generating images in real time in a web page and manipulating their contents; the audio graphic is rendered on the audio annotation interface using JavaScript through the HTML5 canvas element.
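A minimal Canvas sketch of this kind of rendering is shown below; it assumes the audio sampling data has already been reduced to an array of normalized amplitudes, and the element id and variable names are illustrative assumptions:

```javascript
// Sketch: draw an audio waveform on a <canvas> element from samples in the range [-1, 1].
function drawWaveform(ctx, samples, width, height) {
  ctx.clearRect(0, 0, width, height);
  ctx.beginPath();
  const mid = height / 2;
  const step = samples.length / width;       // samples per horizontal pixel
  for (let x = 0; x < width; x++) {
    const s = samples[Math.floor(x * step)]; // one sample per pixel column
    ctx.moveTo(x, mid - s * mid);
    ctx.lineTo(x, mid + s * mid);
  }
  ctx.stroke();
}

const canvas = document.getElementById('waveform'); // assumed element id
const samplesFromServer = [];                        // assumed: normalized audio sampling data returned by the server
drawWaveform(canvas.getContext('2d'), samplesFromServer, canvas.width, canvas.height);
```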
FIG. 4 is a schematic diagram of an audio annotation interface, according to an example embodiment. As shown in fig. 4, the audio labeling interface is divided into several interface areas, and according to the HTML tag configured above, the image and name of the labeling object, and the video containing the labeling object can be displayed in the corresponding interface areas, and according to the audio sampling rate data corresponding to the video, the audio graphics are drawn in the corresponding interface areas.
It is understood that the audio graph drawn based on the audio sampling rate data is an audio waveform graph, the audio waveform graph reflects the fluctuation condition of the audio sampling rate, and experienced labeling personnel can identify the audio time period matched with the identity information of the labeling object according to the fluctuation condition, so that the corpus labeling efficiency of the labeling personnel is improved.
In yet other embodiments, the audio sample rate data is obtained by sampling the audio of the annotation object from the video data, so that only the audio fluctuations of the annotation object are displayed in the audio graphics.
For example, in the audio annotation interface shown in fig. 5, the audio waveform diagram is drawn according to the audio sampling rate data, only the audio sampling rate fluctuation condition of the annotation object is displayed, and as the video is played, the cursor used for identifying the real-time playing progress point in the audio diagram also changes synchronously. The labeling personnel can more easily recognize the voice data of the labeling object according to the audio labeling interface shown in fig. 5, and the efficiency and accuracy of audio labeling can be further improved.
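One possible way to keep such a cursor in step with the video, given purely as an assumption and reusing the drawWaveform sketch above, is to redraw it from the HTML5 video element's currentTime on each animation frame:

```javascript
// Sketch: keep the audio-graphic cursor in sync with video playback.
// The element ids and drawWaveform()/samplesFromServer are the assumptions introduced above.
const video = document.getElementById('video');

function drawCursor() {
  const ctx = canvas.getContext('2d');
  const progress = video.currentTime / video.duration; // playing progress in [0, 1]
  const x = progress * canvas.width;                    // cursor position in the graphic
  drawWaveform(ctx, samplesFromServer, canvas.width, canvas.height);
  ctx.beginPath();
  ctx.moveTo(x, 0);
  ctx.lineTo(x, canvas.height);
  ctx.stroke();                                         // vertical line marks the real-time progress point
  if (!video.paused) requestAnimationFrame(drawCursor); // keep moving while the video plays
}

video.addEventListener('play', () => requestAnimationFrame(drawCursor));
```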
Fig. 6 is a flow chart of step 111 in an exemplary embodiment of the embodiment shown in fig. 3. As shown in fig. 6, step 111 may include the steps of:
step 210, detecting an input labeling object identifier;
step 230, initiating a data query request to the server according to the detected labeling object identifier;
step 250, receiving a query result returned by the server according to the data query request, wherein the query result contains video data, identity information and audio sampling rate data corresponding to the labeling object identifier.
Firstly, it should be noted that in this embodiment, the identity information of the labeling object, the video data containing the labeling object, and the audio sampling rate data corresponding to the video are all stored in association in the server based on the labeling object identifier, so that the data needs to be queried from the server according to the labeling object identifier to display the data.
The labeling object identifier may be an ID (identity document) of the labeling object, or information such as a name of the labeling object, which is not limited in this regard.
To determine the data query target, the annotation interface will detect the input annotation object identification. The audio labeling interface is provided with a data query area, and labeling personnel can initiate a data query request to the server by inputting identification information of a labeling object to be labeled in the data query area. The server side can find the identity information, video data and audio sampling rate data stored in association with the identification object identification in the database according to the received identification object identification, and returns the data to the audio annotation interface for display.
In one embodiment, to reduce redundant information in the database of the server, the identity information of the labeling object, the video data containing the labeling object, and the audio sampling rate data corresponding to the video may be stored in the server in a structured manner through two data tables.
Illustratively, one of the data tables is used to store basic information of the annotation object, such as the annotation object ID, name, and image shown in table 1 below.
ID    Name    Image
1     ***     ***.jpg
2     ***     ***.jpg
Table 1
The other data table is used for storing video data containing the labeling objects and audio sampling rate data corresponding to the video, and storing the corresponding relation between the video and the labeling objects, as shown in the following table 2:
ID    Video data    Audio sample rate data
1     ***.mp4       ***
2     ***.mp4       ***
1     ***.mp4       ***
1     ***.mp4       ***
Table 2
As can be seen from Table 2, at least one piece of video data may be stored in association with the ID of one annotation object.
According to the ID of the annotation object, the server queries the data related to the annotation object in each data table through a table join, obtaining the result shown in Table 3:
ID    Name    Image      Video data    Audio sample rate data
1     ***     ***.jpg    ***.mp4       ***
1     ***     ***.jpg    ***.mp4       ***
1     ***     ***.jpg    ***.mp4       ***
2     ***     ***.jpg    ***.mp4       ***
Table 3
Therefore, according to the annotation object ID contained in the query request, the server returns the name and image of the annotation object, at least one piece of video data containing the annotation object, and the audio sampling rate data corresponding to each piece of video data.
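For illustration, the table join could be expressed as a SQL query issued by the server; the table names, column names, and database driver below are assumptions based on Tables 1 and 2:

```javascript
// Sketch: server-side join of the two data tables by annotation object ID (Node.js).
// Table and column names are assumptions; `db` stands for an abstract database client.
const sql = `
  SELECT o.id, o.name, o.image, v.video_data, v.audio_sample_rate_data
  FROM annotation_objects AS o
  JOIN videos AS v ON v.object_id = o.id
  WHERE o.id = ?`;

async function queryAnnotationData(db, objectId) {
  // Returns all rows associated with the annotation object ID, as in Table 3.
  return db.query(sql, [objectId]);
}
```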
In another exemplary embodiment, in order to ensure the security of the database, the server is further provided with database access rights, so the related data needs to be queried from the server according to the audio annotation account information logged into the audio annotation interface.
The audio annotation interface generates a data query request according to the input annotation object identification and the logged account information for audio annotation, and sends the data query request to the server, so that the server returns related data stored in association with the annotation object identification in the database to the audio annotation interface after verifying that the account information has the database access right. And the audio annotation interface can execute the display of the related data according to the returned related data.
An exemplary flowchart of the audio annotation interface displaying related data according to the data query process provided in this embodiment is shown in fig. 7. The server queries all relevant data to be audio-annotated from the data tables stored in the database through a table join, and retrieves the data related to the annotation object according to the annotation object identifier contained in the data query request, including the name, the image, the video data containing the annotation object, and the audio sampling rate data extracted from the video data. The annotation interface configures corresponding HTML tags according to these data, performs the corresponding display according to the configured tags, and draws the audio graphic according to the audio sampling rate data. The related information displayed in the audio annotation interface gives annotators a reliable way to recognize the voice of the annotation object, greatly improving their audio annotation efficiency.
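A minimal sketch of the query request issued by the annotation interface is given below, assuming a JSON HTTP endpoint; the URL, field names, and token handling are assumptions:

```javascript
// Sketch: request the data to be annotated from the server, carrying the logged-in
// audio annotation account so the server can verify database access rights.
async function fetchAnnotationData(objectId, accountToken) {
  const response = await fetch('/api/annotation-data', { // assumed endpoint
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ objectId, accountToken }),
  });
  if (!response.ok) throw new Error('query rejected: ' + response.status);
  return response.json(); // e.g. { name, image, videos: [{ videoData, audioSampleRateData }, ...] }
}
```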
Fig. 8 is a flow chart of step 130 in an exemplary embodiment of the embodiment shown in fig. 2. As shown in fig. 8, in one exemplary embodiment, step 130 may include the steps of:
step 131, synchronously playing video and audio graphics when a video playing instruction is detected;
firstly, it should be noted that the video playing instruction is a preset instruction for indicating playing or pausing of the video and audio graphics.
The video playing command may be input through a device such as a mouse or a keyboard, or may be input through touching an audio annotation interface on which video, audio graphics and identity information of the annotation object are displayed, which is not limited in this embodiment. For example, when the annotator clicks on a video play/pause button displayed in the audio annotation interface, it is considered that a video play instruction is detected.
Step 133, when the audio selection instruction is detected, locating an audio period in the audio graphic that matches the identity information of the annotation object.
The audio selection instruction is also a preset instruction, and is used for positioning the starting position and the ending position of the audio time period matched with the identity information of the labeling object in the audio graph. When the number of the labeling objects is multiple, different audio selection instructions can be set for each labeling object respectively so as to distinguish different labeling objects according to the audio selection instructions, thereby ensuring that the labeling object identity in the generated audio labeling corpus is accurate and further ensuring the accuracy of the audio labeling corpus.
The audio selection command may be input through a device such as a mouse or a keyboard, or through touching an annotation interface on which video, audio graphics, and identity information of an annotation object are displayed, which is not limited in this embodiment.
For example, in a specific embodiment, during video playing, when the annotator recognizes that the currently played audio is the voice of the annotation object, the annotator inputs an audio selection instruction by pressing the Enter key on the keyboard, and inputs the audio selection instruction again by pressing the Enter key when the voice of the annotation object finishes playing, so that the start position and the end position of the audio period can be determined from the two audio selection instructions, and the audio period matching the identity information of the annotation object is obtained.
To locate the audio period in the audio graphic that matches the identity information of the annotation object, it is necessary to locate the start position and the end position of the audio period in the audio graphic according to the audio selection instruction, so that the audio period between the start position and the end position is determined as the audio period matching the identity information of the annotation object.
In one embodiment, when the audio selection instruction is detected an odd number of times, the real-time play progress point displayed in the audio graphic is determined as a start position, and when the audio selection instruction is detected an even number of times, the real-time play progress point displayed in the audio graphic is determined as an end position corresponding to the start position determined when the audio selection instruction was detected last time.
For example, if the audio data corresponding to the video contains only one segment of voice data of the labeling object, the audio selection instruction input twice is detected, the real-time playing progress point displayed in the audio graph when the audio selection instruction is input for the first time is determined as a starting position, and correspondingly, the real-time playing progress point displayed in the audio graph when the audio selection instruction is input for the second time is determined as an ending position, and the time period between the starting position and the ending position is matched with the identity information of the labeling object.
Similarly, if the audio data corresponding to the video contains voice data of a plurality of sections of marked objects, determining a real-time playing progress point displayed in the audio graph when the audio selection instruction is input for the odd number of times as a starting position, and determining a real-time playing progress point displayed in the audio graph when the audio selection instruction is input for the even number of times as an ending position corresponding to the starting position determined when the audio selection instruction is input for the previous time. Thus, for each determined starting position, a corresponding ending position is obtained, and a plurality of audio time periods matched with the identity information of the labeling object are obtained.
If the audio data corresponding to the video contains voice data of a plurality of annotation objects, when an input audio selection instruction is detected, the corresponding annotation objects are determined according to the audio selection instruction, so that the identity information of the annotation objects corresponding to each audio period is obtained while a plurality of audio periods are obtained.
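The odd/even behaviour described above could be realized with a simple key handler; the sketch below is one possible implementation offered as an assumption, not the patent's code:

```javascript
// Sketch: toggle between start and end positions on successive audio selection
// instructions (here, presses of the Enter key). `video` is the playing <video> element.
const periods = [];      // collected audio periods matching the annotation object
let pendingStart = null; // start position waiting for its corresponding end position

document.addEventListener('keydown', (event) => {
  if (event.key !== 'Enter') return;
  const progressPoint = video.currentTime; // real-time playing progress point
  if (pendingStart === null) {
    pendingStart = progressPoint;          // odd-numbered instruction: start position
  } else {
    periods.push({ start: pendingStart, end: progressPoint }); // even-numbered: end position
    pendingStart = null;
  }
});
```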
In another embodiment, after the audio period matched with the identity information of the labeling object is located in the audio graph according to the audio selection instruction, the located audio period can be further fine-tuned, so that the finally determined audio period is more accurate.
As shown in fig. 9, in the audio graphic displayed on the audio annotation interface, the start position and the end position of the audio period are marked with marking lines, so that the annotator can drag the marking lines at the two ends of the audio period with the mouse to fine-tune the audio period.
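One possible way to support such dragging on the canvas, given as an assumption (timeFromX() and redrawMarkers() are hypothetical helpers), is:

```javascript
// Sketch: fine-tune a period boundary by dragging the nearest marking line on the canvas.
const period = periods[periods.length - 1]; // e.g. the most recently marked period (see sketch above)
let dragging = null;                        // 'start' | 'end' | null

canvas.addEventListener('mousedown', (e) => {
  const t = timeFromX(e.offsetX); // convert pixel position to playing time (assumed helper)
  dragging = Math.abs(t - period.start) < Math.abs(t - period.end) ? 'start' : 'end';
});
canvas.addEventListener('mousemove', (e) => {
  if (!dragging) return;
  period[dragging] = timeFromX(e.offsetX); // move the selected boundary
  redrawMarkers(period);                   // redraw the marking lines (assumed helper)
});
canvas.addEventListener('mouseup', () => { dragging = null; });
```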
Therefore, in the embodiment, by detecting the input audio selection instruction in the video playing process and positioning the audio time period matched with the identity information of the labeling object in the audio graph according to the detected audio selection instruction, at least one section of voice data of at least one labeling object can be accurately positioned in the audio data corresponding to the video, so that the acquisition scene of a large number of audio labeling corpuses is satisfied.
Fig. 10 is a flowchart illustrating an audio annotation method according to another exemplary embodiment. As shown in fig. 10, in an exemplary embodiment, the audio labeling method further includes the steps of:
Step 310, capturing a target image containing a speaker picture from a pre-collected video;
in step 330, the speaker is taken as the labeling object contained in the video, and the target image is taken as the identity information of the labeling object.
Considering that the acquisition of the audio annotation corpus is also related to the collection process of the video data, if the collection efficiency of the video data can be improved, the acquisition efficiency of the audio annotation corpus can be improved to a certain extent, so that the embodiment is improved aiming at the collection process of the video data.
In order to quickly obtain a large amount of video, a web crawler (also called a web spider or a web robot, which is a program or script for automatically capturing web information according to a certain rule) is generally used to automatically collect video data in a network, but labeling objects contained in the video data need to be manually added by a person responsible for video collection, which is quite inefficient.
When video is recorded, a camera is usually aimed at a target person when the target person speaks, so that in the process of playing the recorded video, a speaker picture and speaker audio are synchronously played, and a viewer can obtain better viewing experience.
Based on this, the embodiment intercepts the target image containing the speaker picture from the video collected in advance, takes the speaker as the labeling object contained in the video, and takes the target image as the identity information of the labeling object, so that the identity information of the labeling object can be accurately obtained.
In one embodiment, face recognition is performed on each video collected in advance, and the videos containing face features are determined, so that target images can be captured from those videos. Specifically, face recognition is performed on each collected video in advance; if a video does not contain face features, no annotation object can be determined for it, and the video is not used for audio annotation. For videos containing face features, target images containing speaker pictures are captured from the videos, and the videos are used for subsequent audio annotation.
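Purely as an illustrative sketch (detectFaces() and captureSpeakerFrame() are hypothetical helpers, not a specific library API), the filtering step could look like this:

```javascript
// Sketch: keep only videos whose frames contain face features, then capture a target
// image of the speaker to serve as the annotation object's identity information.
async function filterAndCapture(videos) {
  const results = [];
  for (const video of videos) {
    const faces = await detectFaces(video); // hypothetical face recognition step
    if (faces.length === 0) continue;       // no annotation object can be determined; skip
    const targetImage = await captureSpeakerFrame(video, faces[0]); // speaker picture
    results.push({ video, targetImage });
  }
  return results;
}
```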
Therefore, the method and the device can automatically generate the identity information of the labeling object contained in the video aiming at the collected video, and the personnel responsible for video collection are not required to manually add the labeling object, so that the audio labeling efficiency is improved as a whole. The embodiment can also rapidly filter the video without speaker pictures, and can ensure that the video data for the subsequent audio annotation is effective.
In order to facilitate understanding of the technical essence of the present application, the audio labeling method proposed in the above embodiment will be described in detail below with a specific application scenario.
An exemplary audio annotation flow is shown in fig. 11, comprising three stages: displaying the data sources to be annotated, annotating the data, and obtaining the annotation result. The data sources to be annotated comprise the image and name of the annotation object and the video and audio sampling data containing the annotation object. As shown in fig. 4, 5 and 9, these data sources are all displayed in the audio annotation interface of the terminal, with the audio sampling rate shown as an audio waveform. The annotator determines the audio period corresponding to the annotation object during the playing of the video, and the terminal cuts the audio segment out of the audio waveform according to the determined audio period, so that the annotation result is generated from the cut audio segment and the name of the annotation object. It should be understood that slicing the audio segment is the process of obtaining the corresponding audio data according to the determined audio period.
Specifically, an audio annotation account is logged in an audio annotation interface of the terminal, an annotator operates the audio annotation interface to enable the terminal to inquire a data source to be annotated from a server, and the server returns the related data source to the terminal after verifying that the audio annotation account has the inquiry authority.
After the terminal receives the related data sources, each data source is read in sequence according to the data sequence, the image, the name and the first frame image of the video containing the marked object are displayed in the audio marking interface, and an audio waveform diagram is drawn on the audio marking interface according to the audio sampling rate data.
When the terminal receives a video playing instruction clicked by the annotator, the audio annotation interface starts playing the video, and the cursor in the audio graphic moves in synchronization with the playing time of the video. When the video plays into an audio period of the annotation object, the annotator quickly marks the audio period by pressing the Enter key on the keyboard; the terminal obtains the start time point and the end time point of the annotation object's speech from the marked audio period, and labels the time period between the start time point and the end time point with the name of the annotation object, so that an audio annotation corpus is generated and the annotation of the video is completed.
The generated audio annotation corpus is an xml (eXtensible Markup Language ) file, for example:
<Turn startime="64.972826" endtime="94.993206" type="Zhang san">
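A minimal sketch of producing such a Turn element from a labeled period follows; the attribute names match the example above, while the entry object and the omission of XML escaping are assumptions made for brevity:

```javascript
// Sketch: serialize one labeled audio period as a Turn element of the xml annotation corpus.
function toTurnElement(entry) {
  return `<Turn startime="${entry.startTime}" endtime="${entry.endTime}" type="${entry.speakerName}">`;
}

console.log(toTurnElement({ startTime: 64.972826, endTime: 94.993206, speakerName: 'Zhang san' }));
// -> <Turn startime="64.972826" endtime="94.993206" type="Zhang san">
```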
If the speaker identity is distinguished and the audio is labeled only by playing the audio file, an annotator can label on average about 10 minutes of audio per hour; after the method of the present application is adopted, an annotator can label on average 25 to 35 minutes of audio per hour, and the labeling accuracy is improved from 80% to 95%.
Therefore, the audio labeling method provided by the application can enable labeling staff to rapidly finish labeling work of a large number of videos, and the labeling efficiency is greatly improved.
Fig. 12 is a block diagram illustrating an audio tagging apparatus that may be adapted for use with the terminal 100 in the implementation environment shown in fig. 1, according to an example embodiment. As shown in fig. 12, the audio labeling apparatus includes an information display module 410, a play detection module 430, and an information generation module 450.
The information display module 410 is configured to display identity information of the labeling object, a video containing the labeling object, and an audio graphic corresponding to an audio playing progress of the video. The play detection module 430 is used for synchronously playing video and audio graphics, and detecting an audio period matched with the identity information of the labeling object in the audio graphics. The information generating module 450 is configured to generate an audio annotation corpus according to the audio corresponding to the audio period and the identity information of the annotation object.
In another exemplary embodiment, the play detection module 430 includes an instruction detection unit and an audio localization unit. The instruction detection unit is used for synchronously playing video and audio graphics when the video playing instruction is detected. The audio positioning unit is used for positioning an audio time period matched with the identity information of the labeling object in the audio graph when the audio selection instruction is detected.
In another exemplary embodiment, the audio locating unit includes a position locating subunit and a match determining subunit. The position locating subunit is used for locating the starting position and the ending position of the audio period in the audio graph according to the audio selection instruction. The matching determination subunit is configured to determine an audio period between the start position and the end position as an audio period that matches the identity information of the annotation object.
In another exemplary embodiment, the position location subunit includes a start position determination subunit and an end position determination subunit. The start position determining subunit is configured to determine, as a start position, a real-time playing progress point displayed in the audio graphics when the audio selection instruction is detected an odd number of times. The end position determining subunit is configured to determine, when the audio selection instruction is detected an even-numbered time, a real-time play progress point displayed in the audio graphic as an end position corresponding to a start position determined when the audio selection instruction was detected last time.
In another exemplary embodiment, the information display module 410 includes a sampling rate acquisition unit and a graphics rendering unit. The sampling rate acquisition unit is used for acquiring audio sampling rate data corresponding to the video. The graphics drawing unit is used for drawing audio graphics corresponding to the audio playing progress of the video according to the audio sampling rate data.
In another exemplary embodiment, the information display module 410 further includes a data acquisition unit, a tag configuration unit, and a tag display unit. The data acquisition unit is used for acquiring the identity information of the marked object and the video data containing the marked object. The tag configuration unit is used for respectively configuring the identification information of the labeling object and the video data into HTML tags matched with respective data types. The tag display unit is used for displaying the identity information of the marked object according to the HTML tag and displaying the video containing the marked object.
In another exemplary embodiment, the data acquisition unit includes an identification detection subunit, a request transmission subunit, and a result reception subunit. The mark detection subunit is used for detecting the input mark object mark. The request sending subunit is used for initiating a data query request to the server according to the detected labeling object identification and the account information for audio labeling. The result receiving subunit is used for receiving a query result returned by the server according to the data query request, wherein the query result contains video data corresponding to the labeling object identifier and identity information.
In another exemplary embodiment, the apparatus further comprises an image capture module and an information acquisition module. The image intercepting module is used for intercepting a target image containing a speaker picture from a pre-collected video. The information acquisition module is used for taking a speaker as a labeling object contained in the video and taking a target image as identity information of the labeling object.
In another exemplary embodiment, the image capture module includes a face recognition unit and a feature capture unit. The face recognition unit is used for recognizing the face of each video collected in advance and determining the video containing the face characteristics. The feature intercepting unit is used for intercepting a target image from the video containing the face features.
In another exemplary embodiment, the information generation module 450 includes an audio data acquisition unit and an association storage unit. The audio data acquisition unit is used for acquiring the audio data corresponding to the audio period, and the association storage unit is used for storing the audio data in association with the identity information of the annotation object to obtain an audio annotation corpus.
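As a minimal illustration, an entry of the audio annotation corpus might simply pair the clipped audio with the identity information of the annotation object, as sketched below; the record shape shown is an assumption rather than a required format.

    // Sketch only: the record shape of a corpus entry is an assumption.
    interface CorpusEntry {
      objectId: string;          // annotation object identifier
      identityImageUrl: string;  // identity information of the annotation object
      audio: Blob;               // audio data clipped for the selected audio period
      startSec: number;
      endSec: number;
    }

    const audioAnnotationCorpus: CorpusEntry[] = [];

    function addCorpusEntry(entry: CorpusEntry): void {
      // Storing the audio data together with the identity information ("association
      // storage") is what turns a clipped segment into a usable corpus sample.
      audioAnnotationCorpus.push(entry);
    }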
It should be noted that the apparatus provided in the foregoing embodiments and the method provided in the foregoing embodiments are based on the same concept, and the specific manner in which each module and unit performs its operations has been described in detail in the method embodiments, so it is not repeated herein.
Embodiments of the present application also provide an audio labeling device comprising a processor and a memory, wherein the memory stores computer readable instructions that, when executed by the processor, implement the audio labeling method described above.
Fig. 13 is a schematic diagram illustrating the structure of an audio labeling device according to an exemplary embodiment.
It should be noted that the audio labeling device is only an example adapted to the present application and should not be construed as limiting the scope of use of the present application in any way. Nor should the audio labeling device be construed as necessarily relying on, or necessarily including, one or more components of the exemplary audio labeling device shown in Fig. 13.
As shown in Fig. 13, in an exemplary embodiment, the audio labeling device includes a processing component 501, a memory 502, a power supply component 503, a multimedia component 504, an audio component 505, a sensor component 507, and a communication component 508. Not all of these components are required; the audio labeling device may add other components or omit some components according to its own functional requirements, which is not limited in this embodiment.
The processing component 501 generally controls the overall operation of the audio labeling device, such as operations associated with display, data communication, and log data processing. The processing component 501 may include one or more processors 509 to execute instructions to perform all or part of the steps of the operations described above. Further, the processing component 501 may include one or more modules that facilitate interaction between the processing component 501 and other components. For example, the processing component 501 may include a multimedia module to facilitate interaction between the multimedia component 504 and the processing component 501.
The memory 502 is configured to store various types of data to support operation of the audio labeling device; examples of such data include instructions for any application or method operating on the audio labeling device. The memory 502 stores one or more modules configured to be executed by the one or more processors 509 to perform all or part of the steps of the audio labeling method described in the above embodiments.
The power supply component 503 provides power to the various components of the audio labeling device. The power supply component 503 may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for the audio labeling device.
The multimedia component 504 includes a screen that provides an output interface between the audio labeling device and the user. In some embodiments, the screen may include a TP (Touch Panel) and an LCD (Liquid Crystal Display). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive input signals from the user. The touch panel includes one or more touch sensors to sense touches, swipes, and gestures on the touch panel. The touch sensors may sense not only the boundary of a touch or swipe action, but also the duration and pressure associated with the touch or swipe operation.
The audio component 505 is configured to output and/or input audio signals. For example, the audio component 505 includes a microphone configured to receive external audio signals when the audio labeling device is in an operational mode, such as a call mode, a recording mode, or a speech recognition mode. In some embodiments, the audio component 505 further comprises a speaker for outputting audio signals.
The sensor component 507 includes one or more sensors for providing status assessments of various aspects of the audio labeling device. For example, the sensor component 507 may detect the on/off state of the audio labeling device and may also detect temperature changes of the audio labeling device.
The communication component 508 is configured to facilitate wired or wireless communication between the audio labeling device and other devices. The audio labeling device may access a wireless network based on a communication standard, such as Wi-Fi (Wireless Fidelity).
It will be appreciated that the configuration shown in Fig. 13 is merely illustrative, and the audio labeling device may include more or fewer components than shown in Fig. 13, or components different from those shown in Fig. 13. Each of the components shown in Fig. 13 may be implemented in hardware, software, or a combination of the two.
Another aspect of the present application also provides a computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the audio labeling method described above. The computer readable storage medium may be contained in the audio labeling device described in the above embodiments, or may exist separately without being assembled into the audio labeling device.
The foregoing is merely a preferred exemplary embodiment of the present application and is not intended to limit the embodiments of the present application. Those skilled in the art may make changes and modifications according to the main concept and spirit of the present application, so the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (13)

1. An audio labeling method, comprising:
displaying an audio annotation interface, wherein identity information of an annotation object, a video containing the annotation object and an audio graph corresponding to the audio playing progress of the video are respectively displayed in different areas on the audio annotation interface;
synchronously playing the video and the audio graph, and detecting an audio period matched with the identity information of the annotation object in the audio graph;
acquiring audio corresponding to the audio period from the audio data corresponding to the video, and generating an audio annotation corpus according to the audio corresponding to the audio period and the identity information of the annotation object;
wherein synchronously playing the video and the audio graphics and detecting, in the audio graphics, the audio period matched with the identity information of the annotation object comprises the following steps:
synchronously playing the video and the audio graphics;
in the process of displaying the identity information of the annotation object and synchronously playing the video and the audio graphics, if an audio selection instruction is detected, locating the starting position and the ending position of an audio period in the audio graphics according to the audio selection instruction; and determining the audio period between the starting position and the ending position as the audio period that matches the identity information of the annotation object.
2. The method of claim 1, wherein playing the video and the audio graphics synchronously comprises:
synchronously playing the video and the audio graphics when a video playing instruction is detected.
3. The method of claim 1, wherein locating a start position and an end position of an audio period in the audio graphic according to the audio selection instruction comprises:
when the audio selection instruction is detected for an odd-numbered time, determining a real-time playing progress point displayed in the audio graph as the starting position;
and when the audio selection instruction is detected for an even-numbered time, determining a real-time playing progress point displayed in the audio graph as an ending position corresponding to the starting position determined when the audio selection instruction was last detected.
4. The method of claim 1, wherein displaying an audio graphic corresponding to an audio playback progress of the video comprises:
acquiring audio sampling rate data corresponding to the video;
and drawing an audio graph corresponding to the audio playing progress of the video according to the audio sampling rate data.
5. The method of claim 1, wherein the audio graphic comprises an audio waveform map for displaying audio sample rate fluctuations and audio playback progress of the video.
6. The method of claim 1, wherein displaying the identity information of the annotation object and the video containing the annotation object comprises:
acquiring the identity information of the annotation object and video data containing the annotation object;
respectively configuring the identity information of the annotation object and the video data into HTML tags matched with respective data types;
and displaying, according to the HTML tags, the identity information of the annotation object and the video containing the annotation object.
7. The method of claim 6, wherein acquiring the identity information of the annotation object and the video data containing the annotation object comprises:
detecting an input annotation object identifier;
initiating a data query request to a server according to the detected annotation object identifier and account information used for audio annotation;
and receiving a query result returned by the server according to the data query request, wherein the query result contains the video data and identity information corresponding to the annotation object identifier.
8. The method according to claim 1, wherein the method further comprises:
capturing a target image containing a speaker picture from a pre-collected video;
and taking the speaker as an annotation object contained in the video, and taking the target image as identity information of the annotation object.
9. The method of claim 8, wherein capturing a target image containing a speaker picture from a pre-collected video comprises:
performing face recognition on each pre-collected video, and determining the videos containing face features;
and capturing the target image from the video containing the face features.
10. The method of claim 1, wherein generating an audio annotation corpus from the audio corresponding to the audio period and the identity information of the annotation object comprises:
acquiring the audio data corresponding to the audio period;
and storing the audio data in association with the identity information of the annotation object to obtain the audio annotation corpus.
11. An audio labeling apparatus, comprising:
the information display module is used for displaying an audio annotation interface, and identity information of an annotation object, a video containing the annotation object and an audio graph corresponding to the audio playing progress of the video are respectively displayed in different areas on the audio annotation interface;
the play detection module is used for synchronously playing the video and the audio graphics and detecting an audio period matched with the identity information of the annotation object in the audio graphics;
the information generation module is used for acquiring the audio corresponding to the audio period from the audio data corresponding to the video, and generating an audio annotation corpus according to the audio corresponding to the audio period and the identity information of the annotation object;
wherein synchronously playing the video and the audio graphics and detecting, in the audio graphics, the audio period matched with the identity information of the annotation object comprises the following steps:
synchronously playing the video and the audio graphics;
in the process of displaying the identity information of the annotation object and synchronously playing the video and the audio graphics, if an audio selection instruction is detected, locating the starting position and the ending position of an audio period in the audio graphics according to the audio selection instruction; and determining the audio period between the starting position and the ending position as the audio period that matches the identity information of the annotation object.
12. An audio labeling apparatus, comprising:
a memory storing computer readable instructions;
a processor configured to read the computer readable instructions stored in the memory to perform the method of any one of claims 1-10.
13. A computer readable storage medium having stored thereon computer readable instructions which, when executed by a processor of a computer, cause the computer to perform the method of any of claims 1-10.
CN202010371102.XA 2020-04-30 2020-04-30 Audio labeling method, device, equipment and computer readable storage medium Active CN111629267B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010371102.XA CN111629267B (en) 2020-04-30 2020-04-30 Audio labeling method, device, equipment and computer readable storage medium

Publications (2)

Publication Number Publication Date
CN111629267A CN111629267A (en) 2020-09-04
CN111629267B true CN111629267B (en) 2023-06-09

Family

ID=72259723

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010371102.XA Active CN111629267B (en) 2020-04-30 2020-04-30 Audio labeling method, device, equipment and computer readable storage medium

Country Status (1)

Country Link
CN (1) CN111629267B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112487238B (en) * 2020-10-27 2024-05-17 百果园技术(新加坡)有限公司 Audio processing method, device, terminal and medium
CN113096643A (en) * 2021-03-25 2021-07-09 北京百度网讯科技有限公司 Video processing method and device

Family Cites Families (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2007101945A (en) * 2005-10-05 2007-04-19 Fujifilm Corp Apparatus, method, and program for processing video data with audio
US20140348394A1 (en) * 2011-09-27 2014-11-27 Picsured, Inc. Photograph digitization through the use of video photography and computer vision technology
CN102779508B (en) * 2012-03-31 2016-11-09 科大讯飞股份有限公司 Sound bank generates Apparatus for () and method therefor, speech synthesis system and method thereof
US9704488B2 (en) * 2015-03-20 2017-07-11 Microsoft Technology Licensing, Llc Communicating metadata that identifies a current speaker
US10497382B2 (en) * 2016-12-16 2019-12-03 Google Llc Associating faces with voices for speaker diarization within videos
CN109660744A (en) * 2018-10-19 2019-04-19 深圳壹账通智能科技有限公司 The double recording methods of intelligence, equipment, storage medium and device based on big data
CN109361886A (en) * 2018-10-24 2019-02-19 杭州叙简科技股份有限公司 A kind of conference video recording labeling system based on sound detection
CN110008378B (en) * 2019-01-28 2024-03-19 平安科技(深圳)有限公司 Corpus collection method, device, equipment and storage medium based on artificial intelligence
CN109814718A (en) * 2019-01-30 2019-05-28 天津大学 A kind of multi-modal information acquisition system based on Kinect V2
CN110335612A (en) * 2019-07-11 2019-10-15 招商局金融科技有限公司 Minutes generation method, device and storage medium based on speech recognition
CN110427930A (en) * 2019-07-29 2019-11-08 中国工商银行股份有限公司 Multimedia data processing method and device, electronic equipment and readable storage medium storing program for executing
CN110600010B (en) * 2019-09-20 2022-05-17 度小满科技(北京)有限公司 Corpus extraction method and apparatus

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6404925B1 (en) * 1999-03-11 2002-06-11 Fuji Xerox Co., Ltd. Methods and apparatuses for segmenting an audio-visual recording using image similarity searching and audio speaker recognition
CN101382937A (en) * 2008-07-01 2009-03-11 深圳先进技术研究院 Multimedia resource processing method based on speech recognition and on-line teaching system thereof
WO2016165346A1 (en) * 2015-09-16 2016-10-20 中兴通讯股份有限公司 Method and apparatus for storing and playing audio file
CN108337532A (en) * 2018-02-13 2018-07-27 腾讯科技(深圳)有限公司 Perform mask method, video broadcasting method, the apparatus and system of segment
CN109119063A (en) * 2018-08-31 2019-01-01 腾讯科技(深圳)有限公司 Video dubs generation method, device, equipment and storage medium
WO2020052405A1 (en) * 2018-09-10 2020-03-19 腾讯科技(深圳)有限公司 Corpus annotation set generation method and apparatus, electronic device, and storage medium
CN109815360A (en) * 2019-01-28 2019-05-28 腾讯科技(深圳)有限公司 Processing method, device and the equipment of audio data

Also Published As

Publication number Publication date
CN111629267A (en) 2020-09-04

Similar Documents

Publication Publication Date Title
US9058375B2 (en) Systems and methods for adding descriptive metadata to digital content
US20200065322A1 (en) Multimedia content tags
KR101810578B1 (en) Automatic media sharing via shutter click
US9799375B2 (en) Method and device for adjusting playback progress of video file
US7954049B2 (en) Annotating multimedia files along a timeline
US7653925B2 (en) Techniques for receiving information during multimedia presentations and communicating the information
US8930308B1 (en) Methods and systems of associating metadata with media
CN106971009B (en) Voice database generation method and device, storage medium and electronic equipment
CN110740389B (en) Video positioning method, video positioning device, computer readable medium and electronic equipment
US20150120816A1 (en) Tracking use of content of an online library
US9335838B2 (en) Tagging of written notes captured by a smart pen
US20090175599A1 (en) Digital Life Recorder with Selective Playback of Digital Video
JP2006155384A (en) Video comment input/display method and device, program, and storage medium with program stored
CN111629267B (en) Audio labeling method, device, equipment and computer readable storage medium
US20150116282A1 (en) Organizing Written Notes Using Contextual Data
CN105828179A (en) Video positioning method and device
JP2011180729A (en) Information processing apparatus, keyword registration method, and program
US20140156651A1 (en) Automatic summarizing of media content
CN104349173A (en) Video repeating method and device
CN110970011A (en) Picture processing method, device and equipment and computer readable storage medium
CN112601129B (en) Video interaction system, method and receiving terminal
CN111274449A (en) Video playing method and device, electronic equipment and storage medium
WO2015178014A1 (en) Learning support system, learning support server, learning support method, and learning support program
KR101328270B1 (en) Annotation method and augmenting video process in video stream for smart tv contents and system thereof
CN106778449B (en) Object identification method of dynamic image and interactive film establishment method for automatically capturing target image

Legal Events

Date Code Title Description
PB01 Publication
REG Reference to a national code

Ref country code: HK

Ref legal event code: DE

Ref document number: 40027486

Country of ref document: HK

SE01 Entry into force of request for substantive examination
GR01 Patent grant