CN111629267A - Audio labeling method, device, equipment and computer readable storage medium - Google Patents

Audio labeling method, device, equipment and computer readable storage medium

Info

Publication number
CN111629267A
Authority
CN
China
Prior art keywords
audio
video
annotation
identity information
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010371102.XA
Other languages
Chinese (zh)
Other versions
CN111629267B (en)
Inventor
蒋亚雄
刘洪
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202010371102.XA priority Critical patent/CN111629267B/en
Publication of CN111629267A publication Critical patent/CN111629267A/en
Application granted granted Critical
Publication of CN111629267B publication Critical patent/CN111629267B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/47End-user applications
    • H04N21/472End-user interface for requesting content, additional data or services; End-user interface for interacting with content, e.g. for content reservation or setting reminders, for requesting event notification, for manipulating displayed content
    • H04N21/47217End-user interface for requesting content, additional data or services; End-user interface for interacting with content, e.g. for content reservation or setting reminders, for requesting event notification, for manipulating displayed content for controlling playback functions for recorded or on-demand content, e.g. using progress bars, mode or play-point indicators or bookmarks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/783Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G06F16/7837Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using objects detected or recognised in the video content
    • G06F16/784Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using objects detected or recognised in the video content the detected or recognised objects being people
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00Speaker identification or verification techniques
    • G10L17/04Training, enrolment or model building
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/80Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
    • H04N21/85Assembly of content; Generation of multimedia applications
    • H04N21/854Content authoring
    • H04N21/8543Content authoring using a description language, e.g. Multimedia and Hypermedia information coding Expert Group [MHEG], eXtensible Markup Language [XML]

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Library & Information Science (AREA)
  • Databases & Information Systems (AREA)
  • Computer Security & Cryptography (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

The embodiments of the present application disclose an audio annotation method, apparatus, device, and computer-readable storage medium. The method comprises the following steps: displaying the identity information of an annotation object, a video containing the annotation object, and an audio graphic corresponding to the audio playing progress of the video; playing the video and the audio graphic synchronously, and detecting, in the audio graphic, an audio time period that matches the identity information of the annotation object; and generating an audio annotation corpus according to the audio corresponding to the audio time period and the identity information of the annotation object. The technical solution of the embodiments of the present application can greatly improve the efficiency of speech annotation.

Description

Audio labeling method, device, equipment and computer readable storage medium
Technical Field
The present application relates to the field of data processing technologies, and in particular, to an audio annotation method, apparatus, device, and computer-readable storage medium.
Background
Machine learning is the core of artificial intelligence, the fundamental approach by which computers acquire intelligence, and it is applied across all fields of artificial intelligence. For example, in the field of voiceprint recognition, the identity of a speaker is recognized automatically through machine learning, making the speaker identification process more intelligent.
To obtain a machine learning model that automatically identifies a speaker, a large amount of annotated corpora must be used to train the model until it achieves a satisfactory speaker recognition effect. Therefore, how to obtain audio annotation corpora conveniently and quickly is a technical problem to be solved in the prior art.
Disclosure of Invention
In order to solve the above technical problem, embodiments of the present application provide an audio tagging method, apparatus, device and computer-readable storage medium.
The technical solutions adopted by the present application are as follows:
An audio annotation method, comprising: displaying the identity information of an annotation object, a video containing the annotation object, and an audio graphic corresponding to the audio playing progress of the video; playing the video and the audio graphic synchronously, and detecting, in the audio graphic, an audio time period that matches the identity information of the annotation object; and generating an audio annotation corpus according to the audio corresponding to the audio time period and the identity information of the annotation object.
An audio annotation device, comprising: an information display module, configured to display the identity information of an annotation object, a video containing the annotation object, and an audio graphic corresponding to the audio playing progress of the video; a playback detection module, configured to play the video and the audio graphic synchronously and to detect, in the audio graphic, an audio time period that matches the identity information of the annotation object; and an information generation module, configured to generate an audio annotation corpus according to the audio corresponding to the audio time period and the identity information of the annotation object.
An audio annotation device comprising a processor and a memory, the memory having stored thereon computer readable instructions which, when executed by the processor, implement an audio annotation method as described above.
A computer readable storage medium having stored thereon computer readable instructions which, when executed by a processor of a computer, cause the computer to perform the audio annotation method as described above.
In the above technical solution, the identity information of the annotation object, the video containing the annotation object, and the audio graphic corresponding to the audio playing progress of the video are displayed, so that during speech annotation the audio corresponding to the annotation object can be annotated quickly by combining the identity information of the annotation object, the video containing the annotation object, and the audio information of the video. An audio time period matching the identity information of the annotation object is detected in the audio graphic, and an audio annotation corpus is generated from the audio corresponding to the detected time period and the identity information of the annotation object, so that audio annotation corpora can be generated intelligently and their acquisition efficiency is greatly improved.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the application.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present application and together with the description, serve to explain the principles of the application. It is obvious that the drawings in the following description are only some embodiments of the application, and that for a person skilled in the art, other drawings can be derived from them without inventive effort. In the drawings:
FIG. 1 is a schematic illustration of an implementation environment to which the present application relates;
FIG. 2 is a flow diagram illustrating a method of audio annotation in accordance with an exemplary embodiment;
FIG. 3 is a flow chart of step 110 in one exemplary embodiment of the embodiment shown in FIG. 2;
FIG. 4 is a schematic diagram illustrating an audio annotation interface in accordance with an exemplary embodiment;
FIG. 5 is a schematic diagram illustrating an audio annotation interface in accordance with another exemplary embodiment;
FIG. 6 is a flowchart of an exemplary embodiment of step 111 in the embodiment shown in FIG. 3;
FIG. 7 is a flow diagram illustrating the display of data by an audio annotation interface in accordance with an exemplary embodiment;
FIG. 8 is a flowchart of an exemplary embodiment of step 130 in the embodiment shown in FIG. 2;
FIG. 9 is a schematic diagram illustrating an audio annotation interface in accordance with another exemplary embodiment;
FIG. 10 is a flow chart illustrating a method of audio annotation in accordance with another exemplary embodiment;
FIG. 11 is a schematic diagram illustrating an audio annotation process according to an exemplary embodiment;
FIG. 12 is a block diagram illustrating an audio annotation device in accordance with an exemplary embodiment;
FIG. 13 is a block diagram illustrating an audio annotation device in accordance with an exemplary embodiment.
Detailed Description
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The embodiments described in the following exemplary embodiments do not represent all embodiments consistent with the present application. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present application, as detailed in the appended claims.
Referring to FIG. 1, FIG. 1 is an exemplary diagram of an implementation environment related to the present application.
As shown in fig. 1, the implementation environment includes a terminal 100 and a server 200, and the terminal 100 and the server 200 establish a communication connection in advance for data interaction. The terminal 100 runs an audio annotation program, which provides an audio annotation interface for displaying information required for audio annotation. The server 200 is configured to provide data services for the audio annotation program running in the terminal 100, for example, the terminal 100 needs to obtain information required for audio annotation from the server 200 for display, and the audio annotation corpus generated by the terminal 100 is stored in the server 200.
It should be noted that the terminal 100 may be a terminal device such as a computer, a notebook computer, or a tablet, and the server 200 may be an independent physical server, a server cluster or distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as cloud databases, cloud computing, cloud functions, cloud storage, big data, and artificial intelligence platforms.
Fig. 2 is a flow chart illustrating an audio annotation method according to an exemplary embodiment, which may be applied to the terminal 100 in the implementation environment shown in fig. 1. As shown in fig. 2, in an exemplary embodiment, the audio annotation method at least includes the following steps:
step 110, displaying the identity information of the annotation object, the video containing the annotation object and the audio graph corresponding to the audio playing progress of the video.
As mentioned above, in order to obtain the machine learning model for automatically recognizing the identity of the speaker, a large amount of audio tagging corpora are required to be used to train the machine learning model until the machine learning model has a better speaker recognition effect.
Audio annotation corpora are usually produced manually by annotators. For example, after obtaining an audio file to be annotated, an annotator identifies the speaker by playing the audio file and then labels the audio file with the speaker's identity, thereby obtaining an audio annotation corpus. However, distinguishing the speaker's identity by ear alone is very difficult and the annotation accuracy is not high, so the acquisition efficiency of audio annotation corpora is very low.
In order to solve the above problems, the present embodiment provides an audio tagging method, which performs audio tagging quickly by combining multidimensional information, has high tagging accuracy, and is suitable for acquiring a large amount of audio tagging corpora.
In this embodiment, the annotation object refers to a speaker to be identified, and the identity information of the annotation object is used to describe the identity of the speaker, for example, the identity information includes information such as the name and image of the speaker. The video containing the annotation object refers to an image picture containing the annotation object in the video, and the audio graph is used for describing the audio playing progress of the video.
It should be understood that a video file usually contains video data and audio data, and the audio data is played synchronously with the video data, so that the audio playing progress displayed in the audio graph is synchronous with the video playing progress in the video playing process.
The audio graphic is displayed according to the audio data. For example, it may be displayed as a progress bar whose total length corresponds to the total duration of the video and whose progress is continuously updated while the video plays. The audio graphic may also be displayed as an audio waveform, which shows not only the audio playing progress of the video but also the fluctuation of the audio sampling rate; this is not limited here.
As described above, the audio annotation corpus of the annotation object can be generated based on the identity information of the annotation object and the voice data corresponding to the object to be annotated.
In one embodiment, the identity information of the annotation object, the video containing the annotation object, and the audio graphic corresponding to the audio playing progress of the video are displayed in different areas of an audio annotation interface, and the audio annotation interface is a visual display interface, so that audio annotation can be performed quickly based on the identity information of the annotation object, the video, and the audio graphic displayed on the interface. For example, according to the identity information, the video and the audio image of the annotation object displayed on the audio annotation interface, the annotation personnel can distinguish the voice of the annotation object from multiple directions, and the speed and the accuracy of audio annotation are improved.
Step 130, synchronously playing the video containing the annotation object and the audio graph corresponding to the audio playing progress of the video, and detecting the audio time interval matched with the identity information of the annotation object in the audio graph.
As mentioned above, the audio playing progress shown by the audio graphic stays synchronized with the playing progress of the video containing the annotation object: at any playing moment of the video, the audio playing progress displayed in the audio graphic corresponds to that same moment.
Therefore, while the video plays, the audio graphic continuously updates its real-time playing progress so that it always corresponds to the same playing moment as the video, ensuring that the audio graphic and the video are played synchronously. An interval of playing-progress change displayed in the audio graphic is referred to as an audio time period in the audio graphic.
And because the audio graph is displayed according to the audio data corresponding to the video, and the audio time period in the audio graph corresponds to a segment of data in the audio data, the audio time period matched with the identity information of the annotation object in the audio graph corresponds to the voice data of the annotation object contained in the audio data corresponding to the video.
Therefore, in the playing process of the video, the audio time interval corresponding to the voice data of the annotation object can be positioned in the audio graph by identifying the voice data corresponding to the annotation object contained in the audio data played synchronously.
Illustratively, during video playback, the degree of similarity between the audio being played and the voice features of the annotation object is identified, and an audio time period in which audio similar to those voice features is played continuously is determined as the audio time period matching the identity information of the annotation object.
The voice features of the annotation object may include information such as timbre and pitch, and this information can be obtained from the displayed identity information of the annotation object and the video containing the annotation object. For example, when the identity information of the annotation object is known, a video frame showing the annotation object as the speaker is obtained from the video in advance, and the timbre and pitch of the annotation object are determined from the audio played in synchronization with that frame.
In another embodiment, according to the displayed image of the annotation object, the matching degree between the playing picture and the image of the annotation object is identified in the playing process of the video, and the audio time interval corresponding to the video picture continuously displaying the image matching with the image of the annotation object is determined as the audio time interval matching with the identity information of the annotation object.
In other embodiments, the audio time period matching the identity information of the annotation object is determined through manual operation by the annotator. Guided by the displayed identity information and the video containing the annotation object, the annotator first plays the video to learn the voice characteristics of the annotation object, such as timbre and pitch, then plays the video again and determines the audio time period corresponding to the annotation object's voice data according to the characteristics learned in advance.
For example, when starting to play and ending to play the voice of the tagged object, the tagging personnel inputs a preset audio selection instruction through a mouse or a keyboard and other equipment, the audio graph displays a corresponding starting playing progress point and an ending playing progress point when the audio selection instruction is input, and then the audio time interval between the starting playing progress point and the ending playing progress point can be determined as the audio time interval matched with the identity information of the tagged object.
It should be noted that the above manner of detecting the audio time interval matched with the identity information of the annotation object is only an example, and does not indicate that the embodiment limits the manner of detecting the audio time interval.
Moreover, because the voice characteristics of the annotation object are learned in advance from the displayed video, audio graphic, and identity information, the audio time period matching the identity information of the annotation object can be located accurately in the audio graphic as the video plays. The voice data of the annotation object can therefore be located accurately within the audio data corresponding to the video, yielding an accurate annotation corpus.
Step 150, generating an audio annotation corpus according to the audio corresponding to the audio time period and the identity information of the annotation object.
As described above, the audio time interval in the audio graph matching the identity information of the tagged object corresponds to the voice data of the tagged object, so that the voice data of the tagged object can be tagged according to the identity information of the tagged object, and then the audio tagging corpus corresponding to the tagged object is generated.
Illustratively, based on an audio time period in the audio graphic that matches the identity information of the annotation object, a corresponding segment of audio data can be located within the audio data of the video; this segment is the voice data of the annotation object. The located audio data and the identity information of the annotation object are then stored in association, that is, the voice data of the annotation object is annotated, yielding the audio annotation corpus.
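As a non-limiting illustrative sketch (not part of the claimed embodiments), this association step can be expressed in TypeScript as follows; the interface names, the use of a Float32Array sample buffer, and the function signature are assumptions introduced only for illustration.

```typescript
// Sketch only: associate the located audio segment with the annotation
// object's identity information. Names and types are assumptions.
interface AudioPeriod {
  startTime: number; // seconds, start of the matched period in the audio graphic
  endTime: number;   // seconds, end of the matched period
}

interface AnnotationCorpus {
  objectId: string;    // annotation object identifier
  objectName: string;  // identity information (e.g. name)
  startTime: number;
  endTime: number;
  audio: Float32Array; // voice data cut from the video's audio track
}

function buildCorpus(
  fullAudio: Float32Array, // decoded audio samples of the whole video
  sampleRate: number,
  period: AudioPeriod,
  objectId: string,
  objectName: string
): AnnotationCorpus {
  const start = Math.floor(period.startTime * sampleRate);
  const end = Math.floor(period.endTime * sampleRate);
  return {
    objectId,
    objectName,
    startTime: period.startTime,
    endTime: period.endTime,
    audio: fullAudio.slice(start, end), // voice data of the annotation object
  };
}
```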
In one embodiment, the audio data corresponding to the video contains a plurality of sections of voice data of the tagged object, so that in the process of playing the video, a plurality of audio time periods matched with the identity information of the tagged object are detected in the audio graph, and each audio tagged corpus is generated according to each audio time period and the identity information of the tagged object. That is, the embodiment can conveniently and quickly obtain a plurality of audio annotation corpora related to the annotation object.
In another embodiment, the video includes a plurality of tagged objects, and the identity information of each tagged object is correspondingly displayed, in the playing process of the video, an audio time period matched with the identity information of each tagged object is detected in the audio graph, and an audio tagged corpus is generated according to the detected audio time period and the identity information of the corresponding tagged object. Therefore, the embodiment can also conveniently and quickly obtain a plurality of audio labeling linguistic data, and the labeling objects corresponding to each audio labeling linguistic data may be different.
Therefore, in this embodiment, the identity information of the annotation object, the video containing the annotation object, and the audio graphic corresponding to the audio playing progress of the video are displayed; while the video and the audio graphic are played synchronously, audio time periods matching the identity information of the annotation object are detected in the audio graphic, and audio annotation corpora are generated from the detected time periods and the identity information. In this way, during audio annotation, the audio data corresponding to different annotation objects, or multiple audio segments of one annotation object, can be identified by combining information from multiple sources, and audio annotation corpora can be generated automatically, which greatly improves audio annotation efficiency and suits the acquisition of large numbers of audio annotation corpora.
Fig. 3 is a flow chart of step 110 in an exemplary embodiment of the embodiment shown in fig. 2. As shown in FIG. 3, in an exemplary embodiment, step 110 includes at least the following steps:
Step 111, respectively acquiring the identity information of the annotation object, the video data containing the annotation object, and the audio sampling rate data corresponding to the video.
As mentioned above, the identity information of the annotation object may include information such as a name and an image, and the image of the annotation object includes the face feature of the annotation object, so that the identity of the annotation object displayed in the video can be accurately identified according to the image of the annotation object.
For example, the image of the annotation object may be in an image format such as JPEG (Joint Photographic Experts Group) or PNG (Portable Network Graphics), which is not limited here.
The video data containing the annotation object may be in a video format such as MP4 (a set of compression coding standards for audio and video information) or MPEG (Moving Picture Experts Group), which is likewise not limited here.
The audio sampling rate data corresponding to the video data may be extracted from the video data by FFmpeg (a set of open-source computer programs that can be used to record, convert, and stream digital audio and video).
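As a hedged illustration of this extraction step, the sketch below invokes the FFmpeg command-line tool from Node.js; the choice of a mono 16 kHz WAV output is an assumption and is not prescribed by this embodiment.

```typescript
// Sketch: extract the audio track from a video file with FFmpeg so that
// sampling-rate data can be derived for the waveform. The output format
// (mono 16 kHz WAV) is an assumption, not mandated by the method.
import { execFile } from "node:child_process";

function extractAudio(videoPath: string, wavPath: string): Promise<void> {
  return new Promise((resolve, reject) => {
    execFile(
      "ffmpeg",
      ["-y", "-i", videoPath, "-vn", "-ac", "1", "-ar", "16000", wavPath],
      (err) => (err ? reject(err) : resolve())
    );
  });
}
```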
In one embodiment, the identity information of the annotation object, the video data containing the annotation object, and the audio sampling rate data corresponding to the video are stored locally, for example, before audio annotation, the data are obtained by copying, downloading, and the like in advance, and are stored locally, and can be obtained locally when audio annotation is performed.
In another embodiment, the identification information of the annotation object, the video data containing the annotation object, and the audio sampling rate data corresponding to the video are stored in the server, so that the data need to be queried from the server.
It should be noted that, these data may be obtained by query from the server or locally, simultaneously, or by query separately, which is not limited herein.
Step 113, displaying the video according to the video data, displaying the identity information of the annotation object, and drawing an audio graphic corresponding to the playing progress of the video according to the audio sampling rate data.
To display the video data and the identity information of the annotation object, the identity information and the video data may each be configured as HTML (HyperText Markup Language, a syntax for marking how web page information is displayed, among other characteristics) tags matching their respective data types, and the identity information of the annotation object and the video containing the annotation object are then displayed according to the configured HTML tags.
For example, the name of the annotation object may be configured as a <label> tag, the image of the annotation object as an <img> tag, and the video data as a <video> tag, and the corresponding data are displayed in the annotation interface through these HTML tags.
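The following is a minimal illustrative sketch of such tag configuration in TypeScript; the element IDs and the shape of the query result are assumptions.

```typescript
// Sketch: bind queried data to HTML tags matching each data type.
// Element IDs and result field names are assumptions for illustration.
interface QueryResult {
  name: string;     // name of the annotation object
  imageUrl: string; // image of the annotation object
  videoUrl: string; // video containing the annotation object
}

function byId<T extends HTMLElement>(id: string): T {
  const el = document.getElementById(id);
  if (!el) throw new Error(`missing element: ${id}`);
  return el as T;
}

function renderAnnotationInfo(result: QueryResult): void {
  byId<HTMLLabelElement>("object-name").textContent = result.name; // <label>
  byId<HTMLImageElement>("object-image").src = result.imageUrl;    // <img>
  byId<HTMLVideoElement>("object-video").src = result.videoUrl;    // <video>
}
```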
The audio graphic may be drawn by calling the Canvas API (canvas application programming interface). The Canvas API is a technique for generating and manipulating images in real time within a web page; through the HTML5 <canvas> element, the audio graphic is drawn on the audio annotation interface using JavaScript (a lightweight, interpreted or just-in-time-compiled programming language with first-class functions).
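A minimal sketch of such waveform drawing is given below; it assumes the audio sampling-rate data has been normalized to samples in [-1, 1] and is illustrative only, not the claimed implementation.

```typescript
// Sketch: draw an audio waveform on a <canvas> from audio sample data.
// The normalization of samples to [-1, 1] is an assumption.
function drawWaveform(canvas: HTMLCanvasElement, samples: Float32Array): void {
  const ctx = canvas.getContext("2d");
  if (!ctx) return;
  const { width, height } = canvas;
  ctx.clearRect(0, 0, width, height);
  ctx.beginPath();
  const step = Math.max(1, Math.floor(samples.length / width));
  for (let x = 0; x < width; x++) {
    const sample = samples[Math.min(x * step, samples.length - 1)] ?? 0;
    const y = height / 2 - sample * (height / 2); // map [-1, 1] to canvas height
    if (x === 0) ctx.moveTo(x, y);
    else ctx.lineTo(x, y);
  }
  ctx.stroke();
}
```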
FIG. 4 is a schematic diagram illustrating an audio annotation interface in accordance with an exemplary embodiment. As shown in fig. 4, the audio annotation interface is divided into several interface areas, and the image and name of the annotation object and the video containing the annotation object can be displayed in the corresponding interface areas according to the configured HTML tag, and the audio graphics can be drawn in the corresponding interface areas according to the audio sampling rate data corresponding to the video.
It should be understood that the audio graph drawn based on the audio sampling rate data is an audio waveform graph, the audio waveform graph reflects the fluctuation condition of the audio sampling rate, and an experienced annotator can identify the audio time interval matched with the identity information of the annotation object according to the fluctuation condition, so that the efficiency of annotating the corpus by the annotator is improved.
In other embodiments, the audio sampling rate data is obtained by sampling the audio of the annotation object from the video data, so that only the audio fluctuation of the annotation object is displayed in the audio graph.
For example, in the audio annotation interface shown in fig. 5, the audio waveform graph drawn according to the audio sampling rate data only shows the fluctuation condition of the audio sampling rate of the annotation object, and as the video is played, the cursor used for identifying the real-time playing progress point in the audio graph will also change synchronously. According to the audio annotation interface shown in fig. 5, the annotation personnel can identify the voice data of the annotation object more easily, and the efficiency and accuracy of audio annotation can be further improved.
FIG. 6 is a flowchart of an exemplary embodiment of step 111 in the embodiment shown in FIG. 3. As shown in fig. 6, step 111 may include the steps of:
Step 210, detecting an input annotation object identifier;
Step 230, initiating a data query request to the server according to the detected annotation object identifier;
Step 250, receiving a query result returned by the server according to the data query request, the query result containing the video data, identity information, and audio sampling rate data corresponding to the annotation object identifier.
It should be noted that, in this embodiment, the identity information of the annotation object, the video data including the annotation object, and the audio sampling rate data corresponding to the video are all stored in association in the server based on the annotation object identifier, so that these data need to be queried from the server according to the annotation object identifier to display these data.
The label object identifier may be an ID (identity identifier) of the label object, or may be information such as a name of the label object, and is not limited herein.
In order to determine the data query target, the annotation interface detects the input annotation object identifier. Illustratively, a data query area is arranged in the audio annotation interface, and an annotator can enter the identifier of the annotation object to be annotated in this area, thereby initiating a data query request to the server. According to the received annotation object identifier, the server can search the database for the identity information, video data, and audio sampling rate data stored in association with that identifier, and return these data to the audio annotation interface for display.
In one embodiment, to reduce redundant information in the database of the server, the identity information of the annotation object, the video data containing the annotation object, and the audio sampling rate data corresponding to the video may be structurally stored in the server through two data tables.
Illustratively, one of the data tables is used to store basic information of the annotation object, such as the annotation object ID, name and image shown in table 1 below.
| ID | Name | Image   |
|----|------|---------|
| 1  | ***  | ***.jpg |
| 2  | ***  | ***.jpg |

Table 1
Another data table is used to store the video data containing the annotation object and the audio sampling rate data corresponding to the video, and store the corresponding relationship between the video and the annotation object, as shown in the following table 2:
| ID | Video data | Audio sample rate data |
|----|------------|------------------------|
| 1  | ***.mp4    | ***                    |
| 2  | ***.mp4    | ***                    |
| 1  | ***.mp4    | ***                    |
| 1  | ***.mp4    | ***                    |

Table 2
As can be seen from table 2, at least one piece of video data can be stored in association with an ID of one annotation object.
The server side inquires the data related to the labeled object in each data table through table connection according to the ID of the labeled object, and can obtain the result shown in table 3:
| ID | Name | Image   | Video data | Audio sample rate data |
|----|------|---------|------------|------------------------|
| 1  | ***  | ***.jpg | ***.mp4    | ***                    |
| 1  | ***  | ***.jpg | ***.mp4    | ***                    |
| 1  | ***  | ***.jpg | ***.mp4    | ***                    |
| 2  | ***  | ***.jpg | ***.mp4    | ***                    |

Table 3
Therefore, the server side returns the name and the image of the annotation object, at least one piece of video data containing the annotation object and the audio sampling rate data corresponding to each piece of video data according to the ID of the annotation object contained in the query request.
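For illustration only, a server-side lookup of this kind could be expressed as the following join query (embedded in TypeScript); the table and column names are assumptions chosen to mirror Tables 1-3.

```typescript
// Sketch of the server-side lookup corresponding to Tables 1-3.
// Table and column names are assumptions chosen to match the illustration.
const QUERY_BY_OBJECT_ID = `
  SELECT o.id, o.name, o.image, v.video_data, v.audio_sample_rate_data
  FROM annotation_object AS o
  JOIN object_video AS v ON v.object_id = o.id
  WHERE o.id = ?
`;
// Executed with the annotation object ID from the query request, this returns
// one row per associated video, i.e. the shape shown in Table 3.
```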
In another exemplary embodiment, in order to ensure the security of the database, the server is further provided with a database access right, so that the server needs to query relevant data according to the account information for performing audio annotation logged in the audio annotation interface.
And the audio annotation interface generates a data query request according to the input annotation object identifier and the logged account information for audio annotation, and sends the data query request to the server, so that the server returns the relevant data stored in the database in association with the annotation object identifier to the audio annotation interface after verifying that the account information has the database access right. And the audio annotation interface can display the related data according to the returned related data.
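As an illustrative sketch of such a query request, assuming a hypothetical endpoint path and a bearer-token representation of the logged-in annotation account:

```typescript
// Sketch: the terminal's data-query request, carrying the logged-in annotation
// account so the server can check database access rights. The endpoint path,
// field names, and token handling are assumptions.
async function queryAnnotationData(objectId: string, accountToken: string) {
  const response = await fetch("/api/annotation/query", {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${accountToken}`, // account information for audio annotation
    },
    body: JSON.stringify({ objectId }),
  });
  if (!response.ok) {
    throw new Error(`query failed: ${response.status}`);
  }
  return response.json(); // identity info, video data, and audio sampling-rate data
}
```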
For example, based on the data query process provided in this embodiment, a flowchart of displaying related data by the audio annotation interface is shown in fig. 7. The server side inquires all the related data to be subjected to audio annotation from each data table stored in the database through table connection, and inquires the data related to the annotation object according to the annotation object identification contained in the data inquiry request, wherein the data comprises the name and the image of the annotation object, the video data containing the annotation object and the audio sampling rate data extracted from the video data. And the labeling interface configures corresponding HTML tags according to the data, executes corresponding display according to the configured HTML and draws audio graphics according to the audio sampling rate data. The related information displayed in the audio annotation interface can provide a way for annotating personnel to reliably identify the voice of the annotation object, so that the efficiency of annotating personnel for audio annotation is greatly improved.
Fig. 8 is a flowchart of an exemplary embodiment of step 130 in the embodiment of fig. 2. As shown in fig. 8, in an exemplary embodiment, step 130 may include the steps of:
step 131, when a video playing instruction is detected, synchronously playing video and audio graphics;
it should be noted that the video playing command is a preset command for instructing playing or pausing of video and audio graphics.
The video playing instruction may be input through a mouse or a keyboard, or input through touching an audio tagging interface on which the identity information of the video, the audio graph, and the tagging object is displayed, which is not limited in this embodiment. For example, when the annotator clicks a video play/pause button displayed in the audio annotation interface, it is considered that a video play instruction is detected.
Step 133, when an audio selection instruction is detected, locating, in the audio graphic, an audio time period that matches the identity information of the annotation object.
The audio selection instruction is also a preset instruction and is used for positioning the starting position and the ending position of the audio time interval matched with the identity information of the marked object in the audio graph. When the number of the labeled objects is multiple, different audio selection instructions can be set for each labeled object respectively so as to distinguish different labeled objects according to the audio selection instructions, ensure that the identity of the labeled objects in the generated audio labeling corpus is accurate, and further ensure the accuracy of the audio labeling corpus.
The audio selection instruction may also be input through a device such as a mouse or a keyboard, or input by touching a tagging interface on which the identity information of the video, the audio graphics, and the tagging object is displayed, which is not limited in this embodiment.
For example, in a specific embodiment, in the playing process of the video, when the annotating person recognizes that the currently played audio is the voice of the annotation object, the annotating person inputs an audio selection instruction by pressing an enter key of the keyboard, and presses the enter key again to input the audio selection instruction when the voice of the annotation object finishes playing, so that the start position and the end position of the audio time period can be determined based on the audio selection instructions input twice before and after, and the audio time period matched with the identity information of the annotation object is obtained.
In order to locate the audio time interval matching the identity information of the annotation object in the audio graph, the start position and the end position of the audio time interval need to be located in the audio graph according to the audio selection instruction, so that the audio time interval between the start position and the end position is determined as the audio time interval matching the identity information of the annotation object.
In one embodiment, when the audio selection instruction is detected an odd number of times, the real-time playing progress point displayed in the audio graphic is determined as a start position; when the audio selection instruction is detected an even number of times, the real-time playing progress point displayed in the audio graphic is determined as the end position corresponding to the start position determined at the previous detection of the audio selection instruction.
For example, if the audio data corresponding to the video only contains a segment of voice data of the annotation object, the audio selection instruction input twice will be detected, the real-time playing progress point displayed in the audio graph when the audio selection instruction is input for the first time is determined as the starting position, correspondingly, the real-time playing progress point displayed in the audio graph when the audio selection instruction is input for the second time is determined as the ending position, and the time interval between the starting position and the ending position is matched with the identity information of the annotation object.
Similarly, if the audio data corresponding to the video contains a plurality of sections of voice data of the annotation object, the real-time playing progress point displayed in the audio graph when the audio selection instruction is input for the odd number of times is determined as the starting position, and the real-time playing progress point displayed in the audio graph when the audio selection instruction is input for the even number of times is determined as the ending position corresponding to the starting position determined when the audio selection instruction is input for the previous time. Therefore, for each determined starting position, a corresponding ending position is obtained, and then a plurality of audio time periods matched with the identity information of the labeling object are obtained.
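A minimal sketch of this odd/even selection logic follows, assuming the Enter key serves as the audio selection instruction (as in the example above) and that an accessor for the real-time playing progress point is available; both are assumptions for illustration.

```typescript
// Sketch: locate audio periods from the audio-selection instruction. An
// odd-numbered keypress marks a start position; the following even-numbered
// keypress marks the matching end position.
interface Period { start: number; end: number; }

function watchSelections(
  getProgress: () => number,    // real-time playing progress point (seconds)
  onPeriod: (p: Period) => void // called once a start/end pair is complete
): void {
  let pendingStart: number | null = null;
  document.addEventListener("keydown", (e) => {
    if (e.key !== "Enter") return;
    const point = getProgress();
    if (pendingStart === null) {
      pendingStart = point; // odd-numbered instruction: start position
    } else {
      onPeriod({ start: pendingStart, end: point }); // even-numbered: end position
      pendingStart = null;
    }
  });
}
```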
If the audio data corresponding to the video contains the voice data of a plurality of labeled objects, when an input audio selection instruction is detected, the corresponding labeled objects are determined according to the audio selection instruction, and further, the identity information of the labeled objects corresponding to each audio time period is correspondingly obtained while a plurality of audio time periods are obtained.
In another embodiment, after the audio time interval matched with the identity information of the labeled object is located and obtained in the audio graph according to the audio selection instruction, the located audio time interval can be finely adjusted, so that the finally determined audio time interval is more accurate.
As shown in fig. 9, in the audio graph displayed on the audio labeling interface, the start position and the end position of the audio time interval are marked by using mark lines, and a marker can pull the mark lines at two ends of the audio time interval by a mouse to perform fine adjustment of the audio time interval.
Therefore, in the embodiment, the input audio selection instruction is detected in the video playing process, and the audio time period matched with the identity information of the labeled object is positioned in the audio graph according to the detected audio selection instruction, so that at least one section of voice data of at least one labeled object can be accurately positioned in the audio data corresponding to the video, and the requirement on the acquisition scene of a large number of audio labeled corpora is met.
FIG. 10 is a flowchart illustrating an audio annotation process according to another exemplary embodiment. As shown in fig. 10, in an exemplary embodiment, the audio annotation method further includes the following steps:
step 310, intercepting a target image containing a speaker picture from a pre-collected video;
step 330, using the speaker as the annotation object contained in the video, and using the target image as the identity information of the annotation object.
Considering that the acquisition of the audio annotation corpus is also related to the collection process of the video data, if the collection efficiency of the video data can be improved, the acquisition efficiency of the audio annotation corpus can be improved to a certain extent, so that the embodiment improves the collection process of the video data.
In order to obtain a large amount of videos quickly, a web crawler (also called a web spider or a web robot, which is a program or script for automatically capturing web information according to a certain rule) is usually used to automatically collect video data in a network, but the tagging objects contained in the video data need to be manually added by a person responsible for video collection, which is very inefficient.
When recording video, a photographer usually aims the camera at a target person when the target person speaks, so that in the process of playing recorded video, a speaker picture and a speaker audio are played synchronously, and audiences can obtain better watching experience.
Therefore, the embodiment captures the target image containing the speaker image from the pre-collected video, takes the speaker as the annotation object contained in the video, and takes the target image as the identity information of the annotation object, so that the identity information of the annotation object can be accurately obtained.
In one embodiment, face recognition is carried out on each video collected in advance, and a video containing face features is determined so as to intercept a target image from the video containing the face features. Specifically, face recognition is performed on each collected video in advance, and if it is determined that the video does not contain face features, it indicates that a labeling object cannot be determined for the video, and the videos are not used as videos to be subjected to audio labeling. For videos containing face features, target images containing speaker pictures can be intercepted from the videos, and the videos are applied to audio annotation carried out subsequently.
Therefore, the embodiment can automatically generate the identity information of the annotation object contained in the video aiming at the collected video, the manual addition of the personnel responsible for video collection is not needed, and the audio annotation efficiency is integrally improved. The embodiment can also quickly filter the video without the speaker picture, and can ensure that the subsequent video data for audio annotation is effective.
In order to facilitate understanding of the technical spirit of the present application, the following describes the audio annotation method proposed in the above embodiment in a specific application scenario.
An exemplary audio annotation process is shown in fig. 11, and the audio annotation process includes three stages of displaying a data source to be annotated, annotating the data, and obtaining an annotation result. The data source to be labeled comprises the image and name of the labeled object, and the video and audio sampling data containing the labeled object. As shown in fig. 4, fig. 5, and fig. 9, these data sources to be labeled are all displayed in the audio labeling interface of the terminal, and the audio sampling rate is specifically displayed as an audio waveform diagram. The annotation personnel determines the audio time interval corresponding to the annotation object in the playing process of the video, the terminal divides the audio segment in the audio oscillogram according to the audio time interval determined by the annotation personnel, and then generates an annotation result according to the divided audio segment and the name of the annotation object. It should be understood that segmenting the audio segment is a process of obtaining corresponding audio data according to the determined audio period.
Specifically, an audio annotation account is logged in an audio annotation interface of the terminal, an annotation worker operates the audio annotation interface to enable the terminal to inquire a data source to be annotated to a server, and after the server verifies that the audio annotation account has inquiry authority, the server returns the relevant data source to the terminal.
After receiving the related data sources, the terminal reads each data source in sequence according to the data sequence, displays the image, the name and the first frame image of the video containing the annotation object in the audio annotation interface, and draws an audio waveform diagram in the audio annotation interface according to the audio sampling rate data.
When the terminal receives a video playing instruction clicked by a marking person, the audio marking interface starts to play the video, and the cursor in the audio graph and the playing time of the video synchronously move. When the video is played to the audio time interval of the speaking of the marked object, the marking personnel quickly divides the audio time interval by pressing an enter key on a keyboard, the terminal obtains a starting time point and an ending time point for expressing the speaking of the object according to the divided audio time interval, and marks the name of the marked object in the time interval between the starting time point and the ending time point, so that the audio marking corpus can be generated, and the marking work of the video is completed.
The generated audio annotation corpus is an XML (eXtensible Markup Language) file; for example, the generated audio annotation corpus may be:
<Turn starttime="64.972826" endTime="94.993206" type="zhangsan">
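For illustration, one way to serialize an annotation result into the <Turn> element shown above is sketched below; the escaping helper and the function name are assumptions.

```typescript
// Sketch: serialize one annotation result into the <Turn> element shown above.
// Attribute names follow the example; the escaping helper is an assumption.
function toTurnXml(startTime: number, endTime: number, name: string): string {
  const escape = (s: string) =>
    s.replace(/&/g, "&amp;").replace(/</g, "&lt;").replace(/"/g, "&quot;");
  return `<Turn starttime="${startTime}" endTime="${endTime}" type="${escape(name)}">`;
}

// toTurnXml(64.972826, 94.993206, "zhangsan")
// -> <Turn starttime="64.972826" endTime="94.993206" type="zhangsan">
```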
If the speaker is identified and the audio annotated only by playing the audio file, an annotator can on average annotate about 10 minutes of audio per hour; with the method of the present application, an annotator can on average annotate 25-35 minutes of audio per hour, and annotation accuracy rises from 80% to 95%.
Therefore, the audio annotation method can enable annotation personnel to rapidly finish annotation work of a large number of videos, and annotation efficiency is greatly improved.
FIG. 12 is a block diagram illustrating an audio annotation device that may be adapted for use with the terminal 100 in the implementation environment shown in FIG. 1, according to an exemplary embodiment. As shown in fig. 12, the audio annotation device includes an information display module 410, a play detection module 430, and an information generation module 450.
The information display module 410 is configured to display the identity information of the annotation object, the video containing the annotation object, and the audio graph corresponding to the audio playing progress of the video. The play detection module 430 is configured to play the video and audio graphics synchronously, and detect an audio time interval matching the identity information of the annotation object in the audio graphics. The information generating module 450 is configured to generate an audio tagging corpus according to the audio corresponding to the audio time period and the identity information of the tagging object.
In another exemplary embodiment, the play detection module 430 includes an instruction detection unit and an audio positioning unit. The instruction detection unit is used for synchronously playing the video and the audio graphics when detecting the video playing instruction. The audio positioning unit is used for positioning the audio time interval matched with the identity information of the marked object in the audio graph when the audio selection instruction is detected.
In another exemplary embodiment, the audio localization unit comprises a position localization subunit and a match determination subunit. The position locating subunit is used for locating the starting position and the ending position of the audio time interval in the audio graph according to the audio selection instruction. The matching determination subunit is configured to determine an audio period between the start position and the end position as an audio period matching the identity information of the annotation object.
In another exemplary embodiment, the position location subunit comprises a start position determination subunit and an end position determination subunit. The start position determining subunit is configured to determine, as the start position, the real-time playing progress point displayed in the audio graphic when the audio selection instruction is detected an odd number of times. The end position determining subunit is configured to, when the audio selection instruction is detected an even number of times, determine the real-time play progress point displayed in the audio graphic as an end position corresponding to the start position determined when the audio selection instruction was detected last time.
In another exemplary embodiment, the information display module 410 includes a sampling rate acquisition unit and a graphic rendering unit. The sampling rate acquisition unit is used for acquiring audio sampling rate data corresponding to the video. And the graph drawing unit is used for drawing an audio graph corresponding to the audio playing progress of the video according to the audio sampling rate data.
In another exemplary embodiment, the information display module 410 further includes a data acquisition unit, a tag configuration unit, and a tag display unit. The data acquisition unit is used for acquiring the identity information of the annotation object and the video data containing the annotation object. The tag configuration unit is used for respectively configuring the identity information of the annotation object and the video data into HTML tags matched with respective data types. The tag display unit is used for displaying the identity information of the labeled object according to the HTML tag and displaying the video containing the labeled object.
In another exemplary embodiment, the data acquisition unit includes an identifier detection subunit, a request sending subunit, and a result receiving subunit. The identifier detection subunit is used for detecting an input annotation object identifier. The request sending subunit is used for initiating a data query request to the server according to the detected annotation object identifier and the account information for audio annotation. The result receiving subunit is used for receiving a query result returned by the server according to the data query request, the query result containing the video data and identity information corresponding to the annotation object identifier.
In another exemplary embodiment, the apparatus further comprises an image interception module and an information acquisition module. The image intercepting module is used for intercepting a target image containing a speaker picture from a pre-collected video. The information acquisition module is used for taking the speaker as an annotation object contained in the video and taking the target image as the identity information of the annotation object.
In another exemplary embodiment, the image intercepting module includes a face recognition unit and a feature intercepting unit. The face recognition unit is used for carrying out face recognition on each pre-collected video and determining the video containing the face features. The feature intercepting unit is used for intercepting a target image from a video containing human face features.
In another exemplary embodiment, the information generating module 450 includes an audio data acquiring unit and an associated storage unit. The audio data acquiring unit is used for acquiring the audio data corresponding to the audio period, and the associated storage unit is used for storing the audio data in association with the identity information of the annotation object to obtain the audio annotation corpus.
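A minimal sketch of the associated storage is shown below, writing one corpus record per audio period as a JSON line; the file layout and field names are assumptions for illustration.

```python
# A minimal sketch of associated storage: write one corpus record per audio
# period as a JSON line. The file layout and field names are assumptions.
import json


def store_annotation_corpus(audio_clip_path: str, identity_info: dict,
                            corpus_path: str = "audio_corpus.jsonl") -> None:
    """Store the audio data and identity information in association."""
    record = {
        "audio": audio_clip_path,   # audio data for the selected audio period
        "identity": identity_info,  # identity information of the annotation object
    }
    with open(corpus_path, "a", encoding="utf-8") as corpus:
        corpus.write(json.dumps(record, ensure_ascii=False) + "\n")
```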
It should be noted that the apparatus provided in the foregoing embodiment and the method provided in the foregoing embodiment are based on the same concept; the specific manner in which each module and unit performs its operations has been described in detail in the method embodiment and is not repeated here.
Embodiments of the present application further provide an audio annotation device, which includes a processor and a memory, wherein the memory stores computer-readable instructions, and the computer-readable instructions, when executed by the processor, implement the audio annotation method as described above.
FIG. 13 is a block diagram illustrating an audio annotation device in accordance with an exemplary embodiment.
It should be noted that the audio annotation device is merely an example adapted to the present application and should not be considered as limiting the scope of use of the application in any way. Nor should the audio annotation device be construed as depending on, or requiring, any one or more of the components of the exemplary audio annotation device illustrated in FIG. 13.
As shown in FIG. 13, in an exemplary embodiment, the audio annotation device includes a processing component 501, a memory 502, a power component 503, a multimedia component 504, an audio component 505, a sensor component 507, and a communication component 508. Not all of the above components are necessary; the audio annotation device may add or omit components according to its own functional requirements, which is not limited in this embodiment.
The processing component 501 generally controls the overall operation of the audio annotation device, such as operations associated with display, data communication, and log data processing. The processing component 501 may include one or more processors 509 to execute instructions to perform all or a portion of the steps of the above-described operations. Further, the processing component 501 may include one or more modules that facilitate interaction between the processing component 501 and other components. For example, the processing component 501 may include a multimedia module to facilitate interaction between the multimedia component 504 and the processing component 501.
The memory 502 is configured to store various types of data to support operation at the audio annotation device, examples of which include instructions for any application or method operating on the audio annotation device. The memory 502 has one or more modules stored therein, which are configured to be executed by the one or more processors 509 to perform all or part of the steps of the audio annotation methods described in the above embodiments.
The power component 503 provides power to the various components of the audio annotation device. The power components 503 may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for the audio annotation device.
The multimedia component 504 includes a screen that provides an output interface between the audio annotation device and the user. In some embodiments, the screen may include a TP (Touch Panel) and an LCD (Liquid Crystal Display). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive an input signal from a user. The touch panel includes one or more touch sensors to sense touches, slides, and gestures on the touch panel. The touch sensor may not only sense the boundary of a touch or slide action, but also detect the duration and pressure associated with the touch or slide operation.
The audio component 505 is configured to output and/or input audio signals. For example, the audio component 505 includes a microphone configured to receive external audio signals when the audio annotation device is in an operational mode, such as a call mode, a recording mode, and a speech recognition mode. In some embodiments, audio component 505 further comprises a speaker for outputting audio signals.
The sensor component 507 includes one or more sensors for providing status assessment of various aspects of the audio annotation device. For example, the sensor component 507 can detect the on/off status of the audio annotation device, and can also detect a change in temperature of the audio annotation device.
The communication component 508 is configured to facilitate communication between the audio annotation device and other devices in a wired or wireless manner. The audio annotation device may access a wireless network based on a communication standard, such as Wi-Fi (Wireless Fidelity).
It will be appreciated that the arrangement shown in FIG. 13 is merely illustrative and that the audio annotation device may include more or fewer components than those shown in FIG. 13, or have components different from those shown in FIG. 13. Each of the components shown in FIG. 13 may be implemented in hardware, software, or a combination thereof.
Yet another aspect of the present application provides a computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements an audio annotation method as described above. The computer readable storage medium may be included in the audio annotation device described in the above embodiments, or may exist separately without being assembled into the audio annotation device.
The above description is only a preferred exemplary embodiment of the present application and is not intended to limit the embodiments of the present application. Those skilled in the art can easily make various changes and modifications within the main concept and spirit of the present application, so the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (15)

1. An audio annotation method, comprising:
displaying the identity information of the annotation object, the video containing the annotation object, and an audio graphic corresponding to the audio playing progress of the video;
synchronously playing the video and the audio graphic, and detecting, in the audio graphic, an audio period matching the identity information of the annotation object;
and generating an audio annotation corpus according to the audio corresponding to the audio period and the identity information of the annotation object.
2. The method of claim 1, wherein synchronously playing the video and the audio graphic and detecting the audio period in the audio graphic matching the identity information of the annotation object comprises:
when a video playing instruction is detected, synchronously playing the video and the audio graphic;
when an audio selection instruction is detected, locating, in the audio graphic, an audio period matching the identity information of the annotation object.
3. The method of claim 2, wherein locating the audio period in the audio graphic that matches the identity information of the annotation object comprises:
locating a start position and an end position of an audio period in the audio graphic according to the audio selection instruction;
and determining the audio period between the start position and the end position as the audio period matching the identity information of the annotation object.
4. The method of claim 3, wherein locating a start position and an end position of an audio period in the audio graphic according to the audio selection instruction comprises:
determining a real-time playing progress point displayed in the audio graphic as the start position when the audio selection instruction is detected an odd number of times;
and when the audio selection instruction is detected an even number of times, determining a real-time playing progress point displayed in the audio graphic as an end position corresponding to the start position determined when the audio selection instruction was last detected.
5. The method of claim 1, wherein displaying an audio graphic corresponding to the audio playback progress of the video comprises:
acquiring audio sampling rate data corresponding to the video;
and drawing, according to the audio sampling rate data, an audio graphic corresponding to the audio playing progress of the video.
6. The method of claim 1, wherein the audio graphic comprises an audio waveform showing audio sampling rate fluctuation and audio playing progress of the video.
7. The method of claim 1, wherein displaying the identity information of the annotation object and the video containing the annotation object comprises:
acquiring identity information of the annotation object and video data containing the annotation object;
configuring the identity information of the annotation object and the video data into HTML tags matched with their respective data types;
and displaying, according to the HTML tags, the identity information of the annotation object and the video containing the annotation object.
8. The method of claim 7, wherein acquiring the identity information of the annotation object and the video data containing the annotation object comprises:
detecting an input annotation object identifier;
initiating a data query request to a server according to the detected annotation object identifier and account information used for audio annotation;
and receiving a query result returned by the server according to the data query request, wherein the query result contains the video data and the identity information corresponding to the annotation object identifier.
9. The method of claim 1, further comprising:
intercepting a target image containing a speaker picture from a pre-collected video;
and taking the speaker as an annotation object contained in the video, and taking the target image as the identity information of the annotation object.
10. The method of claim 9, wherein capturing the target image containing the speaker's frame from a pre-collected video comprises:
carrying out face recognition on each pre-collected video to determine a video containing face features;
and intercepting the target image from the video containing the face features.
11. The method according to claim 1, wherein generating an audio annotation corpus according to the audio corresponding to the audio period and the identity information of the annotation object comprises:
acquiring audio data corresponding to the audio period;
and storing the audio data in association with the identity information of the annotation object to obtain the audio annotation corpus.
12. The method of any one of claims 1 to 11, wherein displaying the identity information of the annotation object, the video containing the annotation object, and the audio graphic corresponding to the audio playing progress of the video comprises:
and displaying an audio annotation interface, wherein the identity information of the annotation object, the video containing the annotation object, and the audio graphic are respectively displayed in different areas of the audio annotation interface.
13. An audio annotation device, comprising:
an information display module, used for displaying the identity information of the annotation object, the video containing the annotation object, and an audio graphic corresponding to the audio playing progress of the video;
a playing detection module, used for synchronously playing the video and the audio graphic, and detecting, in the audio graphic, an audio period matching the identity information of the annotation object;
and an information generation module, used for generating an audio annotation corpus according to the audio corresponding to the audio period and the identity information of the annotation object.
14. An audio annotation device, comprising:
a memory storing computer readable instructions;
a processor to read computer readable instructions stored by the memory to perform the method of any of claims 1-12.
15. A computer-readable storage medium having computer-readable instructions stored thereon, which, when executed by a processor of a computer, cause the computer to perform the method of any one of claims 1-12.
CN202010371102.XA 2020-04-30 2020-04-30 Audio labeling method, device, equipment and computer readable storage medium Active CN111629267B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010371102.XA CN111629267B (en) 2020-04-30 2020-04-30 Audio labeling method, device, equipment and computer readable storage medium

Publications (2)

Publication Number Publication Date
CN111629267A true CN111629267A (en) 2020-09-04
CN111629267B CN111629267B (en) 2023-06-09

Family

ID=72259723

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010371102.XA Active CN111629267B (en) 2020-04-30 2020-04-30 Audio labeling method, device, equipment and computer readable storage medium

Country Status (1)

Country Link
CN (1) CN111629267B (en)

Patent Citations (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6404925B1 (en) * 1999-03-11 2002-06-11 Fuji Xerox Co., Ltd. Methods and apparatuses for segmenting an audio-visual recording using image similarity searching and audio speaker recognition
JP2007101945A (en) * 2005-10-05 2007-04-19 Fujifilm Corp Apparatus, method, and program for processing video data with audio
CN101382937A (en) * 2008-07-01 2009-03-11 深圳先进技术研究院 Multimedia resource processing method based on speech recognition and on-line teaching system thereof
US20140164927A1 (en) * 2011-09-27 2014-06-12 Picsured, Inc. Talk Tags
CN102779508A (en) * 2012-03-31 2012-11-14 安徽科大讯飞信息科技股份有限公司 Speech corpus generating device and method, speech synthesizing system and method
CN107430858A (en) * 2015-03-20 2017-12-01 微软技术许可有限责任公司 The metadata of transmission mark current speaker
WO2016165346A1 (en) * 2015-09-16 2016-10-20 中兴通讯股份有限公司 Method and apparatus for storing and playing audio file
US20180174600A1 (en) * 2016-12-16 2018-06-21 Google Inc. Associating faces with voices for speaker diarization within videos
CN108337532A (en) * 2018-02-13 2018-07-27 腾讯科技(深圳)有限公司 Perform mask method, video broadcasting method, the apparatus and system of segment
CN109119063A (en) * 2018-08-31 2019-01-01 腾讯科技(深圳)有限公司 Video dubs generation method, device, equipment and storage medium
WO2020052405A1 (en) * 2018-09-10 2020-03-19 腾讯科技(深圳)有限公司 Corpus annotation set generation method and apparatus, electronic device, and storage medium
CN109660744A (en) * 2018-10-19 2019-04-19 深圳壹账通智能科技有限公司 The double recording methods of intelligence, equipment, storage medium and device based on big data
CN109361886A (en) * 2018-10-24 2019-02-19 杭州叙简科技股份有限公司 A kind of conference video recording labeling system based on sound detection
CN109815360A (en) * 2019-01-28 2019-05-28 腾讯科技(深圳)有限公司 Processing method, device and the equipment of audio data
CN110008378A (en) * 2019-01-28 2019-07-12 平安科技(深圳)有限公司 Corpus collection method, device, equipment and storage medium based on artificial intelligence
CN109814718A (en) * 2019-01-30 2019-05-28 天津大学 A kind of multi-modal information acquisition system based on Kinect V2
CN110335612A (en) * 2019-07-11 2019-10-15 招商局金融科技有限公司 Minutes generation method, device and storage medium based on speech recognition
CN110427930A (en) * 2019-07-29 2019-11-08 中国工商银行股份有限公司 Multimedia data processing method and device, electronic equipment and readable storage medium storing program for executing
CN110600010A (en) * 2019-09-20 2019-12-20 上海优扬新媒信息技术有限公司 Corpus extraction method and apparatus

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112487238A (en) * 2020-10-27 2021-03-12 百果园技术(新加坡)有限公司 Audio processing method, device, terminal and medium
CN112487238B (en) * 2020-10-27 2024-05-17 百果园技术(新加坡)有限公司 Audio processing method, device, terminal and medium
CN113096643A (en) * 2021-03-25 2021-07-09 北京百度网讯科技有限公司 Video processing method and device

Also Published As

Publication number Publication date
CN111629267B (en) 2023-06-09

Similar Documents

Publication Publication Date Title
US20200065322A1 (en) Multimedia content tags
US9560411B2 (en) Method and apparatus for generating meta data of content
US8930308B1 (en) Methods and systems of associating metadata with media
KR20190139751A (en) Method and apparatus for processing video
US20160050465A1 (en) Dynamically targeted ad augmentation in video
CN110740389B (en) Video positioning method, video positioning device, computer readable medium and electronic equipment
US9372601B2 (en) Information processing apparatus, information processing method, and program
CN111629267B (en) Audio labeling method, device, equipment and computer readable storage medium
CN105828179A (en) Video positioning method and device
CN112188267A (en) Video playing method, device and equipment and computer storage medium
CN113515997A (en) Video data processing method and device and readable storage medium
CN111741321A (en) Live broadcast control method, device, equipment and computer storage medium
Quiros et al. Covfee: an extensible web framework for continuous-time annotation of human behavior
CN114390368A (en) Live video data processing method and device, equipment and readable medium
CN106951405B (en) Data processing method and device based on typesetting engine
Miniakhmetova et al. An approach to personalized video summarization based on user preferences analysis
CN111274449A (en) Video playing method and device, electronic equipment and storage medium
KR20150112113A (en) Method for managing online lecture contents based on event processing
CN115209233B (en) Video playing method, related device and equipment
CN112601129B (en) Video interaction system, method and receiving terminal
EP3861424A1 (en) Collecting of points of interest on web-pages by eye-tracking
WO2015178014A1 (en) Learning support system, learning support server, learning support method, and learning support program
CN113438532B (en) Video processing method, video playing method, video processing device, video playing device, electronic equipment and storage medium
CN113596494B (en) Information processing method, apparatus, electronic device, storage medium, and program product
KR101328270B1 (en) Annotation method and augmenting video process in video stream for smart tv contents and system thereof

Legal Events

Date Code Title Description
PB01 Publication
REG Reference to a national code (Ref country code: HK; Ref legal event code: DE; Ref document number: 40027486)
SE01 Entry into force of request for substantive examination
GR01 Patent grant