CN107274916B - Method and device for operating audio/video file based on voiceprint information


Info

Publication number
CN107274916B
Authority
CN
China
Prior art keywords
audio
voiceprint information
contact
target
video
Prior art date
Legal status
Active
Application number
CN201710439537.1A
Other languages
Chinese (zh)
Other versions
CN107274916A (en)
Inventor
杨帆
苏腾荣
李世全
马永健
Current Assignee
Beijing Samsung Telecommunications Technology Research Co Ltd
Samsung Electronics Co Ltd
Original Assignee
Beijing Samsung Telecommunications Technology Research Co Ltd
Samsung Electronics Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Samsung Telecommunications Technology Research Co Ltd, Samsung Electronics Co Ltd filed Critical Beijing Samsung Telecommunications Technology Research Co Ltd
Priority to CN201710439537.1A priority Critical patent/CN107274916B/en
Publication of CN107274916A publication Critical patent/CN107274916A/en
Application granted granted Critical
Publication of CN107274916B publication Critical patent/CN107274916B/en

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/54 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for retrieval
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/60 Information retrieval; Database structures therefor; File system structures therefor of audio data
    • G06F16/63 Querying
    • G06F16/632 Query formulation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70 Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/783 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G06F16/7834 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using audio features
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 Speaker identification or verification
    • G10L17/22 Interactive procedures; Man-machine interfaces

Abstract

The invention discloses a method for operating audio/video files based on voiceprint information, comprising the steps of collecting voiceprint information of a sound production target and searching for audio/video files according to the voiceprint information. The invention also provides a terminal device. With the technical scheme provided by the invention, audio/video files can be classified according to the voiceprint information of a specific contact, so that when a user wants to find the audio/video files containing that contact, the user can select them directly rather than playing and checking the files one by one; files containing the voice of a specific person are thus easy to locate. Furthermore, the method can jump directly to the time node at which a given contact speaks in the audio/video and start playing there, improving the user's search efficiency.

Description

Method and device for operating audio/video file based on voiceprint information
This application is a divisional application of Chinese patent application No. 201210518118.4, entitled "Method and apparatus for operating audio/video files based on voiceprint information", filed on December 5, 2012.
Technical Field
The invention relates to the field of communication applications on mobile devices, and in particular to a method and a device for operating audio/video on a terminal device according to the voiceprint of a specific contact.
Background
The recorder or camera on existing terminal equipment makes it convenient for users to record and shoot audio/video files. With improved device performance, larger storage capacity, and a growing variety of multimedia applications, a user can easily accumulate a large number of audio/video files. However, when the user needs to find all audio/video files in which a specific contact is recorded, or to find and play a specific segment of a specific contact within a particular file, the search may fail because the specific information cannot be located quickly; the user can only play and view the files one by one to obtain the required file or segment.
In view of the foregoing, there is a need for a method and a terminal device that can quickly search and classify target audio/video files and locate the time points at which a specific contact appears in a file, so that a user can easily find files in which the voice or image of a specific person is recorded.
Disclosure of Invention
The technical problem to be solved is to enable a user to quickly search for files in which the voice or video of a specific person is recorded.
One objective of the present invention is to provide a method for operating audio/video files based on voiceprint information, comprising the following steps: collecting voiceprint information of a sound production target; and searching for audio/video files according to the voiceprint information; wherein all recorded sound in an audio/video file is divided into a plurality of voice units, each voice unit contains the voice of only one sound production target, and the time points at which each sound production target appears in the audio/video file are recorded.
Another objective of the present invention is to provide a terminal device, comprising: a voiceprint extraction module for collecting voiceprint information of a sound production target; and an execution module for searching for audio/video files according to the voiceprint information; wherein all recorded sound in an audio/video file is divided into a plurality of voice units, each voice unit contains the voice of only one sound production target, and the time points at which each sound production target appears in the audio/video file are recorded.
The method and the device provided by the invention can quickly find files in which the voice or video of a specific person is recorded, thereby improving the user's search efficiency.
Additional aspects and advantages of the invention will be set forth in part in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention.
Drawings
The foregoing and/or additional aspects and advantages of the present invention will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings of which:
FIG. 1 shows a schematic flow diagram according to an embodiment of the invention;
FIG. 2 is a schematic interface diagram of a terminal device before audio acquisition according to an embodiment of the invention;
FIG. 3 shows a flow diagram of audio acquisition according to an embodiment of the invention;
FIG. 4 is a schematic interface diagram of a terminal device during audio acquisition according to an embodiment of the present invention;
FIG. 5 is a schematic interface diagram of a terminal device displaying, after a search, the time points at which voiceprint information marked with a sound production target appears in recorded video and audio files;
FIG. 6 illustrates a flow diagram for viewing a contact media library on a terminal device according to an embodiment of the present invention;
FIG. 7 illustrates a flow diagram for recording a contact's voice according to an embodiment of the present invention;
FIG. 8 shows an overall structural schematic according to an embodiment of the invention;
FIG. 9 shows a schematic structural diagram according to an embodiment of the invention.
Detailed Description
Exemplary embodiments of the present invention will now be described in detail with reference to the accompanying drawings. This invention may, however, be embodied in many different forms and should not be construed as limited to the specific embodiments set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept and scope of the invention to those skilled in the art. The terminology used in the detailed description of the particular exemplary embodiments illustrated in the accompanying drawings is not intended to limit the invention. In the drawings, like numbers refer to like elements.
As used herein, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms "comprises" and/or "comprising", when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. It will be understood that when an element is referred to as being "connected" or "coupled" to another element, it can be directly connected or coupled to the other element, or intervening elements may be present. Further, "connected" or "coupled" as used herein may include wirelessly connected or coupled. As used herein, the term "and/or" includes any and all combinations of one or more of the associated listed items.
It will be understood by those skilled in the art that, unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. It will be further understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the relevant art, and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.
As shown in FIG. 1, the present invention provides a method for operating audio/video files based on voiceprint information, comprising the following steps: S1, collecting voiceprint information of a sound production target; and S2, searching for audio/video files according to the voiceprint information.
For example, step S1 may be implemented as follows: when contact X1 calls user Y, the terminal device starts its built-in recorder to record a segment in which contact X1 speaks alone (for example, 7-10 seconds of speech), and extracts voiceprint information from that segment; after the call ends, the terminal device generates a speaker model M1 from the recorded voiceprint information and stores the sample in a media library; the terminal device then associates the speaker model with the entry for contact X1 in the address book.
As another example, step S1 may be implemented as follows: when user Y takes his son X2 to a park, the terminal device starts the "record voiceprint sample" option in the address-book record of son X2 and records the voiceprint information of son X2; after the recording stops, the terminal device generates a speaker model M2 from the recorded voiceprint information and stores the sample in the terminal memory; the terminal device then associates the speaker model with the file for contact X2 in the media library. It should be understood that a media library is one representation of a collection of stored multimedia files, and may also take the form of a folder, file manager, media manager, video manager, audio manager, and the like. As shown in FIG. 5, when voiceprint information matching the speaker models M1 and M2 is later encountered, the terminal device classifies and marks the corresponding video and audio files according to the specific targets (e.g., "me" and "son"). After the classification is stored, information such as a theme bar, a folder, or a media library of the corresponding category may be generated.
Step S1 may also be implemented as follows: step S11, when a sound production target (for example, Zhang San) is selected in the address-book application, a "record voiceprint sample" option is presented on the display screen; step S12, when the user taps the option, the terminal device collects the voiceprint information and stores the speaker model generated from it in the contact media library; and step S13, when the contact media library page is entered, the display screen presents the searched audio/video files. Thus, collecting voiceprint information of a sound production target includes: collecting voiceprint information when a certain sound production target is selected; and storing the collected voiceprint information.
FIG. 2 is a schematic interface diagram of a terminal device before audio acquisition according to an embodiment of the present invention, and FIG. 3 shows a flow diagram of audio acquisition according to an embodiment of the invention. The audio acquisition process comprises the following steps. Step 101: enter the address book and open a specific contact in the phone book. Step 102: press the "record voiceprint sample" option (as shown in FIG. 2) and record the contact's voice (i.e., collect the contact's voiceprint information). Step 103: after recording finishes, model the contact's voice to generate a speaker model, and store the speaker model in the contact information. Thus, collecting and storing voiceprint information includes: generating a speaker model from the voiceprint information; and storing the speaker model in a local storage module.
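By way of illustration only, the following Python sketch shows one way the enrollment flow of steps 101-103 could be realized; the feature extraction, function names, and pickle-based store are assumptions made for the example, not the patent's implementation (a real system would extract MFCC-type features rather than per-frame log-energies).

    import pickle
    import numpy as np
    from sklearn.mixture import GaussianMixture

    def extract_features(samples, frame=256):
        # Placeholder feature extraction: per-frame log-energy. A real system
        # would use MFCC-type features; this keeps the sketch dependency-free.
        n = len(samples) // frame
        frames = samples[: n * frame].reshape(n, frame)
        return np.log(np.mean(frames ** 2, axis=1, keepdims=True) + 1e-10)

    def enroll_contact(contact_id, samples, store_path="voiceprints.pkl"):
        feats = extract_features(samples)
        model = GaussianMixture(n_components=4).fit(feats)  # the speaker model
        try:
            with open(store_path, "rb") as f:
                store = pickle.load(f)
        except FileNotFoundError:
            store = {}
        store[contact_id] = model  # associate the model with the contact record
        with open(store_path, "wb") as f:
            pickle.dump(store, f)

    # Stand-in for 8 seconds of recorded speech at 8 kHz.
    enroll_contact("contact_X1", np.random.randn(8000 * 8))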
The modeling process according to an embodiment of the invention is as follows. The technique of recognizing a speaker's identity using voiceprint information is referred to as Speaker Recognition (SR), and the corresponding model is referred to as a Speaker Model (SM). A speaker recognition system usually performs modeling with the UBM-GMM method: a Universal Background Model (UBM) is trained from a large amount of training audio (from more than one speaker), and a specific speaker is then modeled on top of the UBM by an adaptive method to obtain the Speaker Model (SM). Both the universal background model and the speaker model generally adopt a Gaussian Mixture Model (GMM) structure.
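The UBM-GMM scheme can be pictured with a short, hedged sketch: a GMM fitted to pooled multi-speaker features serves as the UBM, and its component means are then MAP-adapted toward a small amount of target-speaker speech. The data, dimensions, and relevance factor below are synthetic assumptions for illustration.

    import numpy as np
    from sklearn.mixture import GaussianMixture

    rng = np.random.default_rng(0)
    ubm_feats = rng.normal(size=(5000, 12))           # pooled multi-speaker features
    spk_feats = rng.normal(0.5, 1.0, size=(300, 12))  # a little target-speaker speech

    # Universal background model trained on many speakers.
    ubm = GaussianMixture(n_components=8, covariance_type="diag").fit(ubm_feats)

    def map_adapt_means(ubm, feats, relevance=16.0):
        # MAP adaptation of the UBM component means toward the target speaker.
        post = ubm.predict_proba(feats)            # responsibilities, shape (T, C)
        counts = post.sum(axis=0)                  # soft frame counts per component
        ex = post.T @ feats / np.maximum(counts[:, None], 1e-8)
        alpha = (counts / (counts + relevance))[:, None]
        return alpha * ex + (1 - alpha) * ubm.means_

    speaker_means = map_adapt_means(ubm, spk_feats)  # adapted speaker-model means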
FIG. 4 is a schematic interface diagram of a terminal device during audio acquisition according to an embodiment of the present invention. For example, when recording a voiceprint sample, the contact's voice can be recorded by tapping the "add recorded voiceprint sample" button on the contact interface of the address book (as shown in FIG. 4).
Further, as shown in FIG. 3, the voiceprint recognition process includes the following steps. Step 104: determine an audio/video file. Step 105: perform speaker segmentation on the speech in the audio/video file and generate n voice units, each containing the speech of only a single speaker. Step 106: perform contact voiceprint recognition on each of the n voice units and judge whether it matches. Step 107: if the recognition result matches, build a database of correspondences between contacts and audio/video files on the terminal device. The correspondence database may record the audio/video files in which a contact's voice appears, and may also record the time points at which the contact's audio appears in each file; that is, the time points map to the positions at which the contact's audio/video appears in the corresponding file.
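A minimal sketch of steps 104-107 follows, assuming speaker segmentation has already produced the change points; the acceptance threshold and the per-unit feature matrices are placeholders, and model.score() is scikit-learn's average per-sample log-likelihood.

    import numpy as np
    from sklearn.mixture import GaussianMixture

    def split_into_units(change_points, duration):
        # Units are (start, end) spans between detected speaker-change points.
        bounds = [0.0] + sorted(change_points) + [duration]
        return list(zip(bounds[:-1], bounds[1:]))

    def identify_units(units, unit_feats, speaker_models, threshold=-2.0):
        # Score each single-speaker unit against every enrolled model; matches
        # become {contact: [start times]} entries for the correspondence database.
        hits = {}
        for (start, _end), feats in zip(units, unit_feats):
            for contact, model in speaker_models.items():
                if model.score(feats) > threshold:   # threshold is an assumption
                    hits.setdefault(contact, []).append(start)
        return hits

    rng = np.random.default_rng(2)
    son_model = GaussianMixture(n_components=2).fit(rng.normal(size=(200, 1)))
    units = split_into_units([3.2, 7.9], duration=12.0)
    feats = [rng.normal(size=(50, 1)) for _ in units]    # placeholder unit features
    print(identify_units(units, feats, {"son": son_model}))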
FIG. 6 shows a flow diagram for viewing a contact media library on a terminal device according to an embodiment of the invention. The process may include the following steps. Step 201: open the media library and select the "contact media library" menu. Step 202: read the relational database of contacts and audio/video files. Step 203: after reading completes, display the contacts together with their corresponding media files and time points.
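The correspondence database can be pictured with a small SQLite sketch; the table name, columns, and sample rows are assumptions chosen to mirror the interface of FIG. 5, not the patent's actual schema.

    import sqlite3

    db = sqlite3.connect(":memory:")
    db.executescript("""
    CREATE TABLE appearance (
        contact   TEXT,
        file_path TEXT,
        time_sec  REAL      -- when the contact's voice appears in the file
    );
    INSERT INTO appearance VALUES
        ('son', 'childrens_day.mp4', 225),
        ('son', 'childrens_day.mp4', 1103),
        ('me',  'meeting.m4a',       62);
    """)

    # Reading the library: list each contact's media files and time points.
    for contact, path, t in db.execute(
            "SELECT contact, file_path, time_sec FROM appearance "
            "ORDER BY contact, time_sec"):
        print(f"{contact}: {path} @ {int(t)//60}'{int(t)%60:02d}\"")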
FIG. 5 shows a schematic interface diagram of the terminal device displaying, after a search, the time points at which voiceprint information marked with a sound production target appears and/or ends in recorded video and audio files. For example, when the media library is opened and the "contact media library" menu is selected, an interface for viewing the contact media library is presented to the user. The interface shows the various information obtained by reading the relational database of contacts and audio/video files. Accordingly, searching for audio/video files according to the voiceprint information includes: displaying the audio/video files when the local storage module is opened.
Further, as shown in the interface of FIG. 5, the media library of this embodiment contains two categories of media files, "son" and "me". Under the "Children's Day" item of the "son" category there are three time points: 3'45", 18'23", and 45'34". These are the time points at which the "son" voice appears in the "Children's Day" item. For example, if the user selects 3'45", the terminal device automatically opens the "Children's Day" item and starts playing at 3 minutes 45 seconds. Thus, storing the collected voiceprint information includes classified storage according to the speaker model, and searching for audio/video files according to the voiceprint information includes displaying the audio/video files when the local storage module is opened. The classification includes classifying and displaying the audio/video files according to the speaker model, and the display includes showing the time points at which the sound production target appears in each audio/video file. The classification further includes classified searching of the audio/video files according to the type of sound production target. When a time point in the classified display is selected, the audio/video of the sound production target contained in the audio/video file is played from that point.
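A hedged sketch of the jump-to-time-point behavior: parse an interface label such as 3'45" into seconds and seek a player there. The use of ffplay and its -ss flag is an assumption for the example; the patent does not name a player.

    import re
    import subprocess

    def timepoint_to_seconds(label):
        # Parse an interface label such as 3'45" into seconds.
        m = re.fullmatch(r"(\d+)'(\d+)\"", label)
        return int(m.group(1)) * 60 + int(m.group(2))

    def play_from(path, label):
        # Seek straight to the contact's speaking time (assumes ffplay on PATH).
        subprocess.run(["ffplay", "-ss", str(timepoint_to_seconds(label)), path])

    print(timepoint_to_seconds("3'45\""))   # -> 225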
As shown in FIGS. 1 to 6, according to another embodiment of the present invention, before the terminal device can classify audio/video files by specific contact, the voiceprints of the key contacts in the address book module must first be modeled and stored. In the invention, a voiceprint sample field is added to each contact record in the terminal device's address book module to store that contact's voiceprint sample. The specific operation is as follows: the user creates or edits an important contact of interest (e.g., "child"); a piece of audio is then recorded for that contact (e.g., 7-10 seconds of normal speech); the terminal device models the voiceprint of the contact from this voice sample and stores it in the voiceprint sample field of the contact record in the address book. The user subsequently records and saves audio/video files on the terminal device.

The method and the device analyze the voiceprints of the important contacts, classify files by contact, and mark the time points at which each contact's voice occurs. Speaker segmentation is used to extract and divide the recorded speech of all speakers in an audio/video file into a plurality of voice units, each containing the voice of only one speaker. Voiceprint recognition is then performed on each voice unit using the speaker models. After recognition, a contact/audio-video relational database is stored, recording the correspondence between contacts and audio/video files and the time points at which each contact's voice appears in the files.

The voiceprint mentioned in the invention refers to the acoustic spectrum of a user's voice, a biometric characteristic of that voice. Through voiceprint comparison, the mobile terminal can find the corresponding target in the stored multimedia. Accordingly, when the sound production target is a contact in the contacts application, collecting the voiceprint information of the target may include: during a call with the contact, recording a segment of the contact's voice that is 7-10 seconds or longer and contains only that contact's voice, then extracting the voiceprint information from the segment and generating a voiceprint template. Alternatively, the voiceprint information of the contact may be recorded during a call, or the user may manually record the contact's voice and capture the voiceprint information from it. Further, when the sound production target is a contact in the contacts application, searching for audio/video files includes: when the contact is selected, playing the audio/video mapped to the contact.
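For illustration, the extended contact record could be sketched as follows; the field and class names are assumptions, not the address book module's actual schema.

    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class ContactRecord:
        # Address-book record extended with one voiceprint sample field.
        name: str
        phone: str
        voiceprint_sample: Optional[bytes] = None   # serialized speaker model

        def has_voiceprint(self) -> bool:
            return self.voiceprint_sample is not None

    son = ContactRecord(name="son", phone="555-0101")
    son.voiceprint_sample = b"...serialized GMM bytes..."   # set after enrollment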
FIG. 7 shows a flow diagram for recording a contact's voice according to an embodiment of the invention. The process comprises the following steps. Step 301: open a contact in the address book. Step 302: judge whether this is the first recording.
If it is the first recording, proceed to step 303: start recording. Step 304: save the audio after recording. Step 305: perform voiceprint modeling on the audio. Step 306: save the voiceprint modeling information. Step 307: use the new voiceprint information to identify existing audio/video files. Step 308: save the identified files and time points to the contact/audio-video relational database. Finally, step 309: finish the voiceprint recording work.
If it is not the first recording, proceed to step 310: prompt the user and judge whether to re-record. If re-recording is required, proceed to step 311: delete the original recording file, then perform step 303 and steps 304 to 309 in sequence. If re-recording is not required, no recording is performed and the process ends (step 309).
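The FIG. 7 flow can be outlined in code as below; record_audio, build_speaker_model, and reindex_media are illustrative stubs standing in for the device's recorder, the voiceprint modeling, and the re-identification of stored files.

    import os

    def record_audio(path):              # stub for the device recorder
        open(path, "wb").close()

    def build_speaker_model(path):       # stub for voiceprint modeling
        return {"source": path}

    def reindex_media(contact, model):   # stub for re-identifying stored files
        print(f"re-identified media for {contact} using {model}")

    def record_voiceprint(contact, confirm=lambda prompt: "y"):
        path = f"{contact}.wav"
        if os.path.exists(path):                  # step 302: not the first recording
            if confirm("Re-record voiceprint? (y/n) ") != "y":
                return                            # step 309: finish without recording
            os.remove(path)                       # step 311: delete the old recording
        record_audio(path)                        # steps 303-304: record and save
        model = build_speaker_model(path)         # steps 305-306: model and save
        reindex_media(contact, model)             # steps 307-308: update the database

    record_voiceprint("son")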
According to another embodiment of the present invention, a method for classifying and identifying video and audio on a terminal device based on voiceprint recognition includes the following steps. The contact's voice is recorded so that the voiceprint information can be extracted in advance. The audio/video file is then divided into a plurality of voice units, each containing the voice of only one speaker, and voiceprint recognition is performed on the units one by one. The recognition results are saved to the contact/audio-video relational database. When the contact media library is entered, when the user classifies or searches by contact in any media library or file manager of the terminal device, or when the user views a contact's related audio and video directly in the contacts application, the relational database is read and the contact/audio-video relationships are displayed. The invention can display these relationships not only as a menu item in the media library, but also as a menu in the contacts application or the file manager.
Further, according to another embodiment of the invention, in applications such as the terminal device's media library, contact manager, and file manager, the user may choose to classify or search audio and video by contact. Further, according to another embodiment of the present invention, the audio/video associated with a contact may be viewed directly in the contacts application.
Therefore, the method for operating audio/video files based on voiceprint information can classify audio/video files according to the voiceprint information of a specific contact. When a user wants to find the audio/video files containing a specific contact, the user does not need to play and view the files one by one, but can select them directly from the information displayed by the media library, the contact manager, or the file manager, making it easy to find files containing the voice or image of a specific person. Furthermore, the method can jump directly to the time node at which a given contact speaks in the audio/video and play from there, improving the user's search efficiency.
As shown in FIG. 8, in the overall scheme of the present invention the technique of recognizing a speaker's identity by voiceprint information is referred to as Speaker Recognition (SR), and the corresponding model is the Speaker Model (SM). The speaker recognition system usually adopts the UBM-GMM method for modeling: a Universal Background Model (UBM) is trained from a large amount of training audio (from more than one speaker), and a specific speaker is then modeled on the basis of the UBM by an adaptive method to obtain the Speaker Model (SM). Both the universal background model and the speaker model generally adopt a Gaussian Mixture Model (GMM) structure.

As shown in FIG. 8, the method for operating audio/video files based on voiceprint information provided by the present invention may include a modeling process and a recognition process. The modeling process may include the steps of: step 1, training audio; step 2, silence detection; step 3, voice segmentation; step 4, feature extraction; step 5, adaptation based on the universal background model; step 6, generating the speaker model; step 7, Z-norm processing based on impostor audio; and step 8, normalizing the speaker model. The recognition process may include the steps of: step 1, receiving the audio to be recognized; step 2, silence detection; step 3, voice segmentation; step 4, feature extraction; step 5, calculating scores against the normalized speaker model; step 6, T-norm processing based on impostor audio; step 7, judging; and step 8, outputting the recognition result. The normalized speaker model and the impostor models together constitute the speaker model.

According to one embodiment of the present invention, the modeling process of the speaker model can be roughly described in the following stages. 1. Feature extraction stage: effective speech is detected in the input audio using Voice Activity Detection (VAD) technology, the input audio is divided into several utterances according to the length of silence between them, and the speech features required for speaker recognition are extracted from each utterance. 2. UBM modeling stage: a Universal Background Model (UBM) is computed using the many speech features extracted from the training audio. 3. SM modeling stage: the Speaker Model (SM) is computed by an adaptive method using the universal background model and a small amount of speech features from the specific speaker. 4. SM normalization stage: to strengthen the speaker model against interference, after modeling is complete the speaker model is often normalized using the speech features of some impostor speakers, finally yielding the normalized speaker model (Normalized SM).

According to one embodiment of the present invention, the speaker recognition process can be roughly described in the following stages. 1. Feature extraction stage: the same as the feature extraction stage of the modeling process. 2. Score calculation stage: the score of the input speech features is calculated using the speaker model. 3. Score normalization stage: the score from the previous step is normalized using the normalized speaker model, and the final judgment is made.
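The score-calculation and judgment stages can be sketched as a log-likelihood ratio between the speaker model and the UBM; the features, model sizes, and zero decision threshold below are synthetic assumptions for illustration.

    import numpy as np
    from sklearn.mixture import GaussianMixture

    rng = np.random.default_rng(1)
    ubm = GaussianMixture(n_components=4).fit(rng.normal(size=(2000, 8)))
    sm = GaussianMixture(n_components=4).fit(rng.normal(0.4, 1.0, size=(400, 8)))

    def llr_score(feats, sm, ubm):
        # Average per-frame log-likelihood ratio: speaker model vs. background.
        return sm.score(feats) - ubm.score(feats)

    test = rng.normal(0.4, 1.0, size=(100, 8))
    accepted = llr_score(test, sm, ubm) > 0.0   # zero threshold is an assumption
    print(accepted)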
Further, in the modeling and recognition processes described above, some steps may be implemented in different ways. 1. Silence detection in the feature extraction stage: the application first distinguishes silence from non-silence using the energy and fundamental-frequency information of the input audio, then distinguishes speech from non-speech in the non-silent portion using a Support Vector Machine (SVM) model; once the speech portion is determined, the input audio is divided into several utterances according to the length of the intervals between speech segments. 2. The adaptive method for computing the speaker model from the universal background model: the application combines the Eigenvoice method, the Constrained Maximum Likelihood Linear Regression (CMLLR) method, and the Structured Maximum A Posteriori (SMAP) method. 3. Speaker model normalization: the Z-Norm method is used in the application. 4. Score normalization: the T-Norm method is used in the application. The combination of the Z-Norm and T-Norm methods is currently the most popular normalization approach in speaker recognition technology, the former being used in the modeling phase and the latter in the recognition phase.
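Z-Norm and T-Norm can be illustrated in a few lines: both shift and scale a raw score by impostor-score statistics, gathered per speaker model at enrollment (Z-Norm) or per test utterance against a cohort of impostor models (T-Norm). The numbers below are synthetic.

    import numpy as np

    def z_norm(raw_score, impostor_scores_for_model):
        # Enrollment-time normalization: statistics of impostor utterances
        # scored against this speaker model.
        mu, sigma = np.mean(impostor_scores_for_model), np.std(impostor_scores_for_model)
        return (raw_score - mu) / sigma

    def t_norm(raw_score, cohort_scores_for_utterance):
        # Test-time normalization: statistics of this utterance scored
        # against a cohort of impostor models.
        mu, sigma = np.mean(cohort_scores_for_utterance), np.std(cohort_scores_for_utterance)
        return (raw_score - mu) / sigma

    print(z_norm(1.8, [0.2, -0.1, 0.4, 0.0]))
    print(t_norm(1.8, [0.5, 0.3, 0.6]))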
As shown in FIG. 9, another objective of the present invention is to provide a terminal device, comprising: a voiceprint extraction module for collecting voiceprint information of a sound production target; and an execution module for searching for audio/video files according to the voiceprint information.
Further, the voiceprint extraction module comprises: a voiceprint information acquisition unit for collecting voiceprint information when a certain sound production target is selected; and a voiceprint sample generation unit for generating the speaker model from the voiceprint information.
Further, the device further comprises: a storage module for storing the collected voiceprint information.
Further, the storage module is also used for storing the voiceprint template samples.
Further, the voiceprint extraction module comprises: a target classification unit for carrying out classified storage according to the speaker model.
Further, the device further comprises: a display for displaying the audio/video files when the local storage module is opened.
Further, the display is used for classifying and displaying the audio/video files by type of sound production target according to the target classification unit.
Further, the display is used for displaying the time points at which the sound production target appears in the audio/video files.
Further, the target classification unit is also used for carrying out a classified search of the audio/video files according to the type of the sound production target.
Further, the execution module is also used for playing, when a time point in the classified display is selected, the audio/video of the sound production target contained in the audio/video file.
Further, when the sound production target is a contact in the contacts application, the voiceprint extraction module is used for recording the contact's voiceprint information during a call with the contact.
Further, when the sound production target is a contact in the contacts application, the voiceprint extraction module is used for capturing the contact's voiceprint information when the user manually records the contact's voice.
Further, when the sound production target is a contact in the contacts application, the execution module is also used for playing, when the contact is selected, the audio/video mapped to the contact.
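The module decomposition of FIG. 9 can be sketched at object level as follows; the class and method names are assumptions chosen to mirror the modules recited above, not the device's actual interfaces.

    class VoiceprintExtractionModule:
        # Collects voiceprint information and keeps speaker models per target.
        def __init__(self):
            self.models = {}

        def collect(self, target, samples):
            self.models[target] = ("model", len(samples))   # placeholder model

    class ExecutionModule:
        # Searches the contact/audio-video index built during recognition.
        def __init__(self, index):
            self.index = index            # {target: [(file_path, time_sec)]}

        def search(self, target):
            return self.index.get(target, [])

    class Display:
        # Presents the searched files and the time points per target.
        @staticmethod
        def show(results):
            for path, t in results:
                print(f"{path} @ {t}s")

    execution = ExecutionModule({"son": [("childrens_day.mp4", 225)]})
    Display.show(execution.search("son"))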
The method and the device provided by the invention can quickly find files in which the voice or video of a specific person is recorded, thereby improving the user's search efficiency.
Those skilled in the art will appreciate that the present invention may be directed to an apparatus for performing one or more of the operations described in the present application. The apparatus may be specially designed and constructed for the required purposes, or it may comprise a general-purpose computer selectively activated or reconfigured by a program stored in it. Such a computer program may be stored in a device-readable (e.g., computer-readable) medium, including but not limited to any type of disk (floppy disks, hard disks, optical disks, CD-ROMs, and magneto-optical disks), random access memories (RAMs), read-only memories (ROMs), electrically programmable ROMs (EPROMs), electrically erasable programmable ROMs (EEPROMs), flash memory, magnetic cards, or optical cards, or any type of medium suitable for storing electronic instructions, each coupled to a bus. A readable medium includes any mechanism for storing or transmitting information in a form readable by a device (e.g., a computer); for example, random access memory (RAM), read-only memory (ROM), magnetic disk storage media, optical storage media, flash memory devices, and signals propagating in electrical, optical, acoustical, or other forms (e.g., carrier waves, infrared signals, digital signals).
It will be appreciated by those skilled in the art that the present invention has been described above with reference to block diagrams and/or flowchart illustrations of methods, systems, and computer program products according to embodiments of the invention. It will be understood that each block of the block diagrams and/or flowchart illustrations, and combinations of blocks therein, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general-purpose computer, special-purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, when executed by the processor of the computer or other programmable data processing apparatus, create means for implementing the methods specified in the block or blocks of the block diagrams and/or flowcharts.
Those skilled in the art will appreciate that the various operations, methods, steps, measures, and schemes discussed in the present application may be alternated, modified, combined, or deleted. Further, steps and measures within the various operations, methods, and flows discussed in this disclosure may also be alternated, modified, rearranged, decomposed, combined, or deleted. Further, prior-art steps, measures, and schemes corresponding to the various operations, methods, and flows disclosed in the present invention may likewise be alternated, modified, rearranged, decomposed, combined, or deleted.
Exemplary implementations of the present invention are disclosed in the drawings and the description. Although specific terms are employed, they are used in a generic and descriptive sense only and not for purposes of limitation. It should be noted that those skilled in the art can make several improvements and modifications without departing from the principle of the present invention, and these improvements and modifications should also fall within the protection scope of the present invention. The scope of the invention is defined by the claims of the present application.

Claims (27)

1. A method for operating an audio/video file based on voiceprint information, comprising the steps of:
collecting voiceprint information of a sound production target, wherein the voiceprint information comprises an acoustic spectrum; and
searching for audio/video files according to the voiceprint information and a speaker model, and displaying the searched audio/video files marked with the sound production target, wherein the speaker model is obtained by modeling a specific speaker through a trained universal background model.
2. The method of claim 1, wherein the collecting voiceprint information of a sound production target comprises:
collecting voiceprint information when a certain sound production target is selected; and
storing the collected voiceprint information.
3. The method of claim 2, wherein collecting and storing voiceprint information comprises:
generating the speaker model from the voiceprint information; and
storing the speaker model in a local storage module.
4. The method of claim 3, wherein storing the collected voiceprint information comprises:
carrying out classified storage according to the speaker model.
5. The method of claim 3, wherein displaying the searched audio/video files marked with the sound production target comprises:
displaying the searched audio/video files marked with the sound production target when the local storage module is opened.
6. The method of claim 4, wherein the classifying comprises:
classifying and displaying the audio/video files according to the speaker model.
7. The method of claim 1, wherein the displaying comprises:
displaying the time points at which the sound production target appears in an audio/video file;
wherein all recorded sound in the audio/video file is divided into a plurality of voice units, each voice unit contains the voice of only one sound production target, the time points at which each sound production target appears in the audio/video file are recorded, and the positions at which the audio/video appears in the corresponding file are mapped by the time points.
8. The method of claim 4, wherein the classifying comprises:
carrying out a classified search on the audio/video files according to the type of the sound production target.
9. The method of claim 7, wherein:
when a time point in the classified display is selected, the audio/video of the sound production target contained in the audio/video file is played from that time point.
10. The method of claim 1, wherein when the sound production target is a contact in a contacts application, the collecting voiceprint information of the target comprises:
recording the voiceprint information of the contact during a call with the contact.
11. The method of claim 1, wherein when the sound production target is a contact in a contacts application, the collecting voiceprint information of the target comprises:
recording the voice of the contact manually by the user, and capturing the voiceprint information of the contact.
12. The method of claim 1, wherein when the sound production target is a contact in a contacts application, the searching for audio/video files comprises:
when the contact is selected, playing the audio/video mapped to the contact.
13. A terminal device, comprising:
a voiceprint extraction module for collecting voiceprint information of a sound production target, wherein the voiceprint information comprises an acoustic spectrum;
an execution module for searching for audio/video files according to the voiceprint information and a speaker model, wherein the speaker model is obtained by modeling a specific speaker through a trained universal background model; and
a display for displaying the searched audio/video files marked with the sound production target.
14. The terminal device of claim 13, wherein the voiceprint extraction module comprises:
a voiceprint information acquisition unit for collecting voiceprint information when a certain sound production target is selected; and
a voiceprint sample generation unit for generating a speaker model from the voiceprint information.
15. The terminal device according to claim 14, further comprising:
a storage module for storing the collected voiceprint information.
16. The terminal device of claim 15, wherein the storage module is further configured to store the speaker model.
17. The terminal device according to claim 14 or 16, wherein the voiceprint extraction module comprises:
a target classification unit for carrying out classified storage according to the speaker model.
18. The terminal device according to claim 15, wherein the display displays the searched audio/video files marked with the sound production target when the local storage module is opened.
19. The terminal device of claim 17, wherein the display is configured to:
classifying and displaying the audio/video files by type of sound production target according to the target classification unit.
20. The terminal device of claim 13, wherein the display is configured to:
displaying the time points at which the sound production target appears in an audio/video file; wherein all recorded sound in the audio/video file is divided into a plurality of voice units, each voice unit contains the voice of only one sound production target, the time points at which each sound production target appears in the audio/video file are recorded, and the positions at which the audio/video appears in the corresponding file are mapped by the time points.
21. The terminal device of claim 17, wherein the target classification unit is further configured to:
performing a classified search on the audio/video files according to the type of the sound production target.
22. The terminal device of claim 19, wherein the execution module is further configured to:
when a time point in the classified display is selected, play the audio/video of the sound production target contained in the audio/video file from that time point.
23. The terminal device of claim 13, wherein when the sound production target is a contact in a contacts application, the voiceprint extraction module is configured to:
record the voiceprint information of the contact during a call with the contact.
24. The terminal device of claim 13, wherein when the sound production target is a contact in a contacts application, the voiceprint extraction module is configured to:
capture the voiceprint information of the contact when the user manually records the contact's voice.
25. The terminal device of claim 13, wherein when the sound production target is a contact in a contacts application, the execution module is further configured to:
when the contact is selected, play the audio/video mapped to the contact.
26. An electronic device, comprising:
one or more processors;
a memory;
one or more computer programs, wherein the one or more computer programs are stored in the memory and configured to be executed by the one or more processors, the one or more programs being configured to perform the method of operating on an audio/video file based on voiceprint information according to any one of claims 1 to 12.
27. A computer readable storage medium storing at least one instruction, at least one program, a set of codes, or a set of instructions, which is loaded and executed by a processor to implement the method of operating on an audio/video file based on voiceprint information as claimed in any one of claims 1 to 12.
CN201710439537.1A 2012-12-05 2012-12-05 Method and device for operating audio/video file based on voiceprint information Active CN107274916B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710439537.1A CN107274916B (en) 2012-12-05 2012-12-05 Method and device for operating audio/video file based on voiceprint information

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201710439537.1A CN107274916B (en) 2012-12-05 2012-12-05 Method and device for operating audio/video file based on voiceprint information
CN201210518118.4A CN103035247B (en) 2012-12-05 2012-12-05 Based on the method and device that voiceprint is operated to audio/video file

Related Parent Applications (1)

Application Number Title Priority Date Filing Date
CN201210518118.4A Division CN103035247B (en) 2012-12-05 2012-12-05 Based on the method and device that voiceprint is operated to audio/video file

Publications (2)

Publication Number Publication Date
CN107274916A CN107274916A (en) 2017-10-20
CN107274916B true CN107274916B (en) 2021-08-20

Family

ID=48022078

Family Applications (2)

Application Number Title Priority Date Filing Date
CN201210518118.4A Active CN103035247B (en) 2012-12-05 2012-12-05 Based on the method and device that voiceprint is operated to audio/video file
CN201710439537.1A Active CN107274916B (en) 2012-12-05 2012-12-05 Method and device for operating audio/video file based on voiceprint information

Family Applications Before (1)

Application Number Title Priority Date Filing Date
CN201210518118.4A Active CN103035247B (en) 2012-12-05 2012-12-05 Based on the method and device that voiceprint is operated to audio/video file

Country Status (1)

Country Link
CN (2) CN103035247B (en)


Also Published As

Publication number Publication date
CN103035247B (en) 2017-07-07
CN103035247A (en) 2013-04-10
CN107274916A (en) 2017-10-20


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant