WO2020154883A1 - Voice information processing method and apparatus, storage medium, and electronic device - Google Patents

Voice information processing method and apparatus, storage medium, and electronic device

Info

Publication number
WO2020154883A1
WO2020154883A1 (PCT/CN2019/073642)
Authority
WO
WIPO (PCT)
Prior art keywords
information
target
voice
voiceprint
voiceprint parameter
Prior art date
Application number
PCT/CN2019/073642
Other languages
English (en)
French (fr)
Inventor
叶青
Original Assignee
深圳市欢太科技有限公司
Oppo广东移动通信有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳市欢太科技有限公司, Oppo广东移动通信有限公司 filed Critical 深圳市欢太科技有限公司
Priority to CN201980076330.XA priority Critical patent/CN113056784A/zh
Priority to PCT/CN2019/073642 priority patent/WO2020154883A1/zh
Publication of WO2020154883A1 publication Critical patent/WO2020154883A1/zh

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/02 - Feature extraction for speech recognition; Selection of recognition unit
    • G10L17/00 - Speaker identification or verification

Definitions

  • the present invention relates to the field of voice processing, in particular to a method, device, storage medium and electronic equipment for processing voice information.
  • the voice information processing method, device, storage medium, and electronic equipment provided in the embodiments of the present application can improve the accuracy of voice information processing.
  • an embodiment of the application provides a method for processing voice information, including: collecting voice information of a target user, and extracting target voice feature information of the voice information; inputting the target voice feature information into a preset model to obtain target voiceprint parameters; obtaining voice information to be recognized in a playing video, and extracting a first voiceprint parameter of the voice information to be recognized; and matching the first voiceprint parameter with the target voiceprint parameters, obtaining identification information of the matched target voiceprint parameter according to the matching result, and marking the identification information into the playing video.
  • an embodiment of the present application provides a voice information processing device, including:
  • the collecting unit is configured to collect voice information of the target user and extract the target voice feature information of the voice information;
  • the input unit is configured to input the target voice feature information into a preset model to obtain target voiceprint parameters;
  • the acquiring unit is configured to acquire the voice information to be recognized in the playing video and extract the first voiceprint parameter of the voice information to be recognized;
  • the matching unit is configured to match the first voiceprint parameter with the target voiceprint parameters, obtain identification information of the matched target voiceprint parameter according to the matching result, and mark the identification information into the playing video.
  • the storage medium provided by the embodiment of the present application has a computer program stored thereon, and when the computer program runs on a computer, the computer is caused to execute the voice information processing method provided in any embodiment of the present application .
  • the electronic device includes a processor and a memory, the memory stores a computer program, and the processor is configured to execute, by calling the computer program, the steps of: collecting voice information of a target user and extracting target voice feature information of the voice information; inputting the target voice feature information into a preset model to obtain target voiceprint parameters; obtaining voice information to be recognized in a playing video and extracting a first voiceprint parameter of the voice information to be recognized; and matching the first voiceprint parameter with the target voiceprint parameters, obtaining identification information of the matched target voiceprint parameter according to the matching result, and marking the identification information into the playing video.
  • FIG. 1 is a schematic flowchart of a method for processing voice information provided by an embodiment of the present application.
  • FIG. 2 is a schematic diagram of another flow chart of a voice information processing method provided by an embodiment of the application.
  • FIG. 3 is a schematic diagram of modules of a voice information processing apparatus provided by an embodiment of the application.
  • FIG. 4 is a schematic diagram of another module of the voice information processing apparatus provided by an embodiment of the application.
  • FIG. 5 is a schematic structural diagram of an electronic device provided by an embodiment of the application.
  • FIG. 6 is a schematic diagram of another structure of an electronic device provided by an embodiment of the application.
  • the term "module" used herein may be regarded as a software object executed on the computing system.
  • the different components, modules, engines, and services described herein may be regarded as objects implemented on the computing system.
  • the apparatus and method described herein are preferably implemented in software, but may also be implemented in hardware, both of which fall within the protection scope of the present application.
  • the embodiment of the present application provides a method for processing voice information.
  • the execution subject of the voice information processing method may be the voice information processing apparatus provided in the embodiments of the present application, or an electronic device integrating the voice information processing apparatus, where the voice information processing apparatus may be implemented in hardware or software.
  • the electronic device may be a smart phone, a tablet computer, a PDA (Personal Digital Assistant), etc.
  • the embodiment of the present invention provides a video and voice processing method, including: collecting voice information of a target user and extracting target voice feature information of the voice information; inputting the target voice feature information into a preset model to obtain target voiceprint parameters; obtaining voice information to be recognized in a playing video and extracting a first voiceprint parameter of the voice information to be recognized; and matching the first voiceprint parameter with the target voiceprint parameters, obtaining identification information of the matched target voiceprint parameter according to the matching result, and marking the identification information into the playing video.
  • in one implementation, before the step of inputting the target voice feature information into the preset model, the method may further include: training background data through a preset algorithm to generate a preset model containing common voice feature information corresponding to each target user, where the background data includes the voice information of each target user.
  • the step of inputting the target voice feature information into the preset model to obtain the target voiceprint parameters may include: inputting the target voice feature information into the preset model to obtain target difference feature information corresponding to the common voice feature information; determining a second voiceprint parameter according to the target difference feature information; and performing channel compensation on the second voiceprint parameter to obtain the corresponding target voiceprint parameter.
  • the step of performing channel compensation on the second voiceprint parameter may include: performing channel compensation on the second voiceprint parameter by using a linear discriminant analysis method.
  • the step of matching the first voiceprint parameter with the target voiceprint parameter and obtaining identification information of the matched target voiceprint parameter according to the matching result may include: matching the first voiceprint parameter with the target voiceprint parameter to generate a corresponding matching value; and when the matching value is greater than a preset threshold, obtaining the identification information of the matched target voiceprint parameter.
  • the step of obtaining the identification information of the matched target voiceprint parameter may include: sorting the matching values, obtaining the largest matching value among the matching values greater than the preset threshold, obtaining the matched target voiceprint parameter according to the largest matching value, and obtaining the corresponding identification information according to that target voiceprint parameter.
  • the step of marking the identification information into the playing video includes: inputting the voice information to be recognized into a voice recognition model to generate corresponding text information; combining the identification information with the text information to generate subtitle information corresponding to the voice information to be recognized; and marking the subtitle information into the playing video.
  • FIG. 1 is a schematic flowchart of a method for processing voice information provided by an embodiment of this application.
  • the method for processing voice information may include the following steps:
  • step S101 voice information of the target user is collected, and target voice feature information of the voice information is extracted.
  • the target user may refer to a main speaker in the video. It is understandable that in interviews, movies, variety shows, and other types of video, the speech is in most cases concentrated in a limited number of roles. For example, in an interview video the target users are the host and the interview guests; in a movie or TV series the target users are the actors with larger roles; and in a music video (MV) of an idol group the target users are all members of the group.
  • the voice information of the target user refers to annotated voice information, so the voice information of the target user includes the identification information of the target user.
  • the identification information may refer to identity information of the target user, such as name, gender, age, title, and other personal information.
  • the target voice feature information refers to the voiceprint feature information of the target voice. It is understandable that the vocal organs used in speech, such as the tongue, teeth, larynx, lungs, and nasal cavity, vary greatly in size and shape from person to person, so each person's voiceprint is different. Voiceprint feature information is therefore a feature unique to each person, just as each person has a unique fingerprint. Further, the target voice feature information includes Mel-frequency cepstral coefficients (MFCC) of the target voice information.
  • in some embodiments, to ensure the stability of the target voice feature information, the voice information can be de-silenced and de-noised to generate processed voice information; the target voice feature information of the processed voice information is then extracted and processed with feature mean-variance normalization and feature warping.
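A minimal sketch of this kind of voiceprint front end is shown below. It assumes the librosa library; the sampling rate, frame lengths, silence threshold, and the use of simple per-utterance mean-variance normalization (feature warping is omitted) are illustrative assumptions rather than values specified by this application.

```python
import numpy as np
import librosa

def extract_mfcc_features(wav_path, sr=16000, n_mfcc=20):
    """Illustrative front end: silence removal, MFCC extraction, and
    cepstral mean-variance normalization. All parameter values are
    assumptions for this sketch."""
    signal, sr = librosa.load(wav_path, sr=sr)

    # Rough de-silencing: keep only intervals within 30 dB of the peak.
    intervals = librosa.effects.split(signal, top_db=30)
    voiced = np.concatenate([signal[s:e] for s, e in intervals])

    # ~32 ms analysis window with a 10 ms hop -> matrix (n_mfcc, n_frames).
    mfcc = librosa.feature.mfcc(y=voiced, sr=sr, n_mfcc=n_mfcc,
                                n_fft=512, hop_length=int(0.010 * sr))

    # Feature mean-variance normalization along the time axis.
    mfcc = (mfcc - mfcc.mean(axis=1, keepdims=True)) / \
           (mfcc.std(axis=1, keepdims=True) + 1e-8)
    return mfcc.T  # (n_frames, n_mfcc)
```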
  • step S102 the target voice feature information is input into a preset model to obtain target voiceprint parameters.
  • the preset model may refer to a Universal Background Model (UBM). The target voice feature information, that is, the target voiceprint features, is input into the UBM to obtain target voiceprint parameters that include the identification information of the target user.
  • different target voiceprint parameters correspond to the identification information of different target users; that is, the target voiceprint parameter of each piece of voice information can determine the target user of that piece of voice information, and if different voice segments output the same target voiceprint parameter, the speakers of those segments can be regarded as the same user.
  • the process of inputting the target voice feature information into the preset model to obtain the target voiceprint parameters is the process of establishing a voiceprint model from the target voiceprint parameters. It is understandable that different target voiceprint parameters correspond to the voiceprint models of different target users.
  • before the step of inputting the target voice feature information into the preset model, the method may further include: training the background data through a preset algorithm to generate a preset model containing common voice feature information corresponding to each target user, where the background data includes the voice information of each target user.
  • the step of inputting the target voice feature information into the preset model to obtain target voiceprint parameters may include:
  • the preset algorithm can be the EM algorithm. The background data, that is, the target voice feature information in the background data, is trained through the EM algorithm to generate the universal background model, and the common voice feature information corresponding to each target user is obtained through the UBM; the common voice feature information is the common voiceprint feature obtained from all target users.
  • after the target voice feature information is input into the UBM, the target difference feature information corresponding to the common voice feature information can be calculated from the target voice feature information and the common voice feature information, and the second voiceprint parameter corresponding to each piece of voice information is determined according to the target difference feature information, where the second voiceprint parameter includes the identification information of the target user. It is understandable that, because of the uniqueness of voiceprints, the target voice feature information of different target users differs, so the target difference feature information obtained relative to the common voice feature information amplifies the difference between pieces of target voice feature information; compared with the target voice feature information itself, the target difference feature information therefore determines the target user of each piece of voice information more accurately.
  • because the voice information in the background data and the voice information to be recognized are collected over different transmission channels, a large channel difference exists, which degrades recognition performance. Channel compensation is therefore performed on the second voiceprint parameter so that intra-class differences are minimized and inter-class differences are maximized, yielding a low-dimensional, easily distinguishable target voiceprint parameter.
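The following sketch illustrates the UBM idea on a small scale, using scikit-learn's GaussianMixture (which is fitted with EM) as a stand-in for a production UBM and Baum-Welch statistics as the per-utterance "difference" information; the component count, diagonal covariances, and data layout are assumptions of this sketch, and a real system would train on far more background data.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def train_ubm(background_features, n_components=64, seed=0):
    """Fit a GMM on pooled background MFCC frames as a stand-in UBM.
    background_features: list of (n_frames, n_dims) arrays, one per
    labeled target-user utterance."""
    pooled = np.vstack(background_features)
    ubm = GaussianMixture(n_components=n_components, covariance_type="diag",
                          max_iter=200, random_state=seed)
    ubm.fit(pooled)  # EM training
    return ubm

def utterance_stats(ubm, features):
    """Zeroth- and centered first-order statistics of one utterance
    against the UBM; they describe how the utterance deviates from the
    common model and feed the voiceprint-parameter (i-vector) step."""
    post = ubm.predict_proba(features)        # (n_frames, n_components)
    n = post.sum(axis=0)                      # zeroth-order stats
    f = post.T @ features                     # first-order stats
    f_centered = f - n[:, None] * ubm.means_  # center on UBM means
    return n, f_centered
```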
  • step S103 the voice information to be recognized in the playback video is obtained, and the first voiceprint parameter of the voice information to be recognized is extracted.
  • the manner of obtaining the to-be-recognized voice information in the playing video may include obtaining, in real time, the to-be-recognized voice information of a video being played or of a live video, or obtaining the to-be-recognized voice information of a locally stored video.
  • the method of extracting the first voiceprint parameter of the voice information to be recognized is the same as the process of extracting the target voiceprint parameter: the first voice feature information is extracted from the voice information to be recognized and input into the preset model, the first difference feature information corresponding to the common voice feature information is calculated from the common voice feature information of each target user in the preset model and the first voice feature information, and the first voiceprint parameter corresponding to the voice information to be recognized is determined from the first difference feature information.
  • step S104 the first voiceprint parameter is matched with the target voiceprint parameters, the identification information of the matched target voiceprint parameter is obtained according to the matching result, and the identification information is marked into the playing video.
  • the first voiceprint parameter is matched with the target voiceprint parameters and a corresponding matching result is obtained. According to the matching result, the target voiceprint parameter matching the first voiceprint parameter can be determined; since each target voiceprint parameter corresponds to the information of a target user, that is, the target voiceprint parameter includes the identification information of the corresponding target user, the information of the target user corresponding to the first voiceprint parameter can be confirmed from the matched target voiceprint parameter.
  • the step of matching the first voiceprint parameter with the target voiceprint parameter, and obtaining identification information of the matched target voiceprint parameter according to the matching result may include:
  • when the matching value is greater than the preset threshold, the first voiceprint parameter is highly similar to the matched target voiceprint parameter, and it can be determined that the speaker corresponding to the first voiceprint parameter and the target user corresponding to the matched target voiceprint parameter are the same user; the identification information of the matched target voiceprint parameter can therefore be obtained as the identification information of the voice to be recognized corresponding to the first voiceprint parameter.
  • the step of marking the identification information into the playing video may include: inputting the voice information to be recognized into a voice recognition model to generate corresponding text information; combining the identification information with the text information to generate subtitle information corresponding to the voice information to be recognized; and marking the subtitle information into the playing video.
  • while the voice information to be recognized is input into the preset model to obtain the identification information, the voice to be recognized is simultaneously input into the voice recognition model to obtain the text information; the time information corresponding to the text information and to the identification information is recorded, the identification information and the text information are combined according to the time information to generate the subtitle information of the voice information to be recognized, and the subtitle information is marked into the playing video according to the time information.
  • the subtitle information may be marked at a preset position of the playing video in a preset combination; for example, the identification information and the subtitle text are combined side by side and marked at the lower part of the playing video picture.
  • alternatively, the identification information in the subtitle information is marked in a special form in a first area of the playing video, and the text information is marked in a different form in a second area of the playing video; for example, the identification information is added at the upper part of the playing video picture with a font size smaller than that of the text information, and the text information is added at the lower part of the playing video picture.
  • in summary, the voice information processing method provided by this embodiment collects the voice information of a target user and extracts the target voice feature information; inputs the target voice feature information into a preset model to obtain target voiceprint parameters; obtains the voice information to be recognized in a playing video and extracts its first voiceprint parameter; matches the first voiceprint parameter with the target voiceprint parameters, obtains the identification information of the matched target voiceprint parameter according to the matching result, and marks the identification information into the playing video.
  • in this way the identification information of the target user, such as identity information, can be marked into the playing video, helping the user better understand the content of the video and ensuring the user experience.
  • meanwhile, the identification information is added to the playing video automatically through voiceprint recognition, which greatly reduces manual operation and saves labor cost.
  • FIG. 2 is a schematic diagram of another process of a voice information processing method provided by an embodiment of the application.
  • the method includes the following steps:
  • step S201 voice information of the target user is collected, and target voice feature information of the voice information is extracted.
  • the voice information of the target user refers to annotated voice information, so the voice information of the target user contains the identification information of the target user.
  • the identification information may refer to identity information of the target user, such as name, gender, age, title, and other personal information.
  • the target voice feature information refers to the voiceprint feature information of the target voice; since voiceprint feature information is unique to each person, the user information corresponding to the voice information can be distinguished according to the voiceprint features.
  • in some embodiments, to ensure the stability of the target voice feature information, the voice information can be de-silenced and de-noised to generate processed voice information; the target voice feature information of the processed voice information is then extracted and processed with feature mean-variance normalization and feature warping.
  • step S202 the background data is trained by a preset algorithm to generate a preset model including common voice feature information corresponding to each target user, and the background data includes voice information of each target user.
  • the preset algorithm can be the EM algorithm. The background data, that is, the target voice feature information in the background data, is trained through the EM algorithm to generate the universal background model, and the common voice feature information corresponding to each target user is obtained through the UBM.
  • the common voice feature information is the common voiceprint feature obtained from all target users.
  • step S203 the target voice feature information is input into a preset model to obtain target difference feature information corresponding to the common voice feature information.
  • each segment of voice target feature information is input into the preset model.
  • the target difference feature information can be obtained based on the target voice feature information corresponding to each segment of voice and the common voice feature information of all target users obtained in step S202.
  • step S204 the second voiceprint parameter is determined according to the target difference feature information.
  • the target difference feature information is transformed through a Total Variability Space (TVS) based model to obtain the second voiceprint parameter; the total-variability matrix of the total variability space can be estimated with the EM algorithm.
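In the standard total-variability formulation that TVS-based models of this kind rely on, the relation can be written as follows; the symbols are the conventional i-vector notation, not notation defined by this application.

```latex
% Utterance-dependent GMM mean supervector M decomposed as
%   M = m + T w
% m : UBM mean supervector (the "common" voiceprint information)
% T : low-rank total variability matrix, estimated with EM
% w : latent total-variability factor; its posterior mean \hat{w}
%     (the i-vector) serves as the second voiceprint parameter
M = m + T\,w, \qquad \hat{w} = \mathbb{E}\left[\, w \mid \text{utterance} \,\right]
```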
  • step S205 a linear discriminant analysis method is used to perform channel compensation on the second voiceprint parameter to obtain the corresponding target voiceprint parameter.
  • to reduce the loss of recognition accuracy caused by channel differences, linear discriminant analysis (LDA) can be used for channel compensation. LDA uses label information to find the optimal projection directions, so that the projected sample set has the smallest intra-class difference and the largest inter-class difference.
  • when applied to voiceprint recognition, the voiceprint parameter vectors of the same speaker form one class: minimizing the intra-class difference reduces the variation caused by the channel, while maximizing the inter-class difference enlarges the difference information between speakers.
  • linear discriminant analysis therefore yields easily distinguishable, low-dimensional target voiceprint parameters.
  • the process of obtaining the target voiceprint parameters according to the target voice feature information at this time is the process of establishing the corresponding voiceprint model.
  • the voiceprint model is the i-vector voiceprint model corresponding to each target user.
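A compact sketch of LDA-based channel compensation is given below, using scikit-learn's LinearDiscriminantAnalysis on enrolled i-vectors; the output dimensionality and the reuse of the fitted projection on test vectors are assumptions of this sketch, not requirements stated by the application.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def lda_channel_compensation(ivectors, speaker_labels, out_dim=None):
    """Project second voiceprint parameters (e.g. i-vectors) so that
    intra-speaker variation shrinks and inter-speaker variation grows.
    ivectors: (n_utterances, dim) array; speaker_labels: one label per row."""
    n_classes = len(set(speaker_labels))
    # LDA can produce at most (n_classes - 1) output dimensions.
    if out_dim is None:
        out_dim = n_classes - 1
    lda = LinearDiscriminantAnalysis(n_components=min(out_dim, n_classes - 1))
    compensated = lda.fit_transform(np.asarray(ivectors), speaker_labels)
    return lda, compensated  # reuse lda.transform(...) on test i-vectors
```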
  • step S206 the voice information to be recognized in the playback video is obtained, and the first voiceprint parameter of the voice information to be recognized is extracted.
  • the voice feature information of the voice information to be recognized is extracted and input into the voiceprint model of step S205, and the corresponding difference feature information is obtained according to the common voice feature information in the UBM: the first voice feature information is input into the preset model, the first difference feature information corresponding to the common voice feature information is calculated from the common voice feature information of each target user in the preset model and the first voice feature information, the first voiceprint parameter corresponding to the voice information to be recognized is determined from the first difference feature information, and channel compensation is performed on the first voiceprint parameter to obtain the processed first voiceprint parameter.
  • the step of extracting the first voiceprint parameter of the voice information to be recognized is the same as the step of extracting the target voiceprint parameter described above and is not repeated here.
  • step S207 the first voiceprint parameter is matched with the target voiceprint parameter to generate a corresponding matching value.
  • the first voiceprint parameter is matched for similarity against the target voiceprint parameter of each target user to generate the corresponding matching values.
  • step S208 when the matching value is greater than the preset threshold, the matching values are sorted, the maximum matching value among the matching values greater than the preset threshold is obtained, and the matched target voiceprint parameter is obtained according to the maximum matching value.
  • when a matching value is greater than the preset threshold, for example 0.8, the first voiceprint parameter is successfully matched with the corresponding target voiceprint parameter, and the target user corresponding to the first voiceprint parameter and the target user corresponding to that target voiceprint parameter are most likely the same user. If multiple matching values are greater than the preset threshold, those matching values are sorted to obtain the largest one; the target user corresponding to the first voiceprint parameter and the target user corresponding to the target voiceprint parameter with the largest matching value are then regarded as the same person, and the target voiceprint parameter corresponding to the largest matching value is obtained.
  • in some embodiments, when all matching values are smaller than the preset threshold, the first voiceprint parameter does not match any target voiceprint parameter, that is, the speaker of the voice to be recognized does not match any target user in the model, and the voiceprint model returns a no-match result.
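A sketch of the matching and selection logic described above follows. Cosine similarity and the example threshold of 0.8 are used here for illustration; the application does not fix a particular similarity measure, so the scoring function is an assumption.

```python
import numpy as np

def match_voiceprint(first_param, target_params, threshold=0.8):
    """Match a test voiceprint parameter against enrolled target parameters.
    target_params: dict mapping identification information (e.g. a name)
    to the enrolled voiceprint vector. Returns the best identification
    information, or None when every matching value is below the threshold."""
    def cosine(a, b):
        a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

    scores = {ident: cosine(first_param, vec)
              for ident, vec in target_params.items()}
    accepted = {i: s for i, s in scores.items() if s > threshold}
    if not accepted:
        return None, scores                   # no-match result
    best = max(accepted, key=accepted.get)    # largest matching value wins
    return best, scores
```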
  • step S209 corresponding identification information is obtained according to the target voiceprint parameters.
  • the target voiceprint parameter contains the identification information of the target user, the corresponding identification information can be obtained according to the successfully matched target voiceprint parameter.
  • step S210 input the voice information to be recognized into the voice recognition model to generate corresponding text information.
  • while the voice information to be recognized is input into the voiceprint model to obtain the identification information, it is simultaneously input into the voice recognition model to obtain the text information.
  • step S211 the identification information is combined with the text information to generate caption information corresponding to the voice information to be recognized.
  • the identification information and the text information are obtained, the text information and the time information corresponding to the identification information are recorded respectively, and the identification information and the text information are combined according to the time information to generate caption information of the voice information to be recognized.
  • step S212 the subtitle information is marked into the playing video.
  • the subtitle information is marked into a preset area of the playing video according to the time information, so that the subtitle information is synchronized with the voice information in the playing video.
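As an illustration of combining the identification information, the recognized text, and the time information into subtitles, the sketch below renders SRT-style entries; the segment structure and the SRT output format are assumptions chosen for the example, since the application only requires that the combined subtitle information stay synchronized with the playing video.

```python
def build_srt(segments):
    """Combine speaker identification information, recognized text, and
    time information into SRT-style subtitles. segments: list of dicts
    with keys 'start', 'end' (seconds), 'speaker', and 'text' (an
    assumed structure produced by upstream voiceprint matching and
    speech recognition)."""
    def ts(seconds):
        total_ms = int(round(seconds * 1000))
        h, rem = divmod(total_ms, 3_600_000)
        m, rem = divmod(rem, 60_000)
        s, ms = divmod(rem, 1000)
        return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

    lines = []
    for i, seg in enumerate(segments, start=1):
        lines.append(str(i))
        lines.append(f"{ts(seg['start'])} --> {ts(seg['end'])}")
        lines.append(f"{seg['speaker']}: {seg['text']}")
        lines.append("")  # blank line between entries
    return "\n".join(lines)

# Example: build_srt([{"start": 1.0, "end": 3.2, "speaker": "Host", "text": "Welcome."}])
```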
  • in summary, the voice information processing method provided by this embodiment collects the voice information of a target user and extracts the target voice feature information; inputs the target voice feature information into the preset model to obtain target voiceprint parameters; obtains the voice information to be recognized in the playing video and extracts its first voiceprint parameter; matches the first voiceprint parameter with the target voiceprint parameters, obtains the identification information of the matched target voiceprint parameter according to the matching result, and marks the identification information into the playing video.
  • in this way the identification information of the target user, such as identity information, can be marked into the playing video, helping the user better understand the video content and ensuring the user experience.
  • in addition, using voice recognition together with voiceprint recognition to automatically add subtitle information to the video greatly reduces manual annotation and saves labor cost.
  • the embodiment of the present application also provides an apparatus based on the foregoing voice information processing method.
  • the meaning of the noun is the same as in the above-mentioned voice information processing method, and the specific implementation details can refer to the description in the method embodiment.
  • the embodiment of the present invention provides a video and voice processing device, including:
  • the collection unit is used to collect voice information of the target user, and extract the target voice feature information of the voice information
  • the input unit is used to input target voice feature information into a preset model to obtain target voiceprint parameters
  • the acquiring unit is configured to acquire the voice information to be recognized in the playback video, and extract the first voiceprint parameter of the voice information to be recognized;
  • the matching unit is configured to match the first voiceprint parameter with the target voiceprint parameter, obtain identification information of the matched target voiceprint parameter according to the matching result, and identify the identification information in the playback video.
  • the apparatus may further include: a training unit, configured to train background data through a preset algorithm to generate a preset model containing common voice feature information corresponding to each target user, the background data including the voice information of each target user.
  • the input unit may include: an input subunit for inputting the target voice feature information into a preset model to obtain target difference feature information corresponding to the common voice feature information; and a determining subunit for The second voiceprint parameter is determined according to the target difference feature information; the processing subunit is used to perform channel compensation on the second voiceprint parameter to obtain the corresponding target voiceprint parameter.
  • the matching unit may include: a matching subunit, configured to match the first voiceprint parameter with the target voiceprint parameter to generate a corresponding matching value; and an acquiring subunit, configured to obtain the identification information of the matched target voiceprint parameter when the matching value is greater than a preset threshold.
  • the matching unit may further include: a generating subunit, configured to input the to-be-recognized voice information into a voice recognition model to generate corresponding text information; a combining subunit, configured to combine the identification information with the text information to generate subtitle information corresponding to the voice information to be recognized; and a marking subunit, configured to mark the subtitle information into the playing video.
  • the voice information processing device 300 includes: a collection unit 31, an input unit 32, an acquisition unit 33, and a matching unit 34.
  • the collection unit 31 is used to collect voice information of the target user, and extract target voice feature information of the voice information.
  • the voice information of the target user collected by the collecting unit 31 refers to annotated voice information, so the voice information of the target user contains the identification information of the target user; the identification information may refer to identity information of the target user, such as name, gender, age, title, and other personal information.
  • the target voice feature information extracted by the collecting unit 31 refers to the voiceprint feature information of the target voice. It is understandable that the vocal organs used in speech, such as the tongue, teeth, larynx, lungs, and nasal cavity, vary greatly in size and shape from person to person, so each person's voiceprint is different. Voiceprint feature information is therefore a feature unique to each person, just as each person has a unique fingerprint. Further, the voiceprint feature information can be represented by Mel-frequency cepstral coefficients (MFCC).
  • the input unit 32 is used to input target voice feature information into a preset model to obtain target voiceprint parameters.
  • the input unit 32 inputs the target voice feature information of the target user's voice information into the preset model to obtain the adjusted voice feature information corresponding to the voice information.
  • since the preset model contains the common voice feature information corresponding to each target user, the input unit 32 can determine the corresponding target voiceprint parameter according to the adjusted voice feature information and the common voice feature information.
  • the acquiring unit 33 is configured to acquire the voice information to be recognized in the playing video, and extract the first voiceprint parameter of the voice information to be recognized.
  • the method for acquiring the voice information to be recognized in the playing video in the obtaining unit 33 may include obtaining the voice information to be recognized in the video being played or the live video in real time, or obtaining the voice information to be recognized in the locally stored video.
  • the step of the acquiring unit 33 extracting the first voiceprint parameter of the voice information to be recognized is the same as the step of acquiring the target voiceprint parameter through the input unit 32.
  • the matching unit 34 is configured to match the first voiceprint parameter with the target voiceprint parameters, obtain identification information of the matched target voiceprint parameter according to the matching result, and mark the identification information into the playing video.
  • the matching unit 34 matches the first voiceprint parameter with the target voiceprint parameters and obtains a corresponding matching result; according to the matching result, the target voiceprint parameter matching the first voiceprint parameter can be determined. Since each target voiceprint parameter corresponds to the information of a target user, that is, the target voiceprint parameter includes the identification information of the corresponding target user, the information of the target user corresponding to the first voiceprint parameter can be confirmed from the matched target voiceprint parameter.
  • the voice information processing device 300 may further include a training unit 35, configured to train background data through a preset algorithm to generate a preset model containing common voice feature information corresponding to each target user, the background data including the voice information of each target user.
  • the input unit 32 may include: an input subunit 321 for inputting the target voice feature information into a preset model to obtain target difference feature information corresponding to the common voice feature information; and a determining subunit 322 for inputting the target voice feature information according to The target difference feature information determines the second voiceprint parameter; the processing subunit 323 is configured to perform channel compensation on the second voiceprint parameter to obtain the corresponding target voiceprint parameter.
  • the matching unit 34 may include: a matching sub-unit 341, configured to match the first voiceprint parameter with the target voiceprint parameter to generate a corresponding matching value; and the obtaining sub-unit 342, configured to use when the matching value is greater than a preset When the threshold is used, the identification information of the matched target voiceprint parameter is obtained.
  • the generating subunit 343 is configured to input the voice information to be recognized into a voice recognition model to generate corresponding text information; the combining subunit 344 is configured to combine the identification information with the text information to generate subtitle information corresponding to the voice information to be recognized; and the marking subunit 345 is configured to mark the subtitle information into the playing video.
  • the embodiment of the present application also provides an electronic device.
  • the electronic device 500 includes a processor 501 and a memory 502.
  • the processor 501 is electrically connected to the memory 502.
  • the processor 501 is the control center of the electronic device 500. It uses various interfaces and lines to connect the parts of the electronic device, and performs the functions of the electronic device 500 and processes data by running or loading the computer programs stored in the memory 502 and calling the data stored in the memory 502, thereby monitoring the electronic device 500 as a whole.
  • the memory 502 can be used to store software programs and modules.
  • the processor 501 executes various functional applications and data processing by running the computer programs and modules stored in the memory 502.
  • the memory 502 may mainly include a storage program area and a storage data area.
  • the storage program area may store an operating system, a computer program required for at least one function (such as a sound playback function or an image playback function), and the like; the storage data area may store data created according to the use of the electronic device, and the like.
  • the memory 502 may include a high-speed random access memory, and may also include a non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other volatile solid-state storage device.
  • the memory 502 may further include a memory controller to provide the processor 501 with access to the memory 502.
  • the processor 501 in the electronic device 500 loads the instructions corresponding to the processes of one or more computer programs into the memory 502 and runs the computer programs stored in the memory 502, thereby implementing various functions, as follows: collecting voice information of a target user and extracting target voice feature information of the voice information; inputting the target voice feature information into a preset model to obtain target voiceprint parameters; obtaining voice information to be recognized in a playing video and extracting a first voiceprint parameter of the voice information to be recognized; and matching the first voiceprint parameter with the target voiceprint parameters, obtaining the identification information of the matched target voiceprint parameter according to the matching result, and marking the identification information into the playing video.
  • in some embodiments, before inputting the target voice feature information into the preset model, the processor 501 may further perform the following step: training background data through a preset algorithm to generate a preset model containing common voice feature information corresponding to each target user, the background data including the voice information of each target user.
  • in some embodiments, when inputting the target voice feature information into the preset model to obtain the target voiceprint parameters, the processor 501 may specifically perform the following steps: inputting the target voice feature information into the preset model to obtain target difference feature information corresponding to the common voice feature information; determining a second voiceprint parameter according to the target difference feature information; and performing channel compensation on the second voiceprint parameter by linear discriminant analysis to obtain the corresponding target voiceprint parameter.
  • in some embodiments, when matching the first voiceprint parameter with the target voiceprint parameters and obtaining the identification information of the matched target voiceprint parameter according to the matching result, the processor 501 may specifically perform the following steps: matching the first voiceprint parameter with the target voiceprint parameters to generate corresponding matching values; and when a matching value is greater than a preset threshold, obtaining the identification information of the matched target voiceprint parameter.
  • in some embodiments, when obtaining the identification information of the matched target voiceprint parameter, the processor 501 may specifically perform the following steps: sorting the matching values, obtaining the largest matching value among the matching values greater than the preset threshold, and obtaining the matched target voiceprint parameter according to the largest matching value; and obtaining the corresponding identification information according to the target voiceprint parameter.
  • in some embodiments, when marking the identification information into the playing video, the processor 501 may specifically perform the following steps: inputting the voice information to be recognized into a voice recognition model to generate corresponding text information; combining the identification information with the text information to generate subtitle information corresponding to the voice information to be recognized; and marking the subtitle information into the playing video.
  • the electronic device 500 may further include: a display 503, a radio frequency circuit 504, an audio circuit 505, and a power supply 506.
  • the display 503, the radio frequency circuit 504, the audio circuit 505, and the power supply 506 are electrically connected to the processor 501, respectively.
  • the display 503 may be used to display information input by the user or information provided to the user, and various graphical user interfaces. These graphical user interfaces may be composed of graphics, text, icons, videos, and any combination thereof.
  • the display 503 may include a display panel, and in some embodiments, the display panel may be configured in the form of a liquid crystal display (LCD) or an organic light-emitting diode (OLED).
  • the radio frequency circuit 504 can be used to send and receive radio frequency signals to establish wireless communication with network equipment or other electronic equipment through wireless communication, and to send and receive signals with the network equipment or other electronic equipment.
  • the audio circuit 505 can be used to provide an audio interface between the user and the electronic device through a speaker or a microphone.
  • the power supply 506 can be used to power various components of the electronic device 500.
  • the power supply 506 may be logically connected to the processor 501 through a power management system, so that functions such as charging, discharging, and power consumption management can be managed through the power management system.
  • the electronic device 500 may also include a camera, a Bluetooth module, etc., which will not be repeated here.
  • the embodiments of the present application also provide a storage medium that stores a computer program; when the computer program runs on a computer, the computer is caused to execute the voice information processing method in any of the above embodiments, for example: collecting voice information of a target user and extracting target voice feature information of the voice information; inputting the target voice feature information into a preset model to obtain target voiceprint parameters; obtaining the voice information to be recognized in a playing video and extracting a first voiceprint parameter of the voice information to be recognized; and matching the first voiceprint parameter with the target voiceprint parameters, obtaining the identification information of the matched target voiceprint parameter according to the matching result, and marking the identification information into the playing video.
  • the storage medium may be a magnetic disk, an optical disc, a read only memory (Read Only Memory, ROM), or a random access memory (Random Access Memory, RAM), etc.
  • for the voice information processing method of the embodiments of the present application, a person of ordinary skill in the art can understand that all or part of the flow of the method may be completed by a computer program controlling the relevant hardware.
  • the computer program can be stored in a computer-readable storage medium, for example in the memory of an electronic device, and executed by at least one processor in the electronic device, and the execution process may include the flow of the embodiments of the voice information processing method.
  • the storage medium can be a magnetic disk, an optical disc, a read-only memory, a random access memory, or the like.
  • for the voice information processing device of the embodiments of the present application, its functional modules may be integrated in one processing chip, or each module may exist physically alone, or two or more modules may be integrated in one module.
  • the above-mentioned integrated modules can be implemented in the form of hardware or software functional modules. If the integrated module is implemented in the form of a software function module and sold or used as an independent product, it can also be stored in a computer readable storage medium, such as a read-only memory, a magnetic disk, or an optical disk.

Abstract

This embodiment discloses a voice information processing method. The method includes: collecting voice information and extracting target voice feature information, which is input into a preset model to obtain target voiceprint parameters; obtaining voice to be recognized and extracting a first voiceprint parameter of the voice to be recognized; matching the first voiceprint parameter with the target voiceprint parameters; and obtaining identification information according to the matching result and marking the identification information into the playing video. This improves the accuracy of voice information processing.

Description

Voice information processing method and apparatus, storage medium, and electronic device
Technical Field
The present invention relates to the field of voice processing, and in particular to a voice information processing method and apparatus, a storage medium, and an electronic device.
Background
With the development of information technology, the data used by users has long ceased to be limited to text and pictures, and video has become a main medium of information transmission.
At present, to help users better understand the content of a video, adding subtitles to the video by means of speech synthesis technology has become a routine choice, and adding subtitles also speeds up the sharing of videos across different languages. However, the subtitles currently added carry only the text content of the speech, so that in some videos it is difficult to judge the identity of the speaker from the text alone, which affects the user's understanding of the video content.
Summary
The voice information processing method and apparatus, storage medium, and electronic device provided in the embodiments of the present application can improve the accuracy of voice information processing.
In a first aspect, an embodiment of the present application provides a voice information processing method, including:
collecting voice information of a target user, and extracting target voice feature information of the voice information;
inputting the target voice feature information into a preset model to obtain target voiceprint parameters;
obtaining voice information to be recognized in a playing video, and extracting a first voiceprint parameter of the voice information to be recognized; and
matching the first voiceprint parameter with the target voiceprint parameters, obtaining identification information of the matched target voiceprint parameter according to the matching result, and marking the identification information into the playing video.
In a second aspect, an embodiment of the present application provides a voice information processing apparatus, including:
a collecting unit, configured to collect voice information of a target user and extract target voice feature information of the voice information;
an input unit, configured to input the target voice feature information into a preset model to obtain target voiceprint parameters;
an obtaining unit, configured to obtain voice information to be recognized in a playing video and extract a first voiceprint parameter of the voice information to be recognized; and
a matching unit, configured to match the first voiceprint parameter with the target voiceprint parameters, obtain identification information of the matched target voiceprint parameter according to the matching result, and mark the identification information into the playing video.
In a third aspect, an embodiment of the present application provides a storage medium on which a computer program is stored; when the computer program runs on a computer, the computer is caused to execute the voice information processing method provided in any embodiment of the present application.
In a fourth aspect, an embodiment of the present application provides an electronic device, including a processor and a memory, the memory storing a computer program, and the processor being configured to execute, by calling the computer program, the steps of:
collecting voice information of a target user, and extracting target voice feature information of the voice information;
inputting the target voice feature information into a preset model to obtain target voiceprint parameters;
obtaining voice information to be recognized in a playing video, and extracting a first voiceprint parameter of the voice information to be recognized; and
matching the first voiceprint parameter with the target voiceprint parameters, obtaining identification information of the matched target voiceprint parameter according to the matching result, and marking the identification information into the playing video.
Brief Description of the Drawings
The technical solution and other beneficial effects of the present application will be made apparent by the following detailed description of specific embodiments of the present application in conjunction with the accompanying drawings.
FIG. 1 is a schematic flowchart of a voice information processing method provided by an embodiment of the present application.
FIG. 2 is another schematic flowchart of the voice information processing method provided by an embodiment of the present application.
FIG. 3 is a schematic block diagram of a voice information processing apparatus provided by an embodiment of the present application.
FIG. 4 is another schematic block diagram of the voice information processing apparatus provided by an embodiment of the present application.
FIG. 5 is a schematic structural diagram of an electronic device provided by an embodiment of the present application.
FIG. 6 is another schematic structural diagram of the electronic device provided by an embodiment of the present application.
Detailed Description
Reference is made to the drawings, in which the same reference numerals represent the same components. The principles of the present application are illustrated by implementation in a suitable computing environment. The following description is based on the illustrated specific embodiments of the present application and should not be regarded as limiting other specific embodiments of the present application that are not detailed herein.
The term "module" used herein may be regarded as a software object executed on the computing system. The different components, modules, engines, and services described herein may be regarded as objects implemented on the computing system. The apparatus and method described herein are preferably implemented in software, but may of course also be implemented in hardware, both of which fall within the protection scope of the present application.
An embodiment of the present application provides a voice information processing method. The execution subject of the voice information processing method may be the voice information processing apparatus provided in the embodiments of the present application, or an electronic device integrating the voice information processing apparatus, where the voice information processing apparatus may be implemented in hardware or software. The electronic device may be a smartphone, a tablet computer, a personal digital assistant (PDA), or the like.
A specific analysis and description is given below.
An embodiment of the present invention provides a video voice processing method, including:
collecting voice information of a target user, and extracting target voice feature information of the voice information;
inputting the target voice feature information into a preset model to obtain target voiceprint parameters;
obtaining voice information to be recognized in a playing video, and extracting a first voiceprint parameter of the voice information to be recognized; and
matching the first voiceprint parameter with the target voiceprint parameters, obtaining identification information of the matched target voiceprint parameter according to the matching result, and marking the identification information into the playing video.
In one implementation, before the step of inputting the target voice feature information into the preset model, the method may further include: training background data through a preset algorithm to generate a preset model containing common voice feature information corresponding to each target user, the background data including the voice information of each target user.
In one implementation, the step of inputting the target voice feature information into the preset model to obtain the target voiceprint parameters may include: inputting the target voice feature information into the preset model to obtain target difference feature information corresponding to the common voice feature information; determining a second voiceprint parameter according to the target difference feature information; and performing channel compensation on the second voiceprint parameter to obtain the corresponding target voiceprint parameter.
In one implementation, the step of performing channel compensation on the second voiceprint parameter may include: performing channel compensation on the second voiceprint parameter by linear discriminant analysis.
In one implementation, the step of matching the first voiceprint parameter with the target voiceprint parameters and obtaining the identification information of the matched target voiceprint parameter according to the matching result may include: matching the first voiceprint parameter with the target voiceprint parameters to generate corresponding matching values; and when a matching value is greater than a preset threshold, obtaining the identification information of the matched target voiceprint parameter.
In one implementation, the step of obtaining the identification information of the matched target voiceprint parameter may include: sorting the matching values, obtaining the largest matching value among the matching values greater than the preset threshold, obtaining the matched target voiceprint parameter according to the largest matching value, and obtaining the corresponding identification information according to the target voiceprint parameter.
In one implementation, the step of marking the identification information into the playing video includes: inputting the voice information to be recognized into a voice recognition model to generate corresponding text information; combining the identification information with the text information to generate subtitle information corresponding to the voice information to be recognized; and marking the subtitle information into the playing video.
An embodiment of the present application provides a voice information processing method. As shown in FIG. 1, FIG. 1 is a schematic flowchart of the voice information processing method provided by an embodiment of the present application, and the method may include the following steps:
In step S101, voice information of a target user is collected, and target voice feature information of the voice information is extracted.
The target user may refer to a main speaker in the video. It is understandable that in interviews, movies, variety shows, and other types of video, the speech is in most cases concentrated in a limited number of roles. For example, in an interview video the target users are the host and the interview guests; in a movie or TV series the target users are the actors with larger roles; and in a music video (MV) of an idol group the target users are all members of the group.
The voice information of the target user refers to annotated voice information, so the voice information of the target user contains identification information of the target user. Further, the identification information may refer to identity information of the target user, such as name, gender, age, title, and other personal information. Meanwhile, the target voice feature information refers to the voiceprint feature information of the target voice. It is understandable that, because the vocal organs used in speech, such as the tongue, teeth, larynx, lungs, and nasal cavity, vary greatly in size and shape from person to person, each person's voiceprint is different. Voiceprint feature information is therefore a feature unique to each person, just as each person has a unique fingerprint. Further, the target voice feature information includes Mel-frequency cepstral coefficients (MFCC) of the target voice information.
In some embodiments, to ensure the stability of the target voice feature information, the voice information may be de-silenced and de-noised to generate processed voice information; the target voice feature information is extracted from the processed voice information and processed with feature mean-variance normalization and feature warping.
In step S102, the target voice feature information is input into a preset model to obtain target voiceprint parameters.
The preset model may refer to a Universal Background Model (UBM). The target voice feature information, that is, the target voiceprint features, is input into the UBM to obtain target voiceprint parameters containing the identification information of the target user, where different target voiceprint parameters correspond to the identification information of different target users; that is, the target voiceprint parameter of each piece of voice information can determine the target user of that piece of voice information. Meanwhile, if different voice segments output the same target voiceprint parameter, the speakers of those segments can be regarded as the same user. In addition, the process of inputting the target voice feature information into the preset model to obtain the target voiceprint parameters is the process of establishing a voiceprint model from the target voiceprint parameters. It is understandable that different target voiceprint parameters correspond to the voiceprint models of different target users.
In some embodiments, before the step of inputting the target voice feature information into the preset model, the method may further include: training background data through a preset algorithm to generate a preset model containing common voice feature information corresponding to each target user, the background data including the voice information of each target user.
In this case, the step of inputting the target voice feature information into the preset model to obtain the target voiceprint parameters may include:
(1) inputting the target voice feature information into the preset model to obtain target difference feature information corresponding to the common voice feature information;
(2) determining a second voiceprint parameter according to the target difference feature information;
(3) performing channel compensation on the second voiceprint parameter to obtain the corresponding target voiceprint parameter.
The preset algorithm may be the EM algorithm. The background data, that is, the target voice feature information in the background data, is trained through the EM algorithm to generate the universal background model, and the common voice feature information corresponding to each target user is obtained through the UBM; the common voice feature information is the common voiceprint feature obtained from all target users.
Further, the target voice feature information is input into the UBM, the target difference feature information corresponding to the common voice feature information is calculated from the target voice feature information and the common voice feature information, and the second voiceprint parameter corresponding to each piece of voice information is determined according to the target difference feature information, where the second voiceprint parameter contains the identification information of the target user. It is understandable that, because of the uniqueness of voiceprints, the target voice feature information of different target users differs, so the target difference feature information obtained relative to the common voice feature information amplifies the difference between pieces of target voice feature information; compared with the target voice feature information itself, the target difference feature information therefore determines the target user of each piece of voice information more accurately.
In addition, because the voice information in the background data and the voice information to be recognized are collected over different transmission channels, there is a large channel difference, which degrades recognition performance and affects the recognition rate. Channel compensation is therefore performed on the second voiceprint parameter so that intra-class differences are minimized and inter-class differences are maximized, yielding an easily distinguishable, low-dimensional target voiceprint parameter.
In step S103, the voice information to be recognized in the playing video is obtained, and the first voiceprint parameter of the voice information to be recognized is extracted.
The manner of obtaining the voice information to be recognized in the playing video may include obtaining in real time the voice information to be recognized of a video being played or of a live video, or obtaining the voice information to be recognized of a locally stored video. In addition, the method of extracting the first voiceprint parameter of the voice information to be recognized is the same as the above process of extracting the target voiceprint parameter: the first voice feature information is extracted from the voice information to be recognized and input into the preset model, the first difference feature information corresponding to the common voice feature information is calculated from the common voice feature information corresponding to each target user in the preset model and the first voice feature information, and the first voiceprint parameter corresponding to the voice information to be recognized is determined from the first difference feature information.
In step S104, the first voiceprint parameter is matched with the target voiceprint parameters, the identification information of the matched target voiceprint parameter is obtained according to the matching result, and the identification information is marked into the playing video.
The first voiceprint parameter is matched with the target voiceprint parameters and a corresponding matching result is obtained; the target voiceprint parameter matching the first voiceprint parameter can be determined according to the matching result. Since each target voiceprint parameter corresponds to the information of a target user, that is, the target voiceprint parameter contains the identification information of the corresponding target user, the information of the target user corresponding to the first voiceprint parameter can be confirmed from the matched target voiceprint parameter.
In some embodiments, the step of matching the first voiceprint parameter with the target voiceprint parameters and obtaining the identification information of the matched target voiceprint parameter according to the matching result may include:
(1) matching the first voiceprint parameter with the target voiceprint parameters to generate corresponding matching values;
(2) when a matching value is greater than a preset threshold, obtaining the identification information of the matched target voiceprint parameter.
When the matching value is greater than the preset threshold, the first voiceprint parameter is highly similar to the matched target voiceprint parameter, and the speaker of the first voiceprint parameter and the target user corresponding to the matched target voiceprint parameter can be regarded as the same user, so the identification information of the matched target voiceprint parameter can be obtained as the identification information of the voice to be recognized corresponding to the first voiceprint parameter.
In addition, in some embodiments, the step of marking the identification information into the playing video may include:
(1.1) inputting the voice information to be recognized into a voice recognition model to generate corresponding text information;
(2.1) combining the identification information with the text information to generate subtitle information corresponding to the voice information to be recognized;
(3.1) marking the subtitle information into the playing video.
While the voice information to be recognized is input into the preset model to obtain the identification information, the voice to be recognized is simultaneously input into the voice recognition model to obtain the text information; the time information corresponding to the text information and to the identification information is recorded, the identification information and the text information are combined according to the time information to generate the subtitle information of the voice information to be recognized, and the subtitle information is marked into the playing video according to the time information.
In some embodiments, the subtitle information may be marked at a preset position of the playing video in a preset combination; for example, the identification information and the subtitle text are combined side by side and marked at the lower part of the playing video picture. Alternatively, the identification information in the subtitle information is marked in a special form in a first area of the playing video, and the text information is marked in a different form in a second area of the playing video; for example, the identification information is added at the upper part of the playing video picture with a font size smaller than that of the text information, and the text information is added at the lower part of the playing video picture.
As can be seen from the above, the voice information processing method provided by this embodiment collects voice information of a target user and extracts target voice feature information of the voice information; inputs the target voice feature information into a preset model to obtain target voiceprint parameters; obtains the voice information to be recognized in a playing video and extracts its first voiceprint parameter; matches the first voiceprint parameter with the target voiceprint parameters, obtains the identification information of the matched target voiceprint parameter according to the matching result, and marks the identification information into the playing video. In this way the identification information of the target user, such as identity information, can be marked into the playing video, helping the user better understand the content of the video and ensuring the user experience; meanwhile, the identification information is added to the playing video automatically through voiceprint recognition, which greatly reduces manual operation and saves labor cost.
The method described in the above embodiment is further illustrated in detail below by way of example.
Referring to FIG. 2, FIG. 2 is another schematic flowchart of the voice information processing method provided by an embodiment of the present application.
Specifically, the method includes the following steps:
In step S201, voice information of a target user is collected, and target voice feature information of the voice information is extracted.
The voice information of the target user refers to annotated voice information, so the voice information of the target user contains the identification information of the target user. Further, the identification information may refer to identity information of the target user, such as name, gender, age, title, and other personal information. In addition, the target voice feature information refers to the voiceprint feature information of the target voice; since voiceprint feature information is unique to each person, the user information corresponding to the voice information can be distinguished according to the voiceprint features.
In some embodiments, to ensure the stability of the target voice feature information, the voice information may be de-silenced and de-noised to generate processed voice information; the target voice feature information is extracted from the processed voice information and processed with feature mean-variance normalization and feature warping.
In step S202, background data is trained through a preset algorithm to generate a preset model containing common voice feature information corresponding to each target user, the background data including the voice information of each target user.
The preset algorithm may be the EM algorithm. The background data, that is, the target voice feature information in the background data, is trained through the EM algorithm to generate the universal background model, and the common voice feature information corresponding to each target user is obtained through the UBM; the common voice feature information is the common voiceprint feature obtained from all target users.
In step S203, the target voice feature information is input into the preset model to obtain target difference feature information corresponding to the common voice feature information.
The target voice feature information of each voice segment is input into the preset model, and the target difference feature information is obtained from the target voice feature information corresponding to each segment and the common voice feature information of all target users obtained in step S202.
In step S204, a second voiceprint parameter is determined according to the target difference feature information.
The target difference feature information is transformed through a Total Variability Space (TVS) based model to obtain the second voiceprint parameter, where the total-variability matrix of the total variability space can be estimated with the EM algorithm.
In step S205, channel compensation is performed on the second voiceprint parameter by linear discriminant analysis to obtain the corresponding target voiceprint parameter.
To reduce the loss of recognition accuracy caused by channel differences, linear discriminant analysis (LDA) can be used for channel compensation. It should be noted that LDA uses label information to find the optimal projection directions, so that the projected sample set has the smallest intra-class difference and the largest inter-class difference. When applied to voiceprint recognition, the voiceprint parameter vectors of the same speaker form one class; minimizing the intra-class difference reduces variation caused by the channel, and maximizing the inter-class difference enlarges the difference information between speakers, so that LDA yields easily distinguishable, low-dimensional target voiceprint parameters.
In addition, the process of obtaining the target voiceprint parameters from the target voice feature information is the process of establishing the corresponding voiceprint models, which at this point are the i-vector voiceprint models corresponding to each target user.
In step S206, the voice information to be recognized in the playing video is obtained, and the first voiceprint parameter of the voice information to be recognized is extracted.
The voice feature information of the voice information to be recognized is extracted and input into the voiceprint model of step S205, and the corresponding difference feature information is obtained according to the common voice feature information in the UBM: the first voice feature information is input into the preset model, the first difference feature information corresponding to the common voice feature information is calculated from the common voice feature information corresponding to each target user in the preset model and the first voice feature information, the first voiceprint parameter corresponding to the voice information to be recognized is determined from the first difference feature information, and channel compensation is performed on the first voiceprint parameter to obtain the processed first voiceprint parameter. The step of extracting the first voiceprint parameter of the voice information to be recognized is the same as the above step of extracting the target voiceprint parameter and is not repeated here.
In step S207, the first voiceprint parameter is matched with the target voiceprint parameters to generate corresponding matching values.
The first voiceprint parameter is matched for similarity against the target voiceprint parameter of each target user to generate the corresponding matching values.
In step S208, when a matching value is greater than a preset threshold, the matching values are sorted, the largest matching value among the matching values greater than the preset threshold is obtained, and the matched target voiceprint parameter is obtained according to the largest matching value.
When a matching value is greater than the preset threshold, for example 0.8, the first voiceprint parameter is successfully matched with the corresponding target voiceprint parameter, and the target user corresponding to the first voiceprint parameter and the target user corresponding to that target voiceprint parameter are most likely the same user. If multiple matching values are greater than the preset threshold, the matching values greater than the preset threshold are sorted to obtain the largest one; the target user corresponding to the first voiceprint parameter and the target user corresponding to the target voiceprint parameter with the largest matching value are then regarded as the same person with high probability, and the target voiceprint parameter corresponding to the largest matching value is obtained.
In addition, in some embodiments, when all matching values are smaller than the preset threshold, the first voiceprint parameter does not match any target voiceprint parameter, that is, the speaker of the voice to be recognized does not match any target user in the model, and the voiceprint model outputs a no-match result.
In step S209, the corresponding identification information is obtained according to the target voiceprint parameter.
Since the target voiceprint parameter contains the identification information of the target user, the corresponding identification information can be obtained from the successfully matched target voiceprint parameter.
In step S210, the voice information to be recognized is input into a voice recognition model to generate corresponding text information.
While the voice information to be recognized is input into the voiceprint model to obtain the identification information, it is simultaneously input into the voice recognition model to obtain the text information.
In step S211, the identification information is combined with the text information to generate subtitle information corresponding to the voice information to be recognized.
When the identification information and the text information are obtained, the time information corresponding to the text information and to the identification information is recorded, and the identification information and the text information are combined according to the time information to generate the subtitle information of the voice information to be recognized.
In step S212, the subtitle information is marked into the playing video.
The subtitle information is marked into a preset area of the playing video according to the time information, so that the subtitle information is synchronized with the voice information in the playing video.
As can be seen from the above, the voice information processing method provided by this embodiment collects voice information of a target user and extracts target voice feature information of the voice information; inputs the target voice feature information into the preset model to obtain target voiceprint parameters; obtains the voice information to be recognized in the playing video and extracts its first voiceprint parameter; matches the first voiceprint parameter with the target voiceprint parameters, obtains the identification information of the matched target voiceprint parameter according to the matching result, and marks the identification information into the playing video. In this way the identification information of the target user, such as identity information, can be marked into the playing video, helping the user better understand the content of the video and ensuring the user experience. In addition, using voice recognition and voiceprint recognition to automatically add subtitle information to the video greatly reduces manual annotation and saves labor cost.
To facilitate better implementation of the voice information processing method provided by the embodiments of the present application, an embodiment of the present application further provides an apparatus based on the above voice information processing method. The meanings of the terms are the same as in the above voice information processing method, and for specific implementation details reference may be made to the description in the method embodiments.
An embodiment of the present invention provides a video voice processing apparatus, including:
a collecting unit, configured to collect voice information of a target user and extract target voice feature information of the voice information;
an input unit, configured to input the target voice feature information into a preset model to obtain target voiceprint parameters;
an obtaining unit, configured to obtain voice information to be recognized in a playing video and extract a first voiceprint parameter of the voice information to be recognized; and
a matching unit, configured to match the first voiceprint parameter with the target voiceprint parameters, obtain identification information of the matched target voiceprint parameter according to the matching result, and mark the identification information into the playing video.
In one implementation, the apparatus may further include: a training unit, configured to train background data through a preset algorithm to generate a preset model containing common voice feature information corresponding to each target user, the background data including the voice information of each target user.
In one implementation, the input unit may include: an input subunit, configured to input the target voice feature information into the preset model to obtain target difference feature information corresponding to the common voice feature information; a determining subunit, configured to determine a second voiceprint parameter according to the target difference feature information; and a processing subunit, configured to perform channel compensation on the second voiceprint parameter to obtain the corresponding target voiceprint parameter.
In one implementation, the matching unit may include: a matching subunit, configured to match the first voiceprint parameter with the target voiceprint parameters to generate corresponding matching values; and an obtaining subunit, configured to obtain the identification information of the matched target voiceprint parameter when a matching value is greater than a preset threshold.
In one implementation, the matching unit may further include: a generating subunit, configured to input the voice information to be recognized into a voice recognition model to generate corresponding text information; a combining subunit, configured to combine the identification information with the text information to generate subtitle information corresponding to the voice information to be recognized; and a marking subunit, configured to mark the subtitle information into the playing video.
Referring to FIG. 3, FIG. 3 is a schematic block diagram of the voice information processing apparatus provided by an embodiment of the present application. Specifically, the voice information processing apparatus 300 includes: a collecting unit 31, an input unit 32, an obtaining unit 33, and a matching unit 34.
The collecting unit 31 is configured to collect voice information of a target user and extract target voice feature information of the voice information.
The voice information of the target user collected by the collecting unit 31 refers to annotated voice information, so the voice information of the target user contains the identification information of the target user. Further, the identification information may refer to identity information of the target user, such as name, gender, age, title, and other personal information.
In addition, the target voice feature information extracted by the collecting unit 31 refers to the voiceprint feature information of the target voice. It is understandable that, because the vocal organs used in speech, such as the tongue, teeth, larynx, lungs, and nasal cavity, vary greatly in size and shape from person to person, each person's voiceprint is different. Voiceprint feature information is therefore a feature unique to each person, just as each person has a unique fingerprint. Further, the voiceprint feature information can be represented by Mel-frequency cepstral coefficients (MFCC).
The input unit 32 is configured to input the target voice feature information into a preset model to obtain target voiceprint parameters.
The input unit 32 inputs the target voice feature information of the target user's voice information into the preset model to obtain the adjusted voice feature information corresponding to the voice information; in addition, since the preset model contains the common voice feature information corresponding to each target user, the input unit 32 can determine the corresponding target voiceprint parameter according to the adjusted voice feature information and the common voice feature information.
The obtaining unit 33 is configured to obtain the voice information to be recognized in the playing video and extract the first voiceprint parameter of the voice information to be recognized.
The manner in which the obtaining unit 33 obtains the voice information to be recognized in the playing video may include obtaining in real time the voice information to be recognized of a video being played or of a live video, or obtaining the voice information to be recognized of a locally stored video. In addition, the step in which the obtaining unit 33 extracts the first voiceprint parameter of the voice information to be recognized is the same as the step of obtaining the target voiceprint parameter through the input unit 32.
The matching unit 34 is configured to match the first voiceprint parameter with the target voiceprint parameters, obtain the identification information of the matched target voiceprint parameter according to the matching result, and mark the identification information into the playing video.
The matching unit 34 matches the first voiceprint parameter with the target voiceprint parameters and obtains a corresponding matching result; the target voiceprint parameter matching the first voiceprint parameter can be determined according to the matching result. Since each target voiceprint parameter corresponds to the information of a target user, that is, the target voiceprint parameter contains the identification information of the corresponding target user, the information of the target user corresponding to the first voiceprint parameter can be confirmed from the matched target voiceprint parameter.
Reference may also be made to FIG. 4, which is another schematic block diagram of the voice information processing apparatus provided by an embodiment of the present application. The voice information processing apparatus 300 may further include: a training unit 35, configured to train background data through a preset algorithm to generate a preset model containing common voice feature information corresponding to each target user, the background data including the voice information of each target user.
The input unit 32 may include: an input subunit 321, configured to input the target voice feature information into the preset model to obtain target difference feature information corresponding to the common voice feature information; a determining subunit 322, configured to determine a second voiceprint parameter according to the target difference feature information; and a processing subunit 323, configured to perform channel compensation on the second voiceprint parameter to obtain the corresponding target voiceprint parameter.
The matching unit 34 may include: a matching subunit 341, configured to match the first voiceprint parameter with the target voiceprint parameters to generate corresponding matching values; an obtaining subunit 342, configured to obtain the identification information of the matched target voiceprint parameter when a matching value is greater than a preset threshold; a generating subunit 343, configured to input the voice information to be recognized into a voice recognition model to generate corresponding text information; a combining subunit 344, configured to combine the identification information with the text information to generate subtitle information corresponding to the voice information to be recognized; and a marking subunit 345, configured to mark the subtitle information into the playing video.
An embodiment of the present application further provides an electronic device. Referring to FIG. 5, the electronic device 500 includes a processor 501 and a memory 502, the processor 501 being electrically connected to the memory 502.
The processor 501 is the control center of the electronic device 500. It connects all parts of the electronic device through various interfaces and lines, and executes the various functions of the electronic device 500 and processes data by running or loading the computer programs stored in the memory 502 and calling the data stored in the memory 502, thereby monitoring the electronic device 500 as a whole.
The memory 502 may be used to store software programs and modules; the processor 501 executes various functional applications and data processing by running the computer programs and modules stored in the memory 502. The memory 502 may mainly include a program storage area and a data storage area, where the program storage area may store an operating system, computer programs required by at least one function (such as a sound playback function or an image playback function) and the like, and the data storage area may store data created according to the use of the electronic device. In addition, the memory 502 may include a high-speed random access memory and may also include a non-volatile memory, such as at least one magnetic disk storage device, flash memory device or other non-volatile solid-state storage device. Accordingly, the memory 502 may further include a memory controller to provide the processor 501 with access to the memory 502.
In the embodiments of the present application, the processor 501 in the electronic device 500 loads the instructions corresponding to the processes of one or more computer programs into the memory 502 according to the following steps, and runs the computer programs stored in the memory 502, thereby implementing various functions as follows:
collecting voice information of a target user, and extracting target voice feature information of the voice information;
inputting the target voice feature information into a preset model to obtain target voiceprint parameters;
acquiring the voice information to be recognized in the playing video, and extracting the first voiceprint parameter of the voice information to be recognized;
matching the first voiceprint parameter with the target voiceprint parameters, obtaining the identification information of the matched target voiceprint parameter according to the matching result, and identifying the identification information in the playing video.
In some embodiments, before inputting the target voice feature information into the preset model, the processor 501 may further perform the following step (a hedged sketch follows below):
training background data through a preset algorithm to generate a preset model containing the common voice feature information corresponding to each target user, the background data including the voice information of each target user.
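The patent leaves the "preset algorithm" open; one conventional realization of the common voice feature information is a GMM universal background model trained on the pooled background data. The sketch below assumes scikit-learn and MFCC inputs, both of which are illustrative choices rather than requirements of the patent.

    from typing import List
    import numpy as np
    from sklearn.mixture import GaussianMixture

    def train_ubm(background_features: List[np.ndarray], n_components: int = 64) -> GaussianMixture:
        """background_features: one (num_frames, feat_dim) matrix per labelled recording."""
        pooled = np.vstack(background_features)
        ubm = GaussianMixture(n_components=n_components, covariance_type="diag",
                              max_iter=200, random_state=0)
        ubm.fit(pooled)   # fitted means/covariances act as the common voice feature information
        return ubm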
In some embodiments, when inputting the target voice feature information into the preset model to obtain the target voiceprint parameters, the processor 501 may specifically perform the following steps:
inputting the target voice feature information into the preset model to obtain the target difference feature information corresponding to the common voice feature information;
determining a second voiceprint parameter according to the target difference feature information;
performing channel compensation on the second voiceprint parameter by means of linear discriminant analysis to obtain the corresponding target voiceprint parameter (a minimal sketch of such compensation follows below).
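Below is a minimal sketch of channel compensation with linear discriminant analysis, assuming a development set of second voiceprint parameters with one speaker label per vector is available to fit the projection; the variable names and the use of scikit-learn are assumptions.

    import numpy as np
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

    def fit_channel_compensation(second_vps: np.ndarray,
                                 speaker_labels: np.ndarray) -> LinearDiscriminantAnalysis:
        """second_vps: (num_utterances, dim) second voiceprint parameters."""
        n_components = min(len(set(speaker_labels)) - 1, second_vps.shape[1])
        lda = LinearDiscriminantAnalysis(n_components=n_components)
        lda.fit(second_vps, speaker_labels)
        return lda

    def compensate(lda: LinearDiscriminantAnalysis, second_vp: np.ndarray) -> np.ndarray:
        """Project one second voiceprint parameter into the compensated (target) space."""
        return lda.transform(second_vp.reshape(1, -1)).ravel()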
In some embodiments, when matching the first voiceprint parameter with the target voiceprint parameters and obtaining the identification information of the matched target voiceprint parameter according to the matching result, the processor 501 may specifically perform the following steps:
matching the first voiceprint parameter with the target voiceprint parameters to generate corresponding matching values;
when a matching value is greater than the preset threshold, obtaining the identification information of the matched target voiceprint parameter.
In some embodiments, when obtaining the identification information of the matched target voiceprint parameter, the processor 501 may specifically perform the following steps:
sorting the matching values, obtaining the largest matching value among those greater than the preset threshold, and obtaining the matched target voiceprint parameter according to the largest matching value;
obtaining the corresponding identification information according to the target voiceprint parameter.
In some embodiments, when identifying the identification information in the playing video, the processor 501 may specifically perform the following steps:
inputting the voice information to be recognized into a speech recognition model to generate corresponding text information;
combining the identification information with the text information to generate subtitle information corresponding to the voice information to be recognized;
identifying the subtitle information in the playing video.
Referring also to FIG. 6, in some embodiments the electronic device 500 may further include a display 503, a radio frequency circuit 504, an audio circuit 505 and a power supply 506, each electrically connected to the processor 501.
The display 503 may be used to display information entered by the user or provided to the user, as well as various graphical user interfaces, which may be composed of graphics, text, icons, video and any combination thereof. The display 503 may include a display panel which, in some embodiments, may be configured as a liquid crystal display (LCD) or an organic light-emitting diode (OLED) display.
The radio frequency circuit 504 may be used to send and receive radio frequency signals so as to establish wireless communication with network devices or other electronic devices, and to exchange signals with them.
The audio circuit 505 may be used to provide an audio interface between the user and the electronic device through a speaker and a microphone.
The power supply 506 may be used to supply power to the components of the electronic device 500. In some embodiments, the power supply 506 may be logically connected to the processor 501 through a power management system, so that charging, discharging, power consumption management and other functions are implemented through the power management system.
Although not shown in FIG. 6, the electronic device 500 may further include a camera, a Bluetooth module and the like, which are not described here.
An embodiment of the present application further provides a storage medium storing a computer program which, when run on a computer, causes the computer to execute the voice information processing method of any of the above embodiments, for example: collecting voice information of a target user and extracting target voice feature information of the voice information; inputting the target voice feature information into a preset model to obtain target voiceprint parameters; acquiring the voice information to be recognized in the playing video and extracting the first voiceprint parameter of the voice information to be recognized; matching the first voiceprint parameter with the target voiceprint parameters, obtaining the identification information of the matched target voiceprint parameter according to the matching result, and identifying the identification information in the playing video.
In the embodiments of the present application, the storage medium may be a magnetic disk, an optical disc, a read-only memory (ROM), a random access memory (RAM) or the like.
In the above embodiments, the description of each embodiment has its own emphasis; for any part not described in detail in one embodiment, reference may be made to the related descriptions of the other embodiments.
It should be noted that, for the voice information processing method of the embodiments of the present application, a person of ordinary skill in the art can understand that all or part of the flow of implementing the method can be completed by controlling the relevant hardware through a computer program. The computer program may be stored in a computer-readable storage medium, for example in the memory of an electronic device, and executed by at least one processor of that electronic device, and the execution may include the flow of the embodiments of the voice information processing method. The storage medium may be a magnetic disk, an optical disc, a read-only memory, a random access memory or the like.
For the voice information processing device of the embodiments of the present application, its functional modules may be integrated in one processing chip, or each module may exist physically on its own, or two or more modules may be integrated in one module. The integrated module may be implemented in the form of hardware or in the form of a software functional module. If the integrated module is implemented in the form of a software functional module and sold or used as an independent product, it may also be stored in a computer-readable storage medium such as a read-only memory, a magnetic disk or an optical disc.
The voice information processing method, device, storage medium and electronic device provided by the embodiments of the present application have been described in detail above. Specific examples are used herein to explain the principles and implementations of the present application, and the description of the above embodiments is only intended to help understand the method of the present application and its core idea. At the same time, those skilled in the art will make changes to the specific implementations and the scope of application according to the idea of the present application. In summary, the content of this specification should not be construed as limiting the present application.

Claims (20)

  1. A method for processing voice information, comprising:
    collecting voice information of a target user, and extracting target voice feature information of the voice information;
    inputting the target voice feature information into a preset model to obtain target voiceprint parameters;
    acquiring voice information to be recognized in a playing video, and extracting a first voiceprint parameter of the voice information to be recognized;
    matching the first voiceprint parameter with the target voiceprint parameters, obtaining identification information of a matched target voiceprint parameter according to the matching result, and identifying the identification information in the playing video.
  2. The method according to claim 1, wherein, before the step of inputting the target voice feature information into the preset model, the method further comprises:
    training background data through a preset algorithm to generate a preset model containing common voice feature information corresponding to each target user, the background data comprising voice information of each target user.
  3. The method according to claim 2, wherein the step of inputting the target voice feature information into the preset model to obtain the target voiceprint parameters comprises:
    inputting the target voice feature information into the preset model to obtain target difference feature information corresponding to the common voice feature information;
    determining a second voiceprint parameter according to the target difference feature information;
    performing channel compensation on the second voiceprint parameter to obtain the corresponding target voiceprint parameter.
  4. The method according to claim 3, wherein the step of performing channel compensation on the second voiceprint parameter comprises:
    performing channel compensation on the second voiceprint parameter by means of linear discriminant analysis.
  5. The method according to claim 1, wherein the step of matching the first voiceprint parameter with the target voiceprint parameters and obtaining the identification information of the matched target voiceprint parameter according to the matching result comprises:
    matching the first voiceprint parameter with the target voiceprint parameters to generate corresponding matching values;
    when a matching value is greater than a preset threshold, obtaining the identification information of the matched target voiceprint parameter.
  6. The method according to claim 5, wherein the step of obtaining the identification information of the matched target voiceprint parameter comprises:
    sorting the matching values, obtaining the largest matching value among those greater than the preset threshold, and obtaining the matched target voiceprint parameter according to the largest matching value;
    obtaining the corresponding identification information according to the target voiceprint parameter.
  7. The method according to claim 1, wherein the step of identifying the identification information in the playing video comprises:
    inputting the voice information to be recognized into a speech recognition model to generate corresponding text information;
    combining the identification information with the text information to generate subtitle information corresponding to the voice information to be recognized;
    identifying the subtitle information in the playing video.
  8. A device for processing voice information, comprising:
    a collecting unit, configured to collect voice information of a target user and extract target voice feature information of the voice information;
    an input unit, configured to input the target voice feature information into a preset model to obtain target voiceprint parameters;
    an acquiring unit, configured to acquire voice information to be recognized in a playing video and extract a first voiceprint parameter of the voice information to be recognized;
    a matching unit, configured to match the first voiceprint parameter with the target voiceprint parameters, obtain identification information of a matched target voiceprint parameter according to the matching result, and identify the identification information in the playing video.
  9. The device according to claim 8, wherein the device further comprises:
    a training unit, configured to train background data through a preset algorithm to generate a preset model containing common voice feature information corresponding to each target user, the background data comprising voice information of each target user.
  10. The device according to claim 9, wherein the input unit comprises:
    an input subunit, configured to input the target voice feature information into the preset model to obtain target difference feature information corresponding to the common voice feature information;
    a determining subunit, configured to determine a second voiceprint parameter according to the target difference feature information;
    a processing subunit, configured to perform channel compensation on the second voiceprint parameter to obtain the corresponding target voiceprint parameter.
  11. The device according to claim 8, wherein the matching unit comprises:
    a matching subunit, configured to match the first voiceprint parameter with the target voiceprint parameters to generate corresponding matching values;
    an obtaining subunit, configured to obtain the identification information of the matched target voiceprint parameter when a matching value is greater than a preset threshold.
  12. The device according to claim 11, wherein the matching unit further comprises:
    a generating subunit, configured to input the voice information to be recognized into a speech recognition model to generate corresponding text information;
    a combining subunit, configured to combine the identification information with the text information to generate subtitle information corresponding to the voice information to be recognized;
    an identifying subunit, configured to identify the subtitle information in the playing video.
  13. A storage medium having a computer program stored thereon, wherein, when the computer program runs on a computer, the computer is caused to execute the method for processing voice information according to claim 1.
  14. An electronic device, comprising a processor and a memory, the memory storing a computer program, wherein the processor is configured, by calling the computer program, to perform the steps of:
    collecting voice information of a target user, and extracting target voice feature information of the voice information;
    inputting the target voice feature information into a preset model to obtain target voiceprint parameters;
    acquiring voice information to be recognized in a playing video, and extracting a first voiceprint parameter of the voice information to be recognized;
    matching the first voiceprint parameter with the target voiceprint parameters, obtaining identification information of a matched target voiceprint parameter according to the matching result, and identifying the identification information in the playing video.
  15. The electronic device according to claim 14, wherein the processor is further configured, by calling the computer program, to perform the step of:
    training background data through a preset algorithm to generate a preset model containing common voice feature information corresponding to each target user, the background data comprising voice information of each target user.
  16. The electronic device according to claim 15, wherein the processor is configured, by calling the computer program, to perform the steps of:
    inputting the target voice feature information into the preset model to obtain target difference feature information corresponding to the common voice feature information;
    determining a second voiceprint parameter according to the target difference feature information;
    performing channel compensation on the second voiceprint parameter to obtain the corresponding target voiceprint parameter.
  17. The electronic device according to claim 16, wherein the processor is configured, by calling the computer program, to perform the step of:
    performing channel compensation on the second voiceprint parameter by means of linear discriminant analysis.
  18. The electronic device according to claim 14, wherein the processor is configured, by calling the computer program, to perform the steps of:
    matching the first voiceprint parameter with the target voiceprint parameters to generate corresponding matching values;
    when a matching value is greater than a preset threshold, obtaining the identification information of the matched target voiceprint parameter.
  19. The electronic device according to claim 18, wherein the processor is configured, by calling the computer program, to perform the steps of:
    sorting the matching values, obtaining the largest matching value among those greater than the preset threshold, and obtaining the matched target voiceprint parameter according to the largest matching value;
    obtaining the corresponding identification information according to the target voiceprint parameter.
  20. The electronic device according to claim 14, wherein the processor is configured, by calling the computer program, to perform the steps of:
    inputting the voice information to be recognized into a speech recognition model to generate corresponding text information;
    combining the identification information with the text information to generate subtitle information corresponding to the voice information to be recognized;
    identifying the subtitle information in the playing video.
PCT/CN2019/073642 2019-01-29 2019-01-29 语音信息的处理方法、装置、存储介质及电子设备 WO2020154883A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201980076330.XA CN113056784A (zh) 2019-01-29 2019-01-29 语音信息的处理方法、装置、存储介质及电子设备
PCT/CN2019/073642 WO2020154883A1 (zh) 2019-01-29 2019-01-29 语音信息的处理方法、装置、存储介质及电子设备

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2019/073642 WO2020154883A1 (zh) 2019-01-29 2019-01-29 语音信息的处理方法、装置、存储介质及电子设备

Publications (1)

Publication Number Publication Date
WO2020154883A1 true WO2020154883A1 (zh) 2020-08-06

Family

ID=71841736

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/073642 WO2020154883A1 (zh) 2019-01-29 2019-01-29 语音信息的处理方法、装置、存储介质及电子设备

Country Status (2)

Country Link
CN (1) CN113056784A (zh)
WO (1) WO2020154883A1 (zh)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113596572A (zh) * 2021-07-28 2021-11-02 Oppo广东移动通信有限公司 一种语音识别方法、装置、存储介质及电子设备

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20070011012A1 (en) * 2005-07-11 2007-01-11 Steve Yurick Method, system, and apparatus for facilitating captioning of multi-media content
CN103561217A (zh) * 2013-10-14 2014-02-05 深圳创维数字技术股份有限公司 一种生成字幕的方法及终端
CN105975569A (zh) * 2016-05-03 2016-09-28 深圳市金立通信设备有限公司 一种语音处理的方法及终端

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107395352B (zh) * 2016-05-16 2019-05-07 腾讯科技(深圳)有限公司 基于声纹的身份识别方法及装置
CN106057206B (zh) * 2016-06-01 2019-05-03 腾讯科技(深圳)有限公司 声纹模型训练方法、声纹识别方法及装置
CN112399133B (zh) * 2016-09-30 2023-04-18 阿里巴巴集团控股有限公司 一种会议分享方法及装置
CN106971713B (zh) * 2017-01-18 2020-01-07 北京华控智加科技有限公司 基于密度峰值聚类和变分贝叶斯的说话人标记方法与系统
CN107221331A (zh) * 2017-06-05 2017-09-29 深圳市讯联智付网络有限公司 一种基于声纹的身份识别方法和设备
CN107357875B (zh) * 2017-07-04 2021-09-10 北京奇艺世纪科技有限公司 一种语音搜索方法、装置及电子设备


Also Published As

Publication number Publication date
CN113056784A (zh) 2021-06-29

Similar Documents

Publication Publication Date Title
US11830241B2 (en) Auto-curation and personalization of sports highlights
US10726836B2 (en) Providing audio and video feedback with character based on voice command
CN107659847B Voice interaction method and apparatus
US8442389B2 (en) Electronic apparatus, reproduction control system, reproduction control method, and program therefor
US9524282B2 (en) Data augmentation with real-time annotations
US8847884B2 (en) Electronic device and method for offering services according to user facial expressions
WO2019000991A1 Voiceprint recognition method and apparatus
JP2019212308A Video service providing method and service server using the same
US8521007B2 (en) Information processing method, information processing device, scene metadata extraction device, loss recovery information generation device, and programs
CN110602516A Information interaction method and apparatus based on live video streaming, and electronic device
CN105512348A Method and apparatus for processing video and related audio, and retrieval method and apparatus
WO2017166651A1 Voice recognition model training method, and speaker type recognition method and apparatus
WO2023197979A1 Data processing method and apparatus, computer device, and storage medium
Hoover et al. Putting a face to the voice: Fusing audio and visual signals across a video to determine speakers
WO2017181611A1 Method for searching for video in specific video library and video terminal thereof
CN112653902A Speaker recognition method and apparatus, and electronic device
CN110990534B Data processing method and apparatus, and apparatus for data processing
US9525841B2 (en) Imaging device for associating image data with shooting condition information
CN113242361B Video processing method and apparatus, and computer-readable storage medium
WO2021114808A1 Audio processing method and apparatus, electronic device, and storage medium
US20180198990A1 (en) Suggestion of visual effects based on detected sound patterns
WO2019101099A1 Video program identification method, device, terminal, system, and storage medium
US9542976B2 (en) Synchronizing videos with frame-based metadata using video content
WO2022193911A1 Instruction information acquisition method and apparatus, readable storage medium, and electronic device
JP2011164681A Character input device, character input method, character input program, and computer-readable recording medium recording the same

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application
    Ref document number: 19912650
    Country of ref document: EP
    Kind code of ref document: A1
NENP Non-entry into the national phase
    Ref country code: DE
122 Ep: pct application non-entry in european phase
    Ref document number: 19912650
    Country of ref document: EP
    Kind code of ref document: A1