CN111986680A - Method and device for evaluating spoken language of object, storage medium and electronic device


Info

Publication number
CN111986680A
Authority
CN
China
Prior art keywords
voice
target
voice data
voiceprint
target object
Prior art date
Legal status
Pending
Application number
CN202010871713.0A
Other languages
Chinese (zh)
Inventor
余浩
徐灿
鲁文斌
Current Assignee
Tianjin Hongen Perfect Future Education Technology Co ltd
Original Assignee
Tianjin Hongen Perfect Future Education Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Tianjin Hongen Perfect Future Education Technology Co ltd filed Critical Tianjin Hongen Perfect Future Education Technology Co ltd
Priority to CN202010871713.0A
Publication of CN111986680A


Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 Speaker identification or verification
    • G10L17/02 Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
    • G10L21/00 Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 Speech or voice analysis techniques specially adapted for particular use
    • G10L25/51 Speech or voice analysis techniques specially adapted for comparison or discrimination
    • G10L25/60 Speech or voice analysis techniques for measuring the quality of voice signals

Abstract

The application provides a method and a device for evaluating a spoken language of an object, a storage medium and an electronic device, wherein the method comprises the following steps: acquiring a target voiceprint feature of a target object and a first voice feature of voice data to be evaluated, wherein the voice data to be evaluated is voice data used for carrying out spoken language evaluation on the target object; determining a target voice boundary of the target object in the voice data to be evaluated according to the target voiceprint characteristic and the first voice characteristic; acquiring first voice data belonging to the target object from the voice data to be evaluated according to the target voice boundary; and carrying out spoken language evaluation on the target object by using the first voice data to obtain a target evaluation result of the target object. By the method and the device, the problems that the noise reduction mode in the voice evaluation method in the related art is high in implementation cost, complex in implementation process and incapable of effectively eliminating background voice are solved.

Description

Method and device for evaluating spoken language of object, storage medium and electronic device
Technical Field
The present application relates to the field of computers, and in particular, to a method and an apparatus for evaluating a spoken language of an object, a storage medium, and an electronic apparatus.
Background
Spoken language assessment models are often used to assess the spoken language proficiency of a speaker. Due to the influence of environmental factors and the like, the speech data used for spoken language evaluation may include environmental noise. In order to improve the accuracy of the spoken language evaluation, the speech data of the spoken language evaluation can be subjected to noise reduction treatment by adopting various noise reduction modes during the spoken language test. Common noise reduction approaches include: a noise reduction mode based on signal processing and a noise reduction mode based on an acoustic model.
The noise reduction mode based on signal processing is a relatively common noise reduction mode, and a signal processing noise reduction module is added at the front end of a speech recognition model to reduce noise of noisy data, as shown in fig. 1. Signal processing noise reduction modules typically require an array of microphones to achieve better performance.
The NS (Noise Suppression) algorithm can solve the problem of background noise to a certain extent. However, to achieve better noise reduction and data enhancement with little speech distortion, it is generally necessary to introduce additional spatial information through a microphone array, which is relatively costly. Moreover, in many application scenarios the user's device has only a single microphone, so the hardware requirement for this kind of speech noise reduction cannot be met.
In addition, since signal processing mainly reduces environmental noise and enhances the speaker's voice, it is difficult for current signal processing methods to extract the voice of the target speaker from voice data that contains background speaker noise. For such scenes, more algorithm modules are usually introduced to achieve noise reduction, such as sound source localization (DOA) and echo cancellation (AEC).
Therefore, to obtain better performance, the noise reduction mode based on signal processing needs to use the spatial information provided by a microphone array; its implementation cost is high and its implementation is complex.
The noise reduction method based on the acoustic model trains an acoustic model robust to noise, generally by manually collecting speech samples with corresponding scene noise or by manually making noisy samples, as shown in fig. 2.
Noise reduction based on acoustic models usually relies on deep learning to absorb/reduce noise at the model end. Before training the model, noisy data samples need to be obtained. If real noisy voice data is available, it needs to be labeled in advance, and the manual labeling cost is relatively high. If no real noisy voice data is available, the noise data for the scene needs to be collected or recorded manually and then labeled, which likewise consumes a large amount of labor, and the recognition effect of the resulting acoustic model still differs somewhat from one trained directly on real data. In addition, for samples whose speech contains background voices, this method cannot effectively eliminate the influence of the background voices.
Therefore, the noise reduction method in the speech evaluation method in the related art has the problems of high implementation cost, complex implementation process and incapability of effectively eliminating the background voice.
Disclosure of Invention
The application provides a method and a device for evaluating spoken language of an object, a storage medium and an electronic device, which are used for at least solving the problems that the noise reduction mode in a voice evaluation method in the related art is high in implementation cost, complex in implementation process and incapable of effectively eliminating background voice.
According to an aspect of an embodiment of the present application, a method for evaluating a spoken language of an object is provided, including: acquiring a target voiceprint feature of a target object and a first voice feature of voice data to be evaluated, wherein the voice data to be evaluated is voice data used for carrying out spoken language evaluation on the target object; determining a target voice boundary of the target object in the voice data to be evaluated according to the target voiceprint characteristic and the first voice characteristic; acquiring first voice data belonging to the target object from the voice data to be evaluated according to the target voice boundary; and carrying out spoken language evaluation on the target object by using the first voice data to obtain a target evaluation result of the target object.
Optionally, the obtaining of the target voiceprint feature of the target object and the first speech feature of the speech data to be evaluated includes: extracting the target voiceprint features of the target object from a voiceprint library, wherein the voiceprint library stores the voiceprint features of a plurality of objects, and the plurality of objects comprise the target object; and extracting the voice features of the voice data to be evaluated to obtain the first voice features.
Optionally, before the extracting the target voiceprint feature of the target object from the voiceprint library, the method further includes: displaying registration prompt information through a client of the target object, wherein the registration prompt information is used for prompting the target object to register voiceprint; receiving second voice data returned by the client, wherein the second voice data is the voice data input by the target object responding to the registration prompt message; extracting the voiceprint characteristics of the second voice data to obtain the target voiceprint characteristics; and saving the target voiceprint characteristics into the voiceprint library.
Optionally, the determining, according to the target voiceprint feature and the first speech feature, a target speech boundary of the target object in the speech data to be evaluated includes: inputting the target voiceprint feature and the first voice feature into a target voice activity detection model to obtain the target voice boundary output by the target voice activity detection model, wherein the target voice activity detection model is obtained by training an initial voice activity detection model by using a voiceprint feature of a training object and a voice feature of training voice data, and the training voice data is voice data marked with the voice boundary of the training object.
Optionally, the inputting the target voiceprint feature and the first voice feature into a target voice activity detection model, and obtaining the target voice boundary output by the target voice activity detection model includes: inputting the target voiceprint feature and the first voice feature into the target voice activity detection model to obtain the probability that each voice frame in the voice data to be evaluated belongs to the target object, wherein the probability is determined by the target voice activity detection model; determining a first voice frame belonging to the target object in the voice data to be evaluated, wherein the first voice frame is a voice frame of which the probability of belonging to the target object in the voice data to be evaluated is greater than or equal to a target probability threshold; and outputting the target voice boundary of the target object in the voice data to be evaluated according to the first voice frame.
Optionally, before the determining, according to the target voiceprint feature and the first speech feature, a target speech boundary of the target object in the speech data to be evaluated, the method further includes: acquiring voiceprint characteristics of the training object and voice characteristics of the training voice data; determining a second voice frame and a third voice frame according to the voice boundary of the training object in the training voice data, wherein the second voice frame is a voice frame belonging to the training object in the training voice data, and the third voice frame is a voice frame except the second voice frame in the training voice data; inputting the voiceprint feature of the training object and the voice feature of the training voice data into the initial voice activity detection model to obtain the probability that each voice frame in the training voice data output by the initial voice activity detection model belongs to the training object; and adjusting model parameters of the initial voice activity detection model to obtain the target voice activity detection model, wherein the probability that the second voice frame output by the target voice activity detection model belongs to the training object is greater than or equal to a target probability threshold, and the probability that the third voice frame belongs to the training object is smaller than the target probability threshold.
Optionally, the performing spoken language evaluation on the target object by using the first speech data to obtain a target evaluation result of the target object includes: acquiring a second voice characteristic of the first voice data; inputting the second voice characteristics into a voice recognition model to obtain a decoding result output by the voice recognition model, wherein the decoding result is used for indicating the probability that each pronunciation unit of the first voice data is a corresponding target pronunciation unit; and determining the target evaluation result of the target object according to the probability that each pronunciation unit indicated by the decoding result is the corresponding target pronunciation unit.
According to another aspect of the embodiments of the present application, there is also provided a device for evaluating a spoken language of an object, including: the device comprises a first obtaining unit, a second obtaining unit and a third obtaining unit, wherein the first obtaining unit is used for obtaining a target voiceprint feature of a target object and a first voice feature of voice data to be evaluated, and the voice data to be evaluated is voice data used for carrying out spoken language evaluation on the target object; the first determining unit is used for determining a target voice boundary of the target object in the voice data to be evaluated according to the target voiceprint feature and the first voice feature; the second obtaining unit is used for obtaining first voice data belonging to the target object from the voice data to be evaluated according to the target voice boundary; and the evaluation unit is used for carrying out spoken language evaluation on the target object by using the first voice data to obtain a target evaluation result of the target object.
Optionally, the first obtaining unit includes: a first extraction module, configured to extract the target voiceprint features of the target object from a voiceprint library, where voiceprint features of a plurality of objects are stored in the voiceprint library, and the plurality of objects include the target object; and the second extraction module is used for extracting the voice features of the voice data to be evaluated to obtain the first voice features.
Optionally, the apparatus further comprises: a display unit, configured to display registration prompt information through a client of the target object before the target voiceprint feature of the target object is extracted from the voiceprint library, where the registration prompt information is used to prompt the target object to register a voiceprint; a receiving unit, configured to receive second voice data returned by the client, where the second voice data is voice data input by the target object in response to the registration prompt information; the extracting unit is used for extracting the voiceprint features of the second voice data to obtain the target voiceprint features; and the storage unit is used for storing the target voiceprint characteristics into the voiceprint library.
Optionally, the first determining unit includes: the first input module is configured to input the target voiceprint feature and the first voice feature to a target voice activity detection model, and obtain the target voice boundary output by the target voice activity detection model, where the target voice activity detection model is obtained by training an initial voice activity detection model using a voiceprint feature of a training object and a voice feature of training voice data, and the training voice data is voice data labeled with the voice boundary of the training object.
Optionally, the first input module comprises: the input submodule is used for inputting the target voiceprint characteristic and the first voice characteristic into the target voice activity detection model to obtain the probability that each voice frame in the voice data to be evaluated belongs to the target object, wherein the probability is determined by the target voice activity detection model; the determining submodule is used for determining a first voice frame belonging to the target object in the voice data to be evaluated, wherein the first voice frame is a voice frame of which the probability of belonging to the target object in the voice data to be evaluated is greater than or equal to a target probability threshold; and the output sub-module is used for outputting the target voice boundary of the target object in the voice data to be evaluated according to the first voice frame.
Optionally, the apparatus further comprises: a third obtaining unit, configured to obtain a voiceprint feature of the training object and a voice feature of the training voice data before determining a target voice boundary of the target object in the to-be-evaluated voice data according to the target voiceprint feature and the first voice feature; a second determining unit, configured to determine a second speech frame and a third speech frame according to a speech boundary of the training object in the training speech data, where the second speech frame is a speech frame belonging to the training object in the training speech data, and the third speech frame is a speech frame other than the second speech frame in the training speech data; an input unit, configured to input a voiceprint feature of the training object and a speech feature of the training speech data into the initial voice activity detection model, so as to obtain a probability that each speech frame in the training speech data, output by the initial voice activity detection model, belongs to the training object; and the adjusting unit is used for adjusting the model parameters of the initial voice activity detection model to obtain the target voice activity detection model, wherein the probability that the second voice frame output by the target voice activity detection model belongs to the training object is greater than or equal to a target probability threshold, and the probability that the third voice frame belongs to the training object is smaller than the target probability threshold.
Optionally, the evaluation unit includes: the acquisition module is used for acquiring a second voice characteristic of the first voice data; the second input module is used for inputting the second voice characteristics into a voice recognition model to obtain a decoding result output by the voice recognition model, wherein the decoding result is used for indicating the probability that each pronunciation unit of the first voice data is a corresponding target pronunciation unit; and the determining module is used for determining the target evaluation result of the target object according to the probability that each pronunciation unit indicated by the decoding result is the corresponding target pronunciation unit.
According to a further aspect of an embodiment of the present application, there is also provided a computer-readable storage medium having a computer program stored thereon, wherein the computer program is configured to perform the steps of any of the above method embodiments when executed.
According to a further aspect of an embodiment of the present application, there is also provided an electronic apparatus, including a memory and a processor, the memory storing a computer program therein, the processor being configured to execute the computer program to perform the steps in any of the above method embodiments.
In the embodiment of the application, noise reduction is performed using the voiceprint feature and the voice feature of a specific speaker: a target voiceprint feature of a target object and a first voice feature of voice data to be evaluated are obtained, wherein the voice data to be evaluated is the voice data used for carrying out spoken language evaluation on the target object; a target voice boundary of the target object in the voice data to be evaluated is determined according to the target voiceprint feature and the first voice feature; first voice data belonging to the target object is acquired from the voice data to be evaluated according to the target voice boundary; and the first voice data is used to perform spoken language evaluation on the target object to obtain a target evaluation result of the target object. Because the voiceprint feature and the voice feature of the target speaker (the target object) are used to determine the voice boundary of the target speaker, no microphone array is required to provide spatial information and the implementation process is simple, so the purpose of effective voice noise reduction can be achieved. This achieves the technical effects of reducing the cost of spoken language evaluation, simplifying the implementation process and improving the effectiveness of background voice removal, and solves the problems that the noise reduction modes in the voice evaluation methods in the related art are high in implementation cost, complex in implementation process and incapable of effectively eliminating background voices.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the invention and together with the description, serve to explain the principles of the invention.
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious for those skilled in the art that other drawings can be obtained according to the drawings without inventive exercise.
FIG. 1 is a schematic diagram of an alternative method for spoken language evaluation of an object;
FIG. 2 is a schematic diagram of another alternative method for spoken language evaluation of an object;
FIG. 3 is a schematic diagram of a hardware environment of an alternative method for spoken language evaluation of an object, according to an embodiment of the invention;
FIG. 4 is a flow chart of an alternative method for spoken language evaluation of an object according to an embodiment of the present application;
FIG. 5 is a schematic diagram of an alternative method for spoken language evaluation of an object according to an embodiment of the application;
FIG. 6 is a flow diagram of an alternative method for spoken language assessment of an object according to an embodiment of the present application;
FIG. 7 is a block diagram of an alternative apparatus for spoken language evaluation of an object according to an embodiment of the present application;
fig. 8 is a block diagram of an alternative electronic device according to an embodiment of the present application.
Detailed Description
In order to make the technical solutions better understood by those skilled in the art, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only partial embodiments of the present application, but not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
It should be noted that the terms "first," "second," and the like in the description and claims of this application and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the application described herein are capable of operation in sequences other than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
First, some of the nouns or terms appearing in the description of the embodiments of the present application are explained as follows:
ASR: automatic Speech Recognition;
VAD: voice Activity Detection for detecting voiced and unvoiced parts of a given speech;
NN: neural Network, Neural Network;
DNN: deep Neural Network, Deep Neural Network;
vector: vector quantity;
BSS: blind Source Separation, Blind Source Separation;
fbank: filter bank, common features of speech;
d-vector/x-vector: corresponding to the type of voiceprint feature.
According to one aspect of the embodiments of the application, a method for evaluating the spoken language of an object is provided. Optionally, in this embodiment, the above-mentioned method for evaluating the spoken language of the object may be applied to a hardware environment formed by the terminal 302 and the server 304 shown in fig. 3. As shown in fig. 3, the server 304 is connected to the terminal 302 via a network. The server may be used to provide services (such as game services, application services, etc.) for the terminal or a client installed on the terminal, and a database may be provided on the server or independently of the server for providing data storage services for the server 304. The network includes but is not limited to: a wide area network, a metropolitan area network, or a local area network, and the terminal 302 is not limited to a PC, a mobile phone, a tablet computer, etc. The spoken language evaluation method of the object in the embodiment of the present application may be executed by the server 304, may be executed by the terminal 302, or may be executed by both the server 304 and the terminal 302. The terminal 302 may also execute the method for evaluating the spoken language of the object according to the embodiment of the present application through a client installed on it.
Taking the operation at the terminal side as an example, fig. 4 is a flowchart of a method for spoken language evaluation of an optional object according to an embodiment of the present application, and as shown in fig. 4, the flowchart of the method may include the following steps:
step S402, obtaining a target voiceprint feature of a target object and a first voice feature of voice data to be evaluated, wherein the voice data to be evaluated is voice data used for carrying out spoken language evaluation on the target object;
step S404, determining a target voice boundary of a target object in the voice data to be evaluated according to the target voiceprint characteristic and the first voice characteristic;
step S406, acquiring first voice data belonging to a target object from the voice data to be evaluated according to the target voice boundary;
step S408, carrying out spoken language evaluation on the target object by using the first voice data to obtain a target evaluation result of the target object.
Through the steps from S402 to S408, a target voiceprint feature of a target object and a first voice feature of voice data to be evaluated are obtained, wherein the voice data to be evaluated is voice data used for carrying out spoken language evaluation on the target object; determining a target voice boundary of the target object in the voice data to be evaluated according to the target voiceprint characteristic and the first voice characteristic; acquiring first voice data belonging to the target object from the voice data to be evaluated according to the target voice boundary; the first voice data is used for carrying out spoken language evaluation on the target object to obtain a target evaluation result of the target object, so that the problems that the noise reduction mode in the voice evaluation method in the related technology is high in implementation cost, complex in implementation process and incapable of effectively eliminating background voice are solved, the spoken language evaluation cost is reduced, the implementation process is simplified, and the effectiveness of removing the background voice is improved.
In the technical solution provided in step S402, a target voiceprint feature of a target object and a first voice feature of voice data to be evaluated are obtained, where the voice data to be evaluated is voice data used for performing spoken language evaluation on the target object.
The method for evaluating the spoken language of the object in the embodiment can be applied to a scene of carrying out spoken language evaluation on a certain language. In this scenario, the terminal device of the user may be in communication connection with a server, where the server is a server for performing spoken language evaluation. The terminal device may run a client with a target application, and the target application may be an application for spoken language evaluation. The client and the server can both belong to a spoken language evaluation system or a spoken language system with a spoken language evaluation function, and the spoken language system can be used for spoken language learning, learning communication, spoken language training, spoken language evaluation and the like.
The target object (corresponding to a certain user and a target speaker) can log in a client of a target application running on a terminal device of the target object by using an account number, a password, a dynamic password, a related application login and the like, and the client is triggered to enter a spoken language evaluation interface by executing a triggering operation. The trigger operation may be a click operation, a slide operation, or a combination thereof, which is not specifically limited in this embodiment.
The spoken language assessment may comprise a plurality of assessment resources, e.g., a plurality of topics, each assessment resource may comprise, but is not limited to, at least one of: the method comprises the steps of evaluating text prompt information of content, evaluating voice prompt information of content, evaluating text description information of content and evaluating reference voice data (namely, standard answers), wherein the text prompt information of the content and the text description information of the content can be displayed through a spoken language evaluating interface of a client, and the voice prompt information of the content and the reference voice data of the content can be played through a loudspeaker of a terminal device.
For example, in spoken language evaluation, the evaluation content is "XXXX" (one sentence), and a text prompt message may be displayed in the spoken language evaluation interface. The text prompt may indicate when the user should input voice, which question is the current one (for example, the second question), and provide an entry for going to the previous or next question, and so on; the text of the evaluation content can also be displayed, so that the user can conveniently know the content to be input. In addition, a voice prompt can be played through the speaker, indicating when to make the speech input, the current topic, etc. The standard answer may also be played through the speaker, one or more times.
A button for starting voice input, a button for canceling voice input, a button for pausing voice input, and the like may be displayed on the spoken language evaluation interface of the client, and in addition, other buttons for controlling the progress of spoken language evaluation may also be displayed, which is not specifically limited in this embodiment.
For the target evaluation resource, the target object can perform voice input according to the prompt of the client, and input the voice data to be evaluated corresponding to the target evaluation resource, wherein the voice data to be evaluated can be the voice data used for performing spoken language evaluation on the target object. After the client acquires the voice data to be evaluated input by the user, the voice data to be evaluated can be sent to the server through the communication connection between the client and the server, so that the server can evaluate the spoken language conveniently.
The server can receive the voice data to be evaluated sent by the client, or acquire the voice data to be evaluated from the database. The spoken language evaluation voice data of different objects can be firstly stored in the database, and the server can acquire the spoken language evaluation voice data from the database for spoken language evaluation according to the time sequence or other sequences (such as priority levels) of the spoken language evaluation voice data.
In addition to the voice data to be evaluated, the server may further obtain a target voiceprint feature (feature data) of the target object, where the voiceprint feature may be pre-stored, may also be recorded in the field, and may also be obtained by other manners, which is not specifically limited in this embodiment.
Optionally, in this embodiment, the embedding from an intermediate layer of a d-vector or x-vector neural network may be used as the voiceprint feature, and the extracted speech feature may be the filter bank (Fbank) feature.
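As an illustration of the two kinds of features mentioned above, the following is a minimal sketch in Python, assuming torchaudio is available for Fbank extraction; the speaker encoder passed in stands for a hypothetical pretrained d-vector/x-vector style network and is not part of any particular library:

```python
# Sketch only: torchaudio provides the Fbank computation; the speaker encoder
# passed in is assumed to be a pretrained d-vector/x-vector style network.
import torch
import torchaudio

def extract_fbank(wav_path: str, num_mel_bins: int = 40) -> torch.Tensor:
    """Frame-level Fbank features, shape (num_frames, num_mel_bins)."""
    waveform, sample_rate = torchaudio.load(wav_path)
    return torchaudio.compliance.kaldi.fbank(
        waveform, num_mel_bins=num_mel_bins, sample_frequency=sample_rate)

def extract_voiceprint(wav_path: str, encoder: torch.nn.Module) -> torch.Tensor:
    """Use an intermediate-layer embedding of a speaker network as the voiceprint."""
    fbank = extract_fbank(wav_path)              # (T, D)
    with torch.no_grad():
        embedding = encoder(fbank.unsqueeze(0))  # (1, E), e.g. a d-vector
    return embedding.squeeze(0)
```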
In the technical solution provided in step S404, a target speech boundary of a target object in the speech data to be evaluated is determined according to the target voiceprint feature and the first speech feature.
The environment around the target speaker may contain a lot of noise. In order to improve the performance of spoken language evaluation and further improve the effect of the evaluation algorithm, noise reduction processing can be performed on the speech data to be evaluated. The noise reduction processing may be: for a given speech segment (such as the speech data to be evaluated), removing the speech segments of non-target speakers at both ends, thereby extracting the speech of the target speaker (such as the target object), i.e. the speaker of interest, for spoken language evaluation. Because irrelevant speech is removed and does not need to be scored, the processing speed of spoken language evaluation can be improved. Marking the boundaries of the target speaker in effect yields the speech segments of the target speaker.
In order to remove the voice sections of non-target objects, the server may determine a target voice boundary of the target object in the voice data to be evaluated according to the target voiceprint feature and the first voice feature. The combination of the target voiceprint feature and the first speech feature can be used to identify which part of the speech data to be evaluated is the speech input of the target object. The first speech feature may be the Fbank feature of the speech data to be evaluated, or another speech feature from which the pronunciation units of the speech data can be recognized, which is not specifically limited in this embodiment.
For example, the first speech feature may be used to identify an unmuted portion of the speech data to be evaluated, and the target voiceprint feature may be used to determine a target portion of the unmuted portion that belongs to the target object. For another example, the target voiceprint feature and the first speech feature may be combined as an integral feature to be used for identifying speech data belonging to the target object in the speech data to be evaluated.
The number of the target voice boundaries, that is, the boundaries between the voice data of the target object and the voice data of the non-target object, may be one or more, which is not specifically limited in this embodiment.
For example, the duration of the speech data to be evaluated of the target speaker is 1 minute, the first 10 s being silence and speech input of non-target speakers; the speech boundaries of the target speaker are then the 10th second and the 60th second.
For another example, the duration of the speech data to be evaluated of the target speaker is 1 minute, the first 10 s being silence and speech input of non-target speakers and the last 5 s being silence; the speech boundaries of the target speaker are then the 10th second and the 55th second.
For another example, the duration of the speech data to be evaluated of the target speaker is 1 minute, the first 10 s being silence and speech input of non-target speakers, the 25th to 30th seconds being speech input of non-target speakers, and the last 5 s being silence; the speech boundaries of the target speaker are then the 10th, 25th, 30th and 55th seconds.
In the technical solution provided in step S406, first speech data belonging to a target object is acquired from the speech data to be evaluated according to the target speech boundary.
According to the target voice boundary of the target object in the voice data to be evaluated, the first voice data belonging to the target object can be obtained from the voice data to be evaluated. The manner of acquiring the first voice data may be: and intercepting the voice section of the target object according to the target voice boundary so as to obtain first voice data belonging to the target object. The manner of acquiring the first voice data may also be: in the process of determining the boundary of the target voice, voice frames belonging to the target object are extracted in sequence to obtain first voice data belonging to the target object. This is not particularly limited in this embodiment.
When the speech segment of the target object is obtained, the speech segment between the (2n-1)-th speech boundary and the 2n-th speech boundary may be determined as a speech segment of the target object according to the number and the order of the target speech boundaries, where n is a positive integer greater than or equal to 1. For example, the first speech boundary may be found first, and the speech segment between the first speech boundary and the second speech boundary is determined to belong to the target object; it is then checked whether a third speech boundary exists, and if so, the speech segment between the third speech boundary and the fourth speech boundary is determined to belong to the target object, and so on until all speech boundaries have been traversed.
For example, as previously described, if the speech boundary of the targeted speaker is: 10 th s and 60 th s, the speech segments of the target speaker are: 10-60 s; if the speech boundary of the targeted speaker is: 10 th s and 55 th s, the speech segments of the target speaker are: 10-55 s; if the speech boundary of the targeted speaker is: the 10 th, 25 th, 30 th and 55 th speech segments of the target speaker are: 10 to 25s and 30 to 55 s.
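A minimal sketch of the boundary-pairing rule described above (the helper name is illustrative, not taken from the application):

```python
from typing import List, Tuple

def segments_from_boundaries(boundaries: List[float]) -> List[Tuple[float, float]]:
    """Segment n spans the (2n-1)-th to the 2n-th boundary (1-based), in seconds."""
    return [(boundaries[i], boundaries[i + 1])
            for i in range(0, len(boundaries) - 1, 2)]

# Example from the description: boundaries at the 10th, 25th, 30th and 55th seconds
# yield the target-speaker segments 10-25 s and 30-55 s.
print(segments_from_boundaries([10, 25, 30, 55]))  # [(10, 25), (30, 55)]
```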
In the technical solution provided in step S408, the first speech data is used to perform spoken language evaluation on the target object, so as to obtain a target evaluation result of the target object.
After the first voice data of the target object is obtained, the first voice data can be used as data for carrying out spoken language evaluation on the target object. For example, a spoken language evaluation mode in the related art is adopted to perform spoken language evaluation on the first speech data to obtain a target evaluation result of the target object.
As an optional embodiment, the obtaining a target voiceprint feature of a target object and a first speech feature of speech data to be evaluated includes:
s11, extracting target voiceprint characteristics of the target object from the voiceprint library, wherein the voiceprint library stores the voiceprint characteristics of a plurality of objects, and the plurality of objects comprise the target object;
and S12, extracting the voice features of the voice data to be evaluated to obtain first voice features.
Voiceprint characteristics of a plurality of objects may be stored in a voiceprint library, and the plurality of objects may include the target object. When the target voiceprint characteristics of the target object are obtained, the server can extract the target voiceprint characteristics of the target object from the voiceprint library by using the object identification of the target object through communication connection with the voiceprint library.
Optionally, the voiceprint library may also be stored locally in the server, and the server may also directly match the target voiceprint features from the voiceprint library by using the object identifier of the target object.
The server may extract a first speech feature from the speech data to be evaluated using a speech feature extraction algorithm, where the first speech feature may be an FBank feature.
Through this embodiment, storing the voiceprint features of different objects in a voiceprint library can improve the efficiency of acquiring voiceprint features and thus speed up spoken language evaluation.
As an optional embodiment, before extracting the target voiceprint feature of the target object from the voiceprint library, the method further includes:
s21, displaying registration prompt information through the client of the target object, wherein the registration prompt information is used for prompting the target object to register the voiceprint;
s22, receiving second voice data returned by the client, wherein the second voice data is the voice data input by the target object responding to the registration prompt message;
s23, extracting the voiceprint features of the second voice data to obtain target voiceprint features;
and S24, storing the target voiceprint characteristics into a voiceprint library.
The voiceprint features in the voiceprint library can be entered in advance. When the user first uses the product, a speaker system (which can be the same as the spoken language evaluation system or a different system) can be used to prompt the target user to enter a voiceprint and register it in the library.
The server may send a voiceprint registration instruction to the terminal device to prompt the client of the target object to interact with the target object, and perform voiceprint registration of the target object. The client of the target object can display registration prompt information on a display interface of the client according to the voiceprint registration instruction of the server or preset configuration information so as to prompt the target object to register the voiceprint.
The target object can perform voice input according to the registration prompt message, and the client can acquire second voice data input by the target object and send the second voice data to the server. The voice input may be voice data of a specific content or voice data of an arbitrary content, for example, letting the user speak a few words in advance.
And the server receives second voice data sent by the client, extracts the target voiceprint characteristics from the second voice data, and stores the target voiceprint characteristics into a voiceprint library. For example, the server may combine the voiceprint characteristics of the above utterances spoken by the user as the voiceprint characteristics of the speaker registration. The user only needs to register once, and then the voiceprint characteristics can be automatically extracted from the voiceprint library to be directly used when the user carries out spoken language evaluation.
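The description does not fix how the voiceprint features of the registration utterances are combined; averaging the per-utterance embeddings is one common choice. A minimal sketch, with a hypothetical in-memory voiceprint library:

```python
from typing import Dict, List
import numpy as np

def register_voiceprint(utterance_embeddings: List[np.ndarray]) -> np.ndarray:
    """Combine the voiceprints of several registration utterances by averaging."""
    return np.mean(np.stack(utterance_embeddings), axis=0)

# Hypothetical voiceprint library: object ID -> stored voiceprint feature.
voiceprint_library: Dict[str, np.ndarray] = {}
voiceprint_library["user_001"] = register_voiceprint(
    [np.random.randn(256) for _ in range(3)])  # e.g. three registration utterances
```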
Through this embodiment, prompting the user to enter voiceprint features during voiceprint registration can improve the efficiency of acquiring voiceprint features and thus speed up spoken language evaluation.
As an alternative embodiment, determining a target speech boundary of a target object in speech data to be evaluated according to a target voiceprint feature and a first speech feature includes:
and S31, inputting the target voiceprint feature and the first voice feature into a target voice activity detection model to obtain a target voice boundary output by the target voice activity detection model, wherein the target voice activity detection model is obtained by training an initial voice activity detection model by using the voiceprint feature of a training object and the voice feature of training voice data, and the training voice data is the voice data marked with the voice boundary of the training object.
The target object speech boundary may be determined by a personalized voice activity detection model (i.e., a target voice activity detection model). The target voice activity detection model (e.g., VAD model) is trained using voiceprint features of the training object and speech features of training speech data that mark speech boundaries of the training object. The traditional voice activity detection system is used for judging a mute speech segment and a non-mute speech segment, and the personalized voice activity detection model in the embodiment combines voiceprint characteristics and can be used for judging a target speaker and a non-target speaker to obtain two types of results.
For the speech data to be evaluated, the server may input the voiceprint feature of the target object (the target voiceprint feature) and the speech feature of the speech data to be evaluated (the first speech feature) into the target voice activity detection model, and the target voice activity detection model outputs the target voice boundary of the target object in the speech data to be evaluated.
By this embodiment, determining the voice boundary of the target object through the personalized voice activity detection model can improve the compatibility of the voice boundary determination mode and reduce the model research and development cost.
As an alternative embodiment, inputting the target voiceprint feature and the first speech feature into the target voice activity detection model, and obtaining the target speech boundary output by the target voice activity detection model includes:
s41, inputting the target voiceprint characteristic and the first voice characteristic into the target voice activity detection model to obtain the probability that each voice frame in the voice data to be evaluated belongs to the target object, wherein the probability is determined by the target voice activity detection model;
s42, determining a first speech frame belonging to a target object in the speech data to be evaluated, wherein the first speech frame is a speech frame of which the probability of belonging to the target object in the speech data to be evaluated is greater than or equal to a target probability threshold;
and S43, outputting the target voice boundary of the target object in the voice data to be evaluated according to the first voice frame.
The server may first merge the target voiceprint feature and the first speech feature to obtain a target speech feature, and then input the target speech feature into the voice activity detection model. For example, the first speech feature is the Fbank feature, which at time t can be expressed as x_t ∈ R^D, and the target speaker voiceprint feature extracted by the speaker system is v_spk. The speech feature x_t and the voiceprint v_spk can be combined directly to generate a new feature x̂_t, which serves as the input feature of the personalized voice activity detection model:

x̂_t = [x_t; v_spk]
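In code, this per-frame combination could look like the following sketch (the fixed speaker vector is simply appended to every Fbank frame; names are illustrative):

```python
import numpy as np

def build_vad_input(fbank: np.ndarray, voiceprint: np.ndarray) -> np.ndarray:
    """Append the speaker voiceprint v_spk to every Fbank frame x_t: x_hat_t = [x_t; v_spk]."""
    tiled = np.tile(voiceprint, (fbank.shape[0], 1))  # (T, E)
    return np.concatenate([fbank, tiled], axis=1)     # (T, D + E)
```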
the intermediate layer output of the personalized voice activity detection model may correspond to two classification tasks, i.e., giving the probability of whether each frame is a target speaker or a non-target speaker, and obtaining the speech boundary of the target speaker. After the target voice characteristics are input into the target voice activity detection model, the target voice activity detection model can determine the probability that each voice frame in the voice data to be evaluated belongs to the target object.
The probability that different speech frames determined by the target voice activity detection model belong to the target object is different, and if the probability that one speech frame belongs to the target object is greater than or equal to a target probability threshold, the speech frame can be determined as the speech frame belonging to the target object, namely, a first speech frame; if the probability that a speech frame belongs to the target object is less than the target probability threshold, it can be determined as a speech frame not belonging to the target object.
According to the first voice frame, the target voice activity detection model can output a target voice boundary of a target object in the voice data to be evaluated. The target speech boundary may be obtained by traversing all of the first speech frames in sequence.
Optionally, a first speech frame may be determined as a first-class speech boundary of the target object; then each first speech frame is examined in sequence, and for the examined first speech frame, if it is adjacent to a first-class speech boundary, it is determined as a second-class speech boundary; if it is adjacent to a second-class speech boundary, the second-class speech boundary is updated to the current first speech frame; and if it is not adjacent to any speech boundary, it may be determined as a first-class speech boundary. The output target speech boundaries include the first-class speech boundaries and the second-class speech boundaries.
After the first-class and second-class speech boundaries are obtained, the time period between a first-class speech boundary and the adjacent second-class speech boundary may be determined as a time period belonging to the target object, where "adjacent" means the next boundary along the time axis.
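The post-processing from per-frame probabilities to boundaries is not spelled out beyond the rules above; the following sketch is one plausible reading, assuming a 0.5 threshold and a 10 ms frame shift (both assumptions):

```python
from typing import List, Tuple

def boundaries_from_probs(frame_probs: List[float], threshold: float = 0.5,
                          frame_shift: float = 0.01) -> List[Tuple[float, float]]:
    """Turn per-frame target-speaker probabilities into (start, end) times in seconds."""
    segments, start = [], None
    for i, p in enumerate(frame_probs):
        if p >= threshold and start is None:
            start = i                      # first-class boundary: segment start
        elif p < threshold and start is not None:
            segments.append((start * frame_shift, i * frame_shift))
            start = None                   # second-class boundary: segment end
    if start is not None:                  # speech runs to the end of the audio
        segments.append((start * frame_shift, len(frame_probs) * frame_shift))
    return segments
```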
According to the embodiment, the voice frames belonging to the target object are determined according to the probability that each voice frame belongs to the target object, and then the voice boundary of the target object is determined, so that the compatibility of a voice boundary determination model (compatible with a voice activity detection model) can be improved, and the cost of oral language evaluation is reduced.
As an optional embodiment, before determining a target speech boundary of a target object in speech data to be evaluated according to the target voiceprint feature and the first speech feature, the method further includes:
s51, acquiring the voiceprint characteristics of the training object and the voice characteristics of the training voice data;
s52, determining a second voice frame and a third voice frame according to the voice boundary of a training object in the training voice data, wherein the second voice frame is a voice frame belonging to the training object in the training voice data, and the third voice frame is a voice frame except the second voice frame in the training voice data;
s53, inputting the voiceprint characteristic of the training object and the voice characteristic of the training voice data into the initial voice activity detection model to obtain the probability that each voice frame in the training voice data output by the initial voice activity detection model belongs to the training object;
s54, adjusting the model parameters of the initial voice activity detection model to obtain a target voice activity detection model, wherein the probability that the second voice frame output by the target voice activity detection model belongs to the training object is greater than or equal to a target probability threshold, and the probability that the third voice frame belongs to the training object is less than the target probability threshold.
Before using the target voice activity detection model, a voiceprint feature of a training object and a voice feature of training voice data of the training object can be obtained, wherein the training voice data is voice data marked with a voice boundary of the training object; and then, training the initial voice activity detection model by using the voiceprint characteristics of the training object and the voice characteristics of the training voice data to obtain a target voice activity detection model. The server or device that trains the initial model may or may not be the same as the server or device that uses the target model for speech boundary determination.
Before model training, the voiceprint features of a training object and the voice features of training voice data can be combined to obtain model training features; when performing model training, the initial voice activity detection model can be trained using the model training features. The training target of personalized voice activity detection is a binary classification task, namely giving the probability that each frame belongs to the specific speaker (the training object) or to a non-specific speaker, from which the speech boundary of the specific speaker is obtained.
Alternatively, the speech frames (i.e., the second speech frames) belonging to the training object in the training speech data and the speech frames (i.e., the third speech frames) other than the second speech frames in the training speech data may be determined according to the speech boundary of the training object in the training speech data.
The model training features are input to the initial voice activity detection model, and a first output result of the initial voice activity detection model is obtained, wherein the first output result can indicate the probability that each voice frame in the training voice data belongs to the training object. According to the first output result, the marked second voice frame and the marked third voice frame, the model parameters of the initial voice activity detection model can be adjusted, so that the probability that the second voice frame indicated by the second output result output by the adjusted voice activity detection model belongs to the training object is larger than the probability that the second voice frame indicated by the first output result belongs to the training object, and the probability that the third voice frame indicated by the second output result belongs to the training object is smaller than the probability that the third voice frame indicated by the first output result belongs to the training object.
There may be a plurality of pieces of training speech data and, correspondingly, a plurality of model training features, which may be sequentially input to the voice activity detection model while the model parameters of the voice activity detection model are adjusted. Training is finished, through multiple rounds of iteration, when the objective function is satisfied, thereby obtaining the target voice activity detection model. The probability that the second speech frame output by the target voice activity detection model belongs to the training object is greater than or equal to a target probability threshold, and the probability that the third speech frame output by the target voice activity detection model belongs to the training object is less than the target probability threshold.
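To make the training procedure concrete, the following is a minimal sketch in Python, assuming a simple PyTorch frame-level classifier; the actual network structure, feature dimensions and optimizer are not prescribed by this embodiment, and the names used here are illustrative only:

import torch
import torch.nn as nn

# Hypothetical frame-level VAD classifier; the embodiment does not fix the
# network structure (an LSTM- or DNN-based VAD could be used instead).
class PersonalizedVAD(nn.Module):
    def __init__(self, feat_dim=140, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1), nn.Sigmoid(),
        )

    def forward(self, x):               # x: [num_frames, feat_dim]
        return self.net(x).squeeze(-1)  # per-frame probability of "training object"

def train_step(model, optimizer, features, boundary):
    """One adjustment of the model parameters (step S54).

    features : [num_frames, 140] combined voiceprint + Fbank features
    boundary : (start_frame, end_frame), the annotated speech boundary of the
               training object, used to split second/third speech frames (S52)
    """
    labels = torch.zeros(features.shape[0])
    labels[boundary[0]:boundary[1]] = 1.0    # second speech frames -> 1, third -> 0
    probs = model(features)                  # step S53: per-frame probabilities
    loss = nn.BCELoss()(probs, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

In use, such a step would be repeated over the plural pieces of annotated training speech data until the objective function is satisfied, the result being the target voice activity detection model.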
It should be noted that training the voice activity detection model is a two-class classification task: one class is the specific speaker and the other is the non-specific speaker. In model prediction, the probability of each frame belonging to the specific speaker or the non-specific speaker is given.
By the embodiment, the initial voice activity detection model is trained by using the training voice data marked with the voice boundary of the training object to obtain the target voice activity detection model, so that the capability of the target voice activity detection model for recognizing the voice boundary of the specific speaker can be improved.
As an alternative embodiment, the performing spoken language evaluation on the target object by using the first speech data to obtain a target evaluation result of the target object includes:
s61, acquiring a second voice characteristic of the first voice data;
s62, inputting the second voice characteristics into the voice recognition model to obtain a decoding result output by the voice recognition model, wherein the decoding result is used for indicating the probability that each pronunciation unit of the first voice data is a corresponding target pronunciation unit;
and S63, determining a target evaluation result of the target object according to the probability that each pronunciation unit indicated by the decoding result is the corresponding target pronunciation unit.
In the case of spoken language assessment of the target object using the first speech data, the second speech feature of the first speech data may be acquired first. Because the first voice data is part of the voice data to be evaluated, the second voice feature of the first voice data can be extracted from the first voice feature of the voice data to be evaluated.
For example, when the speech feature is an Fbank feature, the reusable Fbank feature interval can be calculated according to the time boundary output by the target voice activity detection model.
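As an illustration of this reuse, here is a sketch assuming the Fbank features were computed with a fixed frame shift (e.g. 10 ms); the helper name is hypothetical:

def reuse_fbank(fbank, start_sec, end_sec, frame_shift_ms=10.0):
    """Slice the Fbank frames of the target speaker segment out of the already
    computed feature matrix of the whole utterance (fbank: [num_frames, 40])."""
    start = int(round(start_sec * 1000.0 / frame_shift_ms))
    end = int(round(end_sec * 1000.0 / frame_shift_ms))
    return fbank[start:end]

# e.g. the VAD output gives a boundary of 0.32 s to 2.75 s:
# second_feature = reuse_fbank(first_feature, 0.32, 2.75)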
After obtaining the second speech feature, the second speech feature may be fed into a speech recognition model (e.g., an ASR model) for decoding, and probability information of the designated pronunciation units, that is, the probability that each pronunciation unit of the first voice data is the corresponding target pronunciation unit, can be obtained. For example, the speech recognition model can directly perform constrained decoding (forced alignment) and normal decoding on the target speaker's speech segment, and these decoding results are used as the basis for scoring.
Each pronunciation unit may include one or more speech frames and corresponds to a recognized phoneme, and each target pronunciation unit corresponds to a target phoneme in the target evaluation resource, the target phoneme corresponding to one or more standard speech frames.
The target evaluation result of the target object is then determined according to the probability that each pronunciation unit indicated by the decoding result is the corresponding target pronunciation unit. For example, a spoken language evaluation model may determine the target evaluation result of the target object using the decoding result: the forced-alignment result is used as the standard answer, the normal decoding result is compared against this standard answer, and each pronunciation unit is scored in turn, thereby obtaining the target evaluation result of the target object.
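The concrete scoring formula is not fixed by this embodiment; the following is a minimal sketch that assumes the decoding result has already been reduced to, for each pronunciation unit, the probability that it is the corresponding target phoneme, and simply maps these probabilities to per-unit scores and an overall score:

def score_pronunciation(decode_result, full_score=100.0):
    """decode_result: list of (target_phoneme, probability) pairs, one per
    pronunciation unit, derived from the forced-alignment / normal-decoding
    comparison of the speech recognition model (hypothetical format)."""
    unit_scores = [prob * full_score for _, prob in decode_result]
    overall = sum(unit_scores) / len(unit_scores) if unit_scores else 0.0
    return unit_scores, overall

# e.g. scoring three pronunciation units of the word "good":
# score_pronunciation([("g", 0.92), ("uh", 0.85), ("d", 0.60)]) -> ([92.0, 85.0, 60.0], 79.0)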
According to this embodiment, the pronunciation units are identified by using the speech recognition model, and the spoken language evaluation is performed according to the decoding result of the speech recognition model, so that the rationality of the spoken language evaluation can be improved while remaining compatible with existing spoken language evaluation methods.
The following explains the method for evaluating the spoken language of an object in the embodiment of the present application with reference to an optional example. In this example, the speech features are Fbank features of the voice data, the speech recognition model is an ASR model, the target voice activity detection model is a personalized VAD model, and the voiceprint features are extracted by a speaker recognition system.
In the spoken language evaluation, in order to give the target object a better evaluation experience and a more accurate evaluation result, the speech can be denoised in advance. Due to the complexity of the recording scene, the noise types are also quite complex, and background human voices may even be present.
In this example, the speech segments of non-target speakers can be effectively removed by a mode in which the VAD decides the target speaker in combination with voiceprint features: a speaker recognition system is trained in advance, the speaker feature (voiceprint feature) of the current user is obtained by using this system, and the target speaker feature is combined when training the VAD system so that the model outputs two classes of probabilities, target speaker and non-target speaker, where the non-target-speaker class may also be pure silence.
Prior to spoken language evaluation, a speaker recognition system may be trained, for example using NN-based speaker models such as conventional d-vector or x-vector models. The training data of the speaker recognition model can be voice data containing speaker labels, and can be real user data in the specific scene. When a user first uses the product (client), the voiceprint of the target user can be extracted with the speaker recognition system and registered in the voiceprint library.
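A sketch of this registration step is given below; the embedding extractor extract_embedding stands for an already trained d-vector/x-vector style encoder and is an assumption, as is keeping the voiceprint library as an in-memory dictionary:

import numpy as np

voiceprint_library = {}   # user_id -> voiceprint feature (e.g. a 100-dim vector)

def register_voiceprint(user_id, registration_waveform, extract_embedding):
    """Extract the speaker embedding from the registration speech returned by
    the client and save it into the voiceprint library."""
    embedding = np.asarray(extract_embedding(registration_waveform))   # assumed encoder call
    voiceprint_library[user_id] = embedding / np.linalg.norm(embedding)  # length-normalized (assumption)
    return voiceprint_library[user_id]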
The personalized VAD model can be trained in combination with the voiceprint features: when training the VAD model, the initial VAD model can be trained by using the voiceprint feature of a reference speaker (training object) and the Fbank feature of reference voice data (training voice data). The model has two inputs, namely the voiceprint feature and the Fbank feature; for example, if the Fbank feature is 40-dimensional and the voiceprint feature is 100-dimensional, the combined feature is 140-dimensional, and this 140-dimensional feature is used as the input of the VAD model. The training data may be real annotated data in the specific scene, i.e., the target-speaker and non-target-speaker regions have already been labeled in the data. An NN-based VAD training method may be employed to train the model.
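The feature combination described above can be sketched as a per-frame concatenation in which the single utterance-level voiceprint vector is repeated for every frame (the 40-dimensional Fbank and 100-dimensional voiceprint sizes follow the example in the text):

import numpy as np

def combine_features(fbank, voiceprint):
    """fbank: [num_frames, 40] Fbank features of one utterance
    voiceprint: [100] speaker embedding of the training (or target) object
    returns: [num_frames, 140] input features for the personalized VAD model"""
    tiled = np.tile(voiceprint[np.newaxis, :], (fbank.shape[0], 1))
    return np.concatenate([fbank, tiled], axis=1)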
With reference to fig. 5 and fig. 6, the flow of the method for evaluating the spoken language of the object in this alternative example may include the following steps:
step S602, extracting voiceprint features and voice features of the target speaker.
For the target speaker, when the product is later used and the user's voice is received, the voice feature of the voice data to be evaluated and the voiceprint feature of the target speaker can be extracted.
And step S604, combining the voice characteristics and the voiceprint characteristics, and inputting the voice characteristics and the voiceprint characteristics into the personalized VAD model to obtain the voice boundary of the target speaker output by the VAD model.
The voice boundary of the target speaker can be determined by combining the extracted voice feature and voiceprint feature and sending them to the personalized VAD model, so that the speech segments of non-target speakers at both ends are removed and the speech segment of the target speaker is obtained.
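A minimal sketch of turning the frame-level VAD output into the target speaker's speech boundary is shown below; the 0.5 probability threshold and the 10 ms frame shift are assumptions, the embodiment only requiring comparison with a target probability threshold:

import numpy as np

def frames_to_boundaries(frame_probs, threshold=0.5, frame_shift_ms=10.0):
    """frame_probs: [num_frames] target-speaker probabilities output by the
    personalized VAD model; returns a list of (start_sec, end_sec) segments
    belonging to the target speaker."""
    is_target = np.asarray(frame_probs) >= threshold    # first speech frames
    boundaries, start = [], None
    for i, flag in enumerate(is_target):
        if flag and start is None:
            start = i
        elif not flag and start is not None:
            boundaries.append((start * frame_shift_ms / 1000.0, i * frame_shift_ms / 1000.0))
            start = None
    if start is not None:
        boundaries.append((start * frame_shift_ms / 1000.0, len(is_target) * frame_shift_ms / 1000.0))
    return boundaries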
And step S606, the speech segment of the target speaker is sent to the ASR model and the spoken language evaluation model for final output.
For the speech segment of the target speaker, the corresponding Fbank features can be extracted (or reused) and fed into the ASR model for decoding, probability information of the specified pronunciation units is obtained, and the spoken language evaluation is scored according to this probability information.
According to this method, noise reduction processing is carried out at the front end of the spoken language evaluation: the final target speaker speech segment is obtained by a personalized VAD system combined with the voiceprint of the target speaker, and evaluation scoring is then performed on this segment. Since the speech segments of non-target speakers at both ends are removed, the voice data to be evaluated is effectively denoised, the accuracy and efficiency of the spoken language evaluation are improved, and the cost of the spoken language evaluation can be reduced.
It should be noted that, for simplicity of description, the above-mentioned method embodiments are described as a series of acts or combination of acts, but those skilled in the art will recognize that the present application is not limited by the order of acts described, as some steps may occur in other orders or concurrently depending on the application. Further, those skilled in the art should also appreciate that the embodiments described in the specification are preferred embodiments and that the acts and modules referred to are not necessarily required in this application.
Through the above description of the embodiments, those skilled in the art can clearly understand that the method according to the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but the former is a better implementation mode in many cases. Based on such understanding, the technical solutions of the present application may be embodied in the form of a software product, which is stored in a storage medium (e.g., ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a terminal device (e.g., a mobile phone, a computer, a server, or a network device) to execute the method according to the embodiments of the present application.
According to another aspect of the embodiment of the application, a spoken language evaluation device of the object is further provided, wherein the spoken language evaluation device is used for implementing the spoken language evaluation method of the object. Fig. 7 is a block diagram of a structure of an optional apparatus for evaluating spoken language of an object according to an embodiment of the present application, and as shown in fig. 7, the apparatus may include:
(1) a first obtaining unit 702, configured to obtain a target voiceprint feature of a target object and a first voice feature of voice data to be evaluated, where the voice data to be evaluated is voice data used for performing spoken language evaluation on the target object;
(2) a first determining unit 704, connected to the first obtaining unit 702, configured to determine a target speech boundary of a target object in the speech data to be evaluated according to the target voiceprint feature and the first speech feature;
(3) the second obtaining unit 706 is connected to the first determining unit 704, and configured to obtain, according to the target speech boundary, first speech data belonging to the target object from the speech data to be evaluated;
(4) the evaluating unit 708 is connected to the second obtaining unit 706, and configured to perform spoken language evaluation on the target object by using the first speech data to obtain a target evaluation result of the target object.
It should be noted that the first obtaining unit 702 in this embodiment may be configured to execute the step S402, the first determining unit 704 in this embodiment may be configured to execute the step S404, the second obtaining unit 706 in this embodiment may be configured to execute the step S406, and the evaluating unit 708 in this embodiment may be configured to execute the step S408.
Acquiring a target voiceprint feature of a target object and a first voice feature of voice data to be evaluated through the module, wherein the voice data to be evaluated is voice data used for carrying out spoken language evaluation on the target object; determining a target voice boundary of the target object in the voice data to be evaluated according to the target voiceprint characteristic and the first voice characteristic; acquiring first voice data belonging to the target object from the voice data to be evaluated according to the target voice boundary; the first voice data is used for carrying out spoken language evaluation on the target object to obtain a target evaluation result of the target object, so that the problems that the noise reduction mode in the voice evaluation method in the related technology is high in implementation cost, complex in implementation process and incapable of effectively eliminating background voice are solved, the spoken language evaluation cost is reduced, the implementation process is simplified, and the effectiveness of removing the background voice is improved.
As an alternative embodiment, the first obtaining unit 702 includes:
the first extraction module is used for extracting target voiceprint characteristics of a target object from a voiceprint library, wherein the voiceprint library stores the voiceprint characteristics of a plurality of objects, and the plurality of objects comprise the target object;
and the second extraction module is used for extracting the voice features of the voice data to be evaluated to obtain the first voice features.
As an alternative embodiment, the apparatus further comprises:
the display unit is used for displaying registration prompt information through a client of the target object before the target voiceprint characteristics of the target object are extracted from the voiceprint library, wherein the registration prompt information is used for prompting the target object to register the voiceprint;
the receiving unit is used for receiving second voice data returned by the client, wherein the second voice data is the voice data input by the target object responding to the registration prompt information;
the extracting unit is used for extracting the voiceprint characteristics of the second voice data to obtain target voiceprint characteristics;
and the storage unit is used for storing the target voiceprint characteristics into the voiceprint library.
As an alternative embodiment, the first determining unit 704 includes:
and the first input module is used for inputting the target voiceprint characteristic and the first voice characteristic into the target voice activity detection model to obtain a target voice boundary output by the target voice activity detection model, wherein the target voice activity detection model is obtained by training the initial voice activity detection model by using the voiceprint characteristic of the training object and the voice characteristic of the training voice data, and the training voice data is the voice data marked with the voice boundary of the training object.
As an alternative embodiment, the first input module comprises:
the input submodule is used for inputting the target voiceprint characteristic and the first voice characteristic into the target voice activity detection model to obtain the probability that each voice frame in the voice data to be evaluated belongs to the target object, wherein the probability is determined by the target voice activity detection model;
the determining submodule is used for determining a first voice frame belonging to a target object in the voice data to be evaluated, wherein the first voice frame is a voice frame of which the probability of belonging to the target object in the voice data to be evaluated is greater than or equal to a target probability threshold;
and the output submodule is used for outputting the target voice boundary of the target object in the voice data to be evaluated according to the first voice frame.
As an alternative embodiment, the apparatus further comprises:
the third acquisition unit is used for acquiring the voiceprint characteristics of the training object and the voice characteristics of the training voice data before determining the target voice boundary of the target object in the voice data to be evaluated according to the target voiceprint characteristics and the first voice characteristics;
a second determining unit, configured to determine a second speech frame and a third speech frame according to a speech boundary of a training object in training speech data, where the second speech frame is a speech frame belonging to the training object in the training speech data, and the third speech frame is a speech frame other than the second speech frame in the training speech data;
the input unit is used for inputting the voiceprint characteristics of the training object and the voice characteristics of the training voice data into the initial voice activity detection model to obtain the probability that each voice frame in the training voice data output by the initial voice activity detection model belongs to the training object;
and the adjusting unit is used for adjusting the model parameters of the initial voice activity detection model to obtain a target voice activity detection model, wherein the probability that the second voice frame output by the target voice activity detection model belongs to the training object is greater than or equal to a target probability threshold, and the probability that the third voice frame belongs to the training object is smaller than the target probability threshold.
As an alternative embodiment, the evaluation unit 708 includes:
the acquisition module is used for acquiring a second voice characteristic of the first voice data;
the second input module is used for inputting the second voice characteristics into the voice recognition model to obtain a decoding result output by the voice recognition model, wherein the decoding result is used for indicating the probability that each pronunciation unit of the first voice data is the corresponding target pronunciation unit;
and the determining module is used for determining a target evaluation result of the target object according to the probability that each pronunciation unit indicated by the decoding result is the corresponding target pronunciation unit.
It should be noted here that the examples and application scenarios implemented by the above modules are the same as those of the corresponding steps, but are not limited to the disclosure of the above embodiments. It should also be noted that the above modules, as a part of the apparatus, may be operated in the hardware environment shown in fig. 1, and may be implemented by software or by hardware, where the hardware environment includes a network environment.
According to another aspect of the embodiments of the present application, there is also provided an electronic device for implementing the method for evaluating the spoken language of the object, where the electronic device may be a server, a terminal, or a combination thereof.
Fig. 8 is a block diagram of an alternative electronic device according to an embodiment of the present application, and as shown in fig. 8, the electronic device includes a memory 802 and a processor 804, the memory 802 stores a computer program, and the processor 804 is configured to execute steps in any of the method embodiments described above through the computer program.
Optionally, in this embodiment, the electronic apparatus may be located in at least one network device of a plurality of network devices of a computer network.
Optionally, in this embodiment, the processor may be configured to execute the following steps by a computer program:
s1, acquiring a target voiceprint feature of a target object and a first voice feature of voice data to be evaluated, wherein the voice data to be evaluated is voice data used for carrying out spoken language evaluation on the target object;
s2, determining a target voice boundary of a target object in the voice data to be evaluated according to the target voiceprint characteristic and the first voice characteristic;
s3, acquiring first voice data belonging to a target object from the voice data to be evaluated according to the target voice boundary;
and S4, carrying out spoken language evaluation on the target object by using the first voice data to obtain a target evaluation result of the target object.
The memory 802 may be used to store software programs and modules, such as program instructions/modules corresponding to the method and apparatus for evaluating the spoken language of the object in the embodiment of the present invention, and the processor 804 executes various functional applications and data processing by running the software programs and modules stored in the memory 802, so as to implement the method for evaluating the spoken language of the object. The memory 802 may include high-speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, the memory 802 can further include memory located remotely from the processor 804, which can be connected to the terminal over a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof. The memory 802 may be, but is not limited to being, used to store voice data, voice prints, object information, model data, and the like, among other things.
As an example, as shown in fig. 8, the memory 802 may include, but is not limited to, a first obtaining unit 702, a first determining unit 704, a second obtaining unit 706, and an evaluating unit 708 of the spoken language evaluating apparatus including the object. In addition, the device may further include, but is not limited to, other module units in the spoken language evaluation device of the above object, which is not described in this example again.
Optionally, the transmission device 806 is configured to receive or transmit data via a network. Examples of the network may include a wired network and a wireless network. In one example, the transmission device 806 includes a network adapter (Network Interface Controller, NIC), which can be connected to a router and other network devices via a network cable so as to communicate with the internet or a local area network. In one example, the transmission device 806 is a Radio Frequency (RF) module, which is used for communicating with the internet in a wireless manner.
In addition, the electronic device further includes: a display 808 for a display interface of the object client; and a connection bus 810 for connecting the respective module components in the electronic device.
Optionally, the specific examples in this embodiment may refer to the examples described in the above embodiments, and this embodiment is not described herein again.
It can be understood by those skilled in the art that the structure shown in fig. 8 is only an illustration, and the device implementing the method for evaluating the spoken language of the object may be a terminal device such as a smart phone (e.g., an Android phone or an iOS phone), a tablet computer, a palm computer, a Mobile Internet Device (MID), a PAD, and the like. Fig. 8 does not limit the structure of the electronic device. For example, the terminal device may include more or fewer components (e.g., network interfaces, display devices, etc.) than shown in fig. 8, or have a different configuration from that shown in fig. 8.
Those skilled in the art will appreciate that all or part of the steps in the methods of the above embodiments may be implemented by a program instructing hardware associated with the terminal device, where the program may be stored in a computer-readable storage medium, and the storage medium may include: flash disks, Read-Only memories (ROMs), Random Access Memories (RAMs), magnetic or optical disks, and the like.
According to still another aspect of an embodiment of the present application, there is also provided a storage medium. Alternatively, in this embodiment, the storage medium may be used to execute a program code of a spoken language evaluation method of an object.
Optionally, in this embodiment, the storage medium may be located on at least one of a plurality of network devices in a network shown in the above embodiment.
Optionally, in this embodiment, the storage medium is configured to store program code for performing the following steps:
s1, acquiring a target voiceprint feature of a target object and a first voice feature of voice data to be evaluated, wherein the voice data to be evaluated is voice data used for carrying out spoken language evaluation on the target object;
s2, determining a target voice boundary of a target object in the voice data to be evaluated according to the target voiceprint characteristic and the first voice characteristic;
s3, acquiring first voice data belonging to a target object from the voice data to be evaluated according to the target voice boundary;
and S4, carrying out spoken language evaluation on the target object by using the first voice data to obtain a target evaluation result of the target object.
Optionally, the specific example in this embodiment may refer to the example described in the above embodiment, which is not described again in this embodiment.
Optionally, in this embodiment, the storage medium may include, but is not limited to: various media capable of storing program codes, such as a U disk, a ROM, a RAM, a removable hard disk, a magnetic disk, or an optical disk.
The above-mentioned serial numbers of the embodiments of the present application are merely for description and do not represent the merits of the embodiments.
The integrated unit in the above embodiments, if implemented in the form of a software functional unit and sold or used as a separate product, may be stored in the above computer-readable storage medium. Based on such understanding, the technical solution of the present application, in essence, or the part contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product stored in a storage medium, which includes instructions for causing one or more computer devices (which may be personal computers, servers, network devices, or the like) to execute all or part of the steps of the methods described in the embodiments of the present application.
In the above embodiments of the present application, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
In the several embodiments provided in the present application, it should be understood that the disclosed client may be implemented in other manners. The above-described embodiments of the apparatus are merely illustrative, and for example, the division of the units is only one type of division of logical functions, and there may be other divisions when actually implemented, for example, a plurality of units or components may be combined or may be integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, units or modules, and may be in an electrical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution provided in the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The foregoing is only a preferred embodiment of the present application and it should be noted that those skilled in the art can make several improvements and modifications without departing from the principle of the present application, and these improvements and modifications should also be considered as the protection scope of the present application.

Claims (10)

1. A method for evaluating a spoken language of an object is characterized by comprising the following steps:
acquiring a target voiceprint feature of a target object and a first voice feature of voice data to be evaluated, wherein the voice data to be evaluated is voice data used for carrying out spoken language evaluation on the target object;
determining a target voice boundary of the target object in the voice data to be evaluated according to the target voiceprint characteristic and the first voice characteristic;
acquiring first voice data belonging to the target object from the voice data to be evaluated according to the target voice boundary;
and carrying out spoken language evaluation on the target object by using the first voice data to obtain a target evaluation result of the target object.
2. The method according to claim 1, wherein the obtaining of the target voiceprint feature of the target object and the first speech feature of the speech data to be evaluated comprises:
extracting the target voiceprint features of the target object from a voiceprint library, wherein the voiceprint library stores the voiceprint features of a plurality of objects, and the plurality of objects comprise the target object;
and extracting the voice features of the voice data to be evaluated to obtain the first voice features.
3. The method of claim 2, wherein prior to said extracting the target voiceprint features of the target object from a voiceprint library, the method further comprises:
displaying registration prompt information through a client of the target object, wherein the registration prompt information is used for prompting the target object to register voiceprint;
receiving second voice data returned by the client, wherein the second voice data is the voice data input by the target object responding to the registration prompt message;
extracting the voiceprint characteristics of the second voice data to obtain the target voiceprint characteristics;
and saving the target voiceprint characteristics into the voiceprint library.
4. The method according to claim 1, wherein the determining the target speech boundary of the target object in the speech data to be evaluated according to the target voiceprint feature and the first speech feature comprises:
inputting the target voiceprint feature and the first voice feature into a target voice activity detection model to obtain the target voice boundary output by the target voice activity detection model, wherein the target voice activity detection model is obtained by training an initial voice activity detection model by using a voiceprint feature of a training object and a voice feature of training voice data, and the training voice data is voice data marked with the voice boundary of the training object.
5. The method of claim 4, wherein inputting the target voiceprint feature and the first speech feature into a target voice activity detection model, and obtaining the target speech boundary output by the target voice activity detection model comprises:
inputting the target voiceprint feature and the first voice feature into the target voice activity detection model to obtain the probability that each voice frame in the voice data to be evaluated belongs to the target object, wherein the probability is determined by the target voice activity detection model;
determining a first voice frame belonging to the target object in the voice data to be evaluated, wherein the first voice frame is a voice frame of which the probability of belonging to the target object in the voice data to be evaluated is greater than or equal to a target probability threshold;
and outputting the target voice boundary of the target object in the voice data to be evaluated according to the first voice frame.
6. The method according to claim 4, wherein before said determining a target speech boundary of said target object in said speech data to be evaluated according to said target voiceprint feature and said first speech feature, said method further comprises:
acquiring voiceprint characteristics of the training object and voice characteristics of the training voice data;
determining a second voice frame and a third voice frame according to the voice boundary of the training object in the training voice data, wherein the second voice frame is a voice frame belonging to the training object in the training voice data, and the third voice frame is other voice frames except the second voice frame in the training voice data;
inputting the voiceprint feature of the training object and the voice feature of the training voice data into the initial voice activity detection model to obtain the probability that each voice frame in the training voice data output by the initial voice activity detection model belongs to the training object;
and adjusting model parameters of the initial voice activity detection model to obtain the target voice activity detection model, wherein the probability that the second voice frame output by the target voice activity detection model belongs to the training object is greater than or equal to a target probability threshold, and the probability that the third voice frame belongs to the training object is smaller than the target probability threshold.
7. The method according to any one of claims 1 to 6, wherein the performing spoken language assessment on the target object by using the first speech data to obtain a target assessment result of the target object comprises:
acquiring a second voice characteristic of the first voice data;
inputting the second voice characteristics into a voice recognition model to obtain a decoding result output by the voice recognition model, wherein the decoding result is used for indicating the probability that each pronunciation unit of the first voice data is a corresponding target pronunciation unit;
and determining the target evaluation result of the target object according to the probability that each pronunciation unit indicated by the decoding result is the corresponding target pronunciation unit.
8. A spoken language evaluation apparatus for an object, comprising:
the device comprises a first obtaining unit, a second obtaining unit and a third obtaining unit, wherein the first obtaining unit is used for obtaining a target voiceprint feature of a target object and a first voice feature of voice data to be evaluated, and the voice data to be evaluated is voice data used for carrying out spoken language evaluation on the target object;
the first determining unit is used for determining a target voice boundary of the target object in the voice data to be evaluated according to the target voiceprint feature and the first voice feature;
the second obtaining unit is used for obtaining first voice data belonging to the target object from the voice data to be evaluated according to the target voice boundary;
and the evaluation unit is used for carrying out spoken language evaluation on the target object by using the first voice data to obtain a target evaluation result of the target object.
9. A computer-readable storage medium, in which a computer program is stored, wherein the computer program is configured to carry out the method of any one of claims 1 to 7 when executed.
10. An electronic device comprising a memory and a processor, characterized in that the memory has stored therein a computer program, the processor being arranged to execute the method of any of claims 1 to 7 by means of the computer program.