CN108510981B - Method and system for acquiring voice data

Info

Publication number
CN108510981B
Authority
CN
China
Prior art keywords: voice data, voice, recognition model, application object, user
Legal status: Active
Application number
CN201810324045.2A
Other languages
Chinese (zh)
Other versions
CN108510981A (en)
Inventor
谢晖
Current Assignee
Samsung Electronics China R&D Center
Samsung Electronics Co Ltd
Original Assignee
Samsung Electronics China R&D Center
Samsung Electronics Co Ltd
Priority date: 2018-04-12
Filing date: 2018-04-12
Publication date: 2020-07-24
Application filed by Samsung Electronics China R&D Center and Samsung Electronics Co Ltd
Priority to CN201810324045.2A (granted as CN108510981B)
Publication of CN108510981A
Priority to KR1020190035388A (published as KR20190119521A)
Priority to US16/382,712 (granted as US10984795B2)
Application granted
Publication of CN108510981B

Classifications

    • G10L 15/06 Creation of reference templates; training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/063 Training
    • G10L 17/04 Speaker identification or verification: training, enrolment or model building
    • G06F 21/32 User authentication using biometric data, e.g. fingerprints, iris scans or voiceprints
    • G06F 3/16 Sound input; sound output
    • G06V 40/12 Fingerprints or palmprints
    • G06V 40/16 Human faces, e.g. facial parts, sketches or expressions
    • G10L 15/02 Feature extraction for speech recognition; selection of recognition unit
    • G10L 15/16 Speech classification or search using artificial neural networks
    • G10L 17/24 Interactive procedures; man-machine interfaces: the user being prompted to utter a password or a predefined phrase
    • H04M 1/724 User interfaces specially adapted for cordless or mobile telephones
    • H04M 2201/34 Microprocessors
    • H04M 2201/36 Memories
    • H04M 2201/38 Displays
    • H04M 2250/12 Details of telephonic subscriber devices including a sensor for measuring a physical value, e.g. temperature or motion


Abstract

The invention provides a method and a system for acquiring voice data. When a user makes a voice call, the voice data streams transmitted in real time in the intelligent terminal system are saved: the input voice data stream of the microphone is saved as first voice data, and the output voice data stream of the receiver is saved as second voice data. It is then detected whether the first voice data and the second voice data meet the training requirements of a voice recognition model. If so, it is further judged whether the first voice data comes from the application object of the voice recognition model; if it does, the first voice data is marked as application object voice data and the second voice data as non-application object voice data; if not, both are marked as non-application object voice data. By improving the voice acquisition process, the method reduces the user's burden of training the voice recognition model and improves the user experience.

Description

Method and system for acquiring voice data
Technical Field
The invention relates to the field of artificial intelligence, in particular to a method and a system for acquiring voice data.
Background
Speech recognition on mobile terminals falls into two categories: semantic recognition and speaker recognition.
Speaker recognition is commonly referred to as voiceprint recognition. It is generally divided into text-dependent and text-independent recognition.
Text-dependent speaker recognition typically requires the user to read a fixed phrase aloud two or three times; the relevant feature information is recorded as an enrollment. In use, the user must utter the same fixed phrase for the voice judgment (predict).
Text-independent speaker recognition does not require the user to repeat fixed sentences. The user supplies a large amount of voice data as machine-learning training input (train), and the user's feature information is highly distilled by training on this large volume of data. The training data must contain both the voice of the user (the application object of the voice recognition model) and the voice of other people. No fixed phrase needs to be read during voice judgment; ordinary speech suffices.
In the prior art, the mobile intelligent terminal does not distinguish user identities during speech recognition, nor the voice feature values of different users, so the same terminal serves the voice instructions of different users indiscriminately; confidentiality and specificity are poor.
Taking the voice assistant as an example, existing mobile intelligent terminals require a fixed wake-up procedure before the voice assistant service is enabled. This is a drawback of text-dependent speech recognition: bound to fixed text, it cannot respond quickly to an arbitrary voice instruction from the user (the application object), and voice instructions become available only after the assistant has been woken up. Moreover, any user can wake the voice assistant with the fixed phrase and issue voice instructions; the assistant cannot recognize the speaker's identity by voice, so every instruction is executed.
Text-independent speech recognition uses machine learning: by building a complete learning model and training it on a large amount of input voice data, highly distilled user feature information and model parameters are obtained. Based on the trained model, speaker recognition with high accuracy can be achieved from any voice input, free of the fixed-text limitation.
However, realizing text-independent speech recognition on a mobile intelligent terminal requires acquiring a large amount of voice data from registered and non-registered persons. The training process is long and tedious, which makes the user experience a great challenge: users do not want to spend time and effort inputting voice data. In addition, obtaining voice data from people other than the application object of the voice recognition model is an awkward problem for the end user. Without sufficient training data, high recognition accuracy cannot be achieved. For these reasons, existing mobile intelligent terminals have no text-independent speech recognition system.
No effective solution has yet been proposed for the above problems, in particular for a method of acquiring voice data for a text-independent speech recognition model on a terminal.
Disclosure of Invention
The invention provides a method and a system for acquiring voice data that reduce the user's burden by improving the voice data acquisition process.
The invention provides a method for acquiring voice data, wherein the voice data is used for training a voice recognition model, the method comprising the following steps:
Step A-1: when a user carries out voice call, a voice data stream transmitted in real time in the intelligent terminal system is stored, an input voice data stream of a microphone is stored as first voice data, and an output voice data stream of a receiver is stored as second voice data;
step A-2: detecting whether the first voice data and the second voice data meet the training requirements of the voice recognition model, if so, executing the step A-3;
step A-3: judging whether the first voice data come from an application object of the voice recognition model, if so, executing the step A-4, and if not, executing the step A-5;
step A-4: marking the first voice data as application object voice data, marking the second voice data as non-application object voice data, and using the application object voice data for the voice feature learning of an application object in a voice recognition model; the non-application object voice data is used for learning the voice characteristics of the non-application object in the voice recognition model;
step A-5: the first voice data and the second voice data are marked as non-application object voice data.
The invention also provides a system for acquiring voice data, wherein the voice data is used for training a voice recognition model, and the system comprises:
a storage module: when a user carries out voice call, a voice data stream transmitted in real time in the intelligent terminal system is stored, an input voice data stream of a microphone is stored as first voice data, and an output voice data stream of a receiver is stored as second voice data;
a detection module: detecting whether the first voice data and the second voice data meet the training requirements of the voice recognition model, if so, executing a user judgment module;
a user judgment module: judging whether the first voice data comes from an application object of the voice recognition model, if so, executing a voice object marking module 1, and otherwise, executing a voice object marking module 2;
the voice object marking module 1: marking the first voice data as application object voice data, marking the second voice data as non-application object voice data, and using the application object voice data for the voice feature learning of an application object in the voice recognition model; the non-application object voice data is used for learning the voice features of the non-application object in the voice recognition model;
the voice object marking module 2: the first voice data and the second voice data are marked as non-application object voice data.
According to the invention, the voice data generated while the user makes voice calls is saved: the input voice data of the microphone (the first voice data) is used for learning the voice features of the application object in the voice recognition model, and the output voice data of the receiver (the second voice data) is used for learning the voice features of non-application objects. The training voice data is delivered to the voice recognition model silently, in the background of the mobile intelligent terminal, so the user is spared tedious and cumbersome input work; the user's training burden is reduced and the user experience is improved. Meanwhile, the method and system can be applied to any neural-network-based speech recognition model and therefore have a wide application range. With the voice data acquisition method and system of the invention, text-independent speech recognition can be realized on the mobile intelligent terminal, breaking the limitation of existing text-dependent speech recognition; the terminal can understand each user's characteristics and usage habits more intelligently, and specificity and security are enhanced.
Drawings
FIG. 1 is a flow chart of a method of obtaining speech data according to the present invention;
FIG. 2 is one embodiment of FIG. 1;
FIG. 3 is a block diagram of a voice data acquisition system of the present invention;
FIG. 4 is an embodiment of FIG. 3.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be described in detail with reference to the accompanying drawings and specific embodiments.
Fig. 1 is a flowchart of the method for acquiring voice data according to the present invention, which includes the following steps (a code sketch follows the list):
step A-1 (S101): when a user carries out voice call, a voice data stream transmitted in real time in the intelligent terminal system is stored, an input voice data stream of a microphone is stored as first voice data, and an output voice data stream of a receiver is stored as second voice data;
step A-2 (S102): detecting whether the first voice data and the second voice data meet the training requirements of the voice recognition model, if so, executing the step A-3;
step A-3 (S103): judging whether the first voice data come from an application object of the voice recognition model, if so, executing the step A-4, and if not, executing the step A-5;
step A-4 (S104): marking the first voice data as application object voice data, marking the second voice data as non-application object voice data, and using the application object voice data for the voice feature learning of an application object in a voice recognition model; the non-application object voice data is used for learning the voice characteristics of the non-application object in the voice recognition model;
step A-5 (S105): the first voice data and the second voice data are marked as non-application object voice data.
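Expressed as code, the labeling logic of steps A-1 to A-5 can be sketched as follows. This is a minimal illustration only: the invention defines no programming interface, so every type and helper name here (voice_buf_t, meets_training_requirements, is_application_object, label_for_training) is a hypothetical stand-in.

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    typedef struct {
        int16_t *samples;   /* 16-bit PCM samples */
        size_t   count;     /* number of samples */
    } voice_buf_t;

    typedef enum { APPLICATION_OBJECT, NON_APPLICATION_OBJECT } voice_label_t;

    /* Assumed helpers: the step A-2 quality gate and the step A-3 speaker check. */
    bool meets_training_requirements(const voice_buf_t *v);
    bool is_application_object(const voice_buf_t *v);
    /* Hands a labeled recording to the voice recognition model for training. */
    void label_for_training(const voice_buf_t *v, voice_label_t label);

    /* first  = microphone input stream (the local user, step A-1)
       second = receiver output stream (the remote party, step A-1) */
    void acquire_voice_data(const voice_buf_t *first, const voice_buf_t *second)
    {
        /* Step A-2: both recordings must meet the training requirements. */
        if (!meets_training_requirements(first) ||
            !meets_training_requirements(second))
            return;

        /* Steps A-3/A-4/A-5: the receiver side is always non-application-object
           voice; the microphone side depends on who is speaking locally. */
        if (is_application_object(first)) {
            label_for_training(first,  APPLICATION_OBJECT);      /* step A-4 */
            label_for_training(second, NON_APPLICATION_OBJECT);
        } else {
            label_for_training(first,  NON_APPLICATION_OBJECT);  /* step A-5 */
            label_for_training(second, NON_APPLICATION_OBJECT);
        }
    }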
In step A-1, the voice call includes not only ordinary audio and video calls, such as VoIP and VoLTE calls, but also the real-time audio/video calls of other instant-messaging apps, such as a WeChat "video chat" or "voice chat".
When the user starts a voice call, execution of the method of fig. 1 is triggered. When applied to WeChat or QQ, the method of fig. 1 is triggered when a corresponding action is detected, for example when a "video chat" or "voice chat" button is pressed or activated.
In step A-1, the work of saving the voice data streams can be placed in the hardware device operation layer of the mobile intelligent terminal's operating system. When the user starts a voice call, the input voice data of the microphone and the output voice data of the receiver are backed up in real time in the hardware operation layer of the system. The input voice data of the microphone represents the voice of the terminal user; the output voice data of the receiver represents the voice transmitted in real time to the terminal user by the opposite end.
Taking the Android system as an example, the operation layer of the hardware device is the Android HAL. The call state can be determined from the call_connected attribute of tiny_audio_device in the AudioHAL: when adev->call_connected is true, the call is open and a voice call is in progress.
In the AudioHAL, when the input device of audio_hw_device is AUDIO_DEVICE_IN_BUILTIN_MIC, the built-in microphone is currently in the operating state.
Further, the voice data about to be output by the receiver is backed up in the out_write() function of the AudioHAL, write corresponding to the playback path.
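As a rough illustration of the tap itself, assuming a tinyalsa-style AudioHAL, the backup could sit in the capture and playback paths as sketched below; tap_save() and the simplified function signatures are stand-ins for illustration, not the real HAL definitions.

    #include <stdbool.h>
    #include <stddef.h>
    #include <sys/types.h>

    struct tiny_audio_device { bool call_connected; /* ... */ };

    /* Hypothetical helper: appends PCM data to the first (microphone) or
       second (receiver) voice data buffer. */
    void tap_save(const void *pcm, size_t bytes, bool is_mic_side);

    /* Playback path ("write"): data about to reach the receiver = second voice data. */
    ssize_t out_write(struct tiny_audio_device *adev, const void *buffer, size_t bytes)
    {
        if (adev->call_connected)            /* back up only during a voice call */
            tap_save(buffer, bytes, false);
        /* ... pass the buffer on to the real PCM write ... */
        return (ssize_t)bytes;
    }

    /* Capture path ("read"): microphone input = first voice data. */
    ssize_t in_read(struct tiny_audio_device *adev, void *buffer, size_t bytes)
    {
        /* ... fill the buffer from the real PCM read ... */
        if (adev->call_connected)
            tap_save(buffer, bytes, true);
        return (ssize_t)bytes;
    }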
In addition, in the step A-1, the first voice data and the second voice data can be saved in a ROM or a RAM of the mobile intelligent terminal.
The voice feature extraction of a neural-network speech recognition model requires a large amount of personal speech to be input in advance for training, so as to obtain the speaker's voiceprint features. In existing methods, a dedicated program has the user input voice data sentence by sentence for training; this consumes the user's extra time purely for personal voice feature training, and the process is cumbersome and tedious.
In the training method shown in fig. 1, the intelligent terminal collects the instant-messaging voice data that the user generates in daily work and life, and the collected voice data is used to train the voice recognition model, continuously improving its recognition accuracy. Compared with the prior art, the user no longer needs to do tedious input work; the training burden on the user is reduced and the user experience is improved.
Meanwhile, with the training method shown in fig. 1, voice data accumulates day by day, so text-independent speech recognition can be realized on the mobile intelligent terminal, breaking the limitation of existing text-dependent speech recognition; the terminal can understand each user's characteristics and usage habits more intelligently, and specificity and security are enhanced.
Taking the voice assistant as an example, once a text-independent speech recognition function is added, the user's identity can be recognized, and only the voice instructions of the application object of the voice recognition model are processed, enhancing security and specificity.
In step A-2 of fig. 1 of the present application, whether the first voice data and the second voice data meet the model training requirements is detected. For example, it may first be detected whether the first and second voice data contain non-silent features; if not, no voice was recorded in them and the model training requirement is not met. If so, it is further detected whether the voice in the first and second voice data is clear; voice that is not clear does not help model training and does not meet the requirement either.
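Purely as an assumption (the invention does not fix concrete detection criteria), such a check could combine an RMS energy gate for the non-silent feature with a crude clipping test as a clarity proxy; both thresholds below are illustrative.

    #include <math.h>
    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    static double rms(const int16_t *samples, size_t count)
    {
        double acc = 0.0;
        for (size_t i = 0; i < count; i++)
            acc += (double)samples[i] * (double)samples[i];
        return count ? sqrt(acc / (double)count) : 0.0;
    }

    /* A possible concrete form of the step A-2 gate, shown on a raw sample buffer. */
    bool meets_training_requirements(const int16_t *samples, size_t count)
    {
        const double SILENCE_RMS   = 200.0;  /* assumed silence floor, 16-bit PCM */
        const double MAX_CLIP_RATE = 0.01;   /* assumed tolerated clipping fraction */

        /* Non-silent feature: overall energy must exceed the silence floor. */
        if (rms(samples, count) < SILENCE_RMS)
            return false;

        /* Clarity proxy: heavily clipped (distorted) audio is rejected. */
        size_t clipped = 0;
        for (size_t i = 0; i < count; i++)
            if (samples[i] >= 32700 || samples[i] <= -32700)
                clipped++;
        return count && (double)clipped / (double)count <= MAX_CLIP_RATE;
    }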
Optionally, in step A-2 of fig. 1, the first voice data and the second voice data may further be subjected to voice cleaning before step A-3 is executed. The voice cleaning includes operations such as denoising, so that the first and second voice data have better quality, which in turn improves the model training effect.
In step A-3 of fig. 1 of the present application, whether the voice of the first voice data comes from the application object of the voice recognition model is judged; face recognition, fingerprint judgment, or an output dialog box may be used to let the user confirm the identity. Taking face recognition as an example, the camera actively collects the user's face information, and comparison decides whether the user is the application object of the voice recognition model; if collection fails, the user is prompted to provide input. For fingerprint judgment, the user is generally prompted to input the fingerprint of a designated finger, and comparison decides whether the user is the application object of the voice recognition model.
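As a sketch of how these probes could be chained, assuming simple helper primitives (all names hypothetical), each probe reports whether it reached a decision at all before the next one is tried:

    #include <stdbool.h>

    /* Each probe returns true if it reached a decision and writes that
       decision to *is_app; returns false if, e.g., face capture failed. */
    bool try_face_recognition(bool *is_app);
    bool try_fingerprint_check(bool *is_app);
    bool show_confirmation_dialog(void);   /* asks the user directly */

    /* Step A-3: decide whether the local speaker is the application object. */
    bool is_application_object_interactive(void)
    {
        bool is_app;
        if (try_face_recognition(&is_app))
            return is_app;
        if (try_fingerprint_check(&is_app))
            return is_app;
        return show_confirmation_dialog(); /* last resort: output dialog box */
    }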
In fig. 1 of the present application, to save training time of the voice recognition model, the generic feature modules, in particular the non-application-object voice feature module, may be trained in advance.
On the other hand, to avoid occupying terminal system resources and leaking user privacy, the first voice data and the second voice data are used to train the voice recognition model immediately after step A-4 or step A-5; after training ends, if the first and second voice data have not been updated, they are cleared and the flow of fig. 1 exits. Alternatively, the related voice data is cleared as soon as training ends and the flow of fig. 1 exits.
Fig. 2 extends the method of fig. 1 and shows an embodiment of a specific application, comprising the following steps (a code sketch follows the list):
step A-11 (S201): when the user makes a voice call, the input voice data stream of the microphone is saved as third voice data, and the output voice data stream of the receiver is saved as fourth voice data;
step A-12 (S202): when the third voice data reaches the preset duration, the first voice data is set equal to the third voice data and the third voice data is emptied; step A-2 is executed and the flow returns to step A-11;
step A-13 (S203): when the voice of the fourth voice data reaches the preset duration, the second voice data is set equal to the fourth voice data and the fourth voice data is emptied; step A-2 is executed and the flow returns to step A-11;
step A-2 (S204): whether the first voice data and the second voice data meet the training requirements of the voice recognition model is detected; if so, step A-31 is executed;
step A-31 (S205): the voice recognition model itself is used to judge whether the first voice data comes from its application object, and the confidence of the judgment result is output; if the confidence is less than a threshold, step A-32 is executed; if the judgment result is the application object of the voice recognition model and the confidence is greater than or equal to the threshold, step A-4 is executed; if the judgment result is not the application object of the voice recognition model and the confidence is greater than or equal to the threshold, step A-5 is executed;
step A-32 (S206): during the voice call, whether the user has already confirmed the identity is judged; if not, the user is asked to confirm whether he or she is the application object of the voice recognition model, and the confirmation result is recorded; if the user is the application object of the voice recognition model, step A-4 is executed, otherwise step A-5 is executed;
step A-4 (S207): the first voice data is marked as application object voice data and the second voice data as non-application object voice data; the application object voice data is used for learning the voice features of the application object in the voice recognition model, and the non-application object voice data for learning the voice features of non-application objects;
step A-5 (S208): the first voice data and the second voice data are both marked as non-application object voice data.
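Under the same caveats as the earlier sketches, the fig. 2 loop might look as follows: the rolling hand-over of steps A-12/A-13 and the confidence fallback of steps A-31/A-32 are shown, with model_judge(), ask_user_confirmation() and the constants as assumed names.

    #include <stdbool.h>

    #define PRESET_SECONDS 10.0   /* assumed here; see the preset-duration rule below */

    typedef struct { /* ... PCM buffer ... */ double seconds; } voice_buf_t;

    /* Assumed model API: returns the application-object judgment and writes
       the confidence of that judgment. */
    bool model_judge(const voice_buf_t *v, double *confidence);
    bool user_already_confirmed(void);
    bool recorded_confirmation(void);
    bool ask_user_confirmation(void);
    void train_as_application_object(const voice_buf_t *v);
    void train_as_non_application_object(const voice_buf_t *v);

    /* Steps A-12/A-13: hand a rolling buffer over once it reaches the preset
       duration, then empty it so recording can continue. */
    bool handover_if_ready(voice_buf_t *rolling, voice_buf_t *out)
    {
        if (rolling->seconds < PRESET_SECONDS)
            return false;
        *out = *rolling;          /* first/second = third/fourth voice data */
        rolling->seconds = 0.0;   /* empty the rolling buffer */
        return true;
    }

    /* Steps A-31 to A-5, run once first and second voice data are ready. */
    void label_buffers(const voice_buf_t *first, const voice_buf_t *second,
                       double threshold)
    {
        double conf;
        bool is_app = model_judge(first, &conf);              /* step A-31 */

        if (conf < threshold)                                 /* step A-32 */
            is_app = user_already_confirmed() ? recorded_confirmation()
                                              : ask_user_confirmation();

        train_as_non_application_object(second);              /* receiver side */
        if (is_app)
            train_as_application_object(first);               /* step A-4 */
        else
            train_as_non_application_object(first);           /* step A-5 */
    }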
With the method shown in fig. 1 of the present application, all the data of a voice call may be saved as the first and second voice data and the voice recognition model then trained on them; alternatively, the method shown in fig. 2 may be configured to save and train at the same time. Steps A-11 to A-13 of fig. 2 may also be replaced by step A-1 of fig. 1, selected according to actual needs.
In steps A-12 and A-13 of fig. 2, the preset duration is greater than 10 seconds, or greater than the time taken to perform steps A-2 to A-5 of fig. 2.
In step A-31 of fig. 2 of the present application, user identity is not authenticated by face or fingerprint recognition; the voice recognition model itself performs the verification. Considering that an initially trained voice recognition model has a large judgment error, the user is asked to confirm the identity manually as an aid; as the model's recognition accuracy grows, manual participation is no longer required. The method of fig. 2 can thus run in the background in a "silent" manner, continuously training the voice recognition model without the user perceiving it.
The present invention also includes a system for acquiring voice data, as shown in fig. 3, including:
a storage module: when a user carries out voice call, a voice data stream transmitted in real time in the intelligent terminal system is stored, an input voice data stream of a microphone is stored as first voice data, and an output voice data stream of a receiver is stored as second voice data;
a detection module: detecting whether the first voice data and the second voice data meet the training requirements of the voice recognition model, if so, executing a user judgment module;
a user judgment module: judging whether the first voice data comes from an application object of the voice recognition model, if so, executing a voice object marking module 1, and otherwise, executing a voice object marking module 2;
the voice object marking module 1: marking the first voice data as application object voice data, marking the second voice data as non-application object voice data, and using the application object voice data for the voice feature learning of an application object in the voice recognition model; the non-application object voice data is used for learning the voice features of the non-application object in the voice recognition model;
the voice object marking module 2: the first voice data and the second voice data are marked as non-application object voice data.
Optionally, as shown in fig. 4, the storage module may include: a cycle recording module, a transmission module 1 and a transmission module 2.
A cycle recording module: storing the input voice data stream of the microphone as third voice data, and storing the output voice data stream of the receiver as fourth voice data;
the transmission module 1: and when the third voice data reaches the preset time length, the first voice data is made equal to the third voice data, meanwhile, the third voice data is made empty, the detection module is executed, and meanwhile, the detection module returns to the circulating recording module.
The transmission module 2: and when the voice of the fourth voice data reaches the preset duration, making the second voice data equal to the fourth voice data, and making the fourth voice data empty, executing the detection module, and returning to the circulating recording module.
Optionally, as shown in fig. 4, the user judgment module may include: a voice recognition model user judgment module and a user confirmation module.
the voice recognition model user judgment module: judging, by using the voice recognition model, whether the first voice data comes from the application object of the voice recognition model, and outputting the confidence of the result; if the confidence is less than the threshold, executing the user confirmation module; if the judgment result is the application object of the voice recognition model and the confidence is greater than or equal to the threshold, executing the voice object marking module 1; if the judgment result is not the application object of the voice recognition model and the confidence is greater than or equal to the threshold, executing the voice object marking module 2;
the user confirmation module: in the voice call, judging whether the user has already confirmed the identity; if not, asking the user to confirm whether the user is the application object of the voice recognition model and recording the confirmation result; if the user is the application object of the voice recognition model, executing the voice object marking module 1, and if not, executing the voice object marking module 2.
Optionally, in the detecting module, detecting whether the first speech data and the second speech data meet the training requirement of the speech recognition model includes:
detecting whether the first voice data and the second voice data contain non-silent features; if not, the model training requirement is not met; if so, continuing to detect whether the voices in the first voice data and the second voice data are clear; if not, the model training requirement is not met.
Optionally, in the detection module, if so, executing the user judgment module includes:
if so, performing voice cleaning on the first voice data and the second voice data and then executing the user judgment module.
It should be noted that the embodiments of the voice data acquisition system of the present invention follow the same principles as the embodiments of the voice data acquisition method; related points in each may be consulted for the other.
The above description is only exemplary of the present invention and should not be taken as limiting the scope of the present invention, and any modifications, equivalents, improvements and the like made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims (10)

1. A method for obtaining speech data, wherein the speech data is used for training a speech recognition model, the method comprising the steps of:
step A-1: when a user carries out voice call, a voice data stream transmitted in real time in the intelligent terminal system is stored, an input voice data stream of a microphone is stored as first voice data, and an output voice data stream of a receiver is stored as second voice data;
step A-2: detecting whether the first voice data and the second voice data meet the training requirements of a voice recognition model, if so, executing the step A-3;
step A-3: judging whether the first voice data come from an application object of the voice recognition model, if so, executing the step A-4, and if not, executing the step A-5;
step A-4: marking the first voice data as application object voice data, and marking the second voice data as non-application object voice data, wherein the application object voice data is used for voice feature learning of an application object in the voice recognition model; the non-application object voice data is used for voice feature learning of a non-application object in the voice recognition model;
step A-5: and marking the first voice data and the second voice data as the non-application object voice data.
2. The method of claim 1, wherein, in the step A-2, if so, performing the step A-3 comprises:
if so, performing voice cleaning on the first voice data and the second voice data, and then performing the step A-3.
3. The method of claim 1, wherein saving an input voice data stream of the microphone as first voice data and an output voice data stream of the receiver as second voice data comprises:
step A-11: saving the input voice data stream of the microphone as third voice data, saving the output voice data stream of the receiver as fourth voice data, and executing the step A-12 and the step A-13;
step A-12: when the third voice data reaches the preset time length, enabling the first voice data to be equal to the third voice data, enabling the third voice data to be empty, executing the step A-2, and returning to the step A-11;
step A-13: and when the voice of the fourth voice data reaches the preset duration, enabling the second voice data to be equal to the fourth voice data, enabling the fourth voice data to be empty, executing the step A-2, and returning to the step A-11.
4. The method of claim 1, wherein the step A-3 further comprises:
step A-31: judging whether the first voice data comes from an application object of the voice recognition model by using the voice recognition model, and outputting the confidence of the judgment result; if the confidence is less than a threshold, performing step A-32; if the judgment result is the application object of the voice recognition model and the confidence is greater than or equal to the threshold, performing step A-4; if the judgment result is not the application object of the voice recognition model and the confidence is greater than or equal to the threshold, performing step A-5;
step A-32: in the voice call, judging whether the user has already confirmed the identity; if not, asking the user to confirm whether the user is the application object of the voice recognition model and recording the confirmation result; if the user is the application object of the voice recognition model, performing step A-4, and if not, performing step A-5.
5. The method of claim 1, wherein detecting whether the first voice data and the second voice data meet voice recognition model training requirements comprises:
detecting whether the first voice data and the second voice data contain non-silent features; if not, the model training requirement is not met; if so, continuing to detect whether the voices in the first voice data and the second voice data are clear; if not, the model training requirement is not met.
6. A system for obtaining speech data for use in training a speech recognition model, the system comprising:
a storage module: when a user carries out voice call, a voice data stream transmitted in real time in the intelligent terminal system is stored, an input voice data stream of a microphone is stored as first voice data, and an output voice data stream of a receiver is stored as second voice data;
a detection module: detecting whether the first voice data and the second voice data meet the training requirements of a voice recognition model, if so, executing a user judgment module;
a user judgment module: judging whether the first voice data come from an application object of the voice recognition model, if so, executing a voice object marking module 1, and otherwise, executing a voice object marking module 2;
the voice object marking module 1: marking the first voice data as application object voice data, and marking the second voice data as non-application object voice data, wherein the application object voice data is used for voice feature learning of an application object in the voice recognition model; the non-application object voice data is used for voice feature learning of a non-application object in the voice recognition model;
the voice object marking module 2: marking the first voice data and the second voice data as the non-application object voice data.
7. The system of claim 6, wherein, in the detection module, if so, executing the user judgment module further comprises:
if so, performing voice cleaning on the first voice data and the second voice data and then executing the user judgment module.
8. The system of claim 6, wherein the save module further comprises:
a cycle recording module: saving the input voice data stream of the microphone as third voice data, saving the output voice data stream of the receiver as fourth voice data, and executing a transmission module 1 and a transmission module 2;
the transmission module 1: when the third voice data reaches a preset time length, enabling the first voice data to be equal to the third voice data, enabling the third voice data to be empty, executing a detection module, and returning to a circulating recording module;
the transmission module 2: and when the voice of the fourth voice data reaches the preset duration, making the second voice data equal to the fourth voice data, and making the fourth voice data empty, executing the detection module, and returning to the circulating recording module.
9. The system of claim 6, wherein the user determination module further comprises:
the voice recognition model user judgment module: judging whether the first voice data comes from an application object of the voice recognition model or not by using the voice recognition model, and outputting the confidence coefficient of a judgment result; if the confidence is less than the threshold, executing a user confirmation module; if the judgment result is the application object of the voice recognition model and the confidence coefficient is larger than or equal to a threshold value, executing a voice object mark 1; if the judgment result is not the application object of the voice recognition model and the confidence coefficient is larger than or equal to a threshold value, executing a voice object mark 2;
a user confirmation module: in the voice call, whether the user confirms the identity or not is judged, if not, the user is judged whether the user is the application object of the voice recognition model or not, the confirmation result of the user is recorded, if the user is the application object of the voice recognition model, the voice object mark 1 is executed, and if the user is not the application object of the voice recognition model, the voice object mark 2 is executed.
10. The system of claim 6, wherein, in the detection module, detecting whether the first voice data and the second voice data meet voice recognition model training requirements comprises:
detecting whether the first voice data and the second voice data contain non-silent features; if not, the model training requirement is not met; if so, continuing to detect whether the voices in the first voice data and the second voice data are clear; if not, the model training requirement is not met.
CN201810324045.2A (priority date 2018-04-12; filing date 2018-04-12): Method and system for acquiring voice data, granted as CN108510981B. Status: Active.

Priority Applications (3)

• CN201810324045.2A (CN108510981B): Method and system for acquiring voice data
• KR1020190035388A (KR20190119521A): Electronic apparatus and operation method thereof
• US16/382,712 (US10984795B2): Electronic apparatus and operation method thereof

Applications Claiming Priority (1)

• CN201810324045.2A (CN108510981B): Method and system for acquiring voice data

Publications (2)

Publication Number | Publication Date
CN108510981A | 2018-09-07
CN108510981B | 2020-07-24

Family

ID=63381824

Family Applications (1)

• CN201810324045.2A (Active, granted as CN108510981B): Method and system for acquiring voice data

Country Status (2)

Country | Link
KR | KR20190119521A
CN | CN108510981B

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN113016030A (en) * 2018-11-06 2021-06-22 株式会社赛斯特安国际 Method and device for providing voice recognition service
KR20220120197A (en) * 2021-02-23 2022-08-30 삼성전자주식회사 Electronic apparatus and controlling method thereof
EP4207805A4 (en) * 2021-02-23 2024-04-03 Samsung Electronics Co Ltd Electronic device and control method thereof

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
JP2002196781A (en) * 2000-12-26 2002-07-12 Toshiba Corp Voice interactive system and recording medium used for the same
JP5158174B2 (en) * 2010-10-25 2013-03-06 株式会社デンソー Voice recognition device
CN104517587B (en) * 2013-09-27 2017-11-24 联想(北京)有限公司 A kind of screen display method and electronic equipment
KR102274317B1 (en) * 2013-10-08 2021-07-07 삼성전자주식회사 Method and apparatus for performing speech recognition based on information of device
CN103956169B (en) * 2014-04-17 2017-07-21 北京搜狗科技发展有限公司 A kind of pronunciation inputting method, device and system
CN105976820B (en) * 2016-06-14 2019-12-31 上海质良智能化设备有限公司 Voice emotion analysis system

Also Published As

Publication number | Publication date
CN108510981A (en) 2018-09-07
KR20190119521A (en) 2019-10-22


Legal Events

Date | Code | Title | Description
PB01 | Publication
SE01 | Entry into force of request for substantive examination
GR01 | Patent grant