CN111477218A - Multi-voice recognition method, device, terminal and non-transitory computer-readable storage medium - Google Patents


Info

Publication number
CN111477218A
CN111477218A
Authority
CN
China
Prior art keywords
function
data
recording
voice recognition
resampling
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010302149.0A
Other languages
Chinese (zh)
Inventor
杨华东
董玲玲
于绞龙
王盟盟
Current Assignee
Beijing Thunderstone Technology Co ltd
Original Assignee
Beijing Thunderstone Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Beijing Thunderstone Technology Co ltd filed Critical Beijing Thunderstone Technology Co ltd
Priority to CN202010302149.0A
Publication of CN111477218A
Legal status: Pending

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/08 - Speech classification or search
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/26 - Speech to text systems

Abstract

The invention relates to the technical field of speech signal processing, and provides a multi-speech recognition method, device, terminal, and non-transitory computer-readable storage medium, so that speech can be recognized accurately whenever recognition is required. The method comprises the following steps: acquiring recording data through a recording component built into the terminal; copying and storing the recording data collected by the recording component; according to the function to be currently implemented, resampling the copied recording data at the audio data sampling rate corresponding to that function; and transmitting the resampled data to the speech recognition library corresponding to that function for speech recognition. The technical solution avoids the drawback in the prior art of using a single speech recognition library to recognize many different kinds of speech: different speech data are recognized by different speech recognition libraries, which improves the accuracy of speech recognition.

Description

Multi-voice recognition method, device, terminal and non-transitory computer-readable storage medium
Technical Field
The present invention relates to the field of speech signal processing technologies, and in particular, to a multi-speech recognition method, apparatus, terminal, and non-transitory computer-readable storage medium.
Background
Artificial intelligence has been widely applied in many industries. In the KTV industry, it is already used in jukeboxes, where it controls jukebox functions such as song ordering, playback control, and light switching on current equipment. At present, a box terminal receives voice-input control instructions, and a speech recognition library inside it implements the control function.
However, the jukebox control function described above is implemented with only one speech recognition library. When different kinds of speech need to be recognized, recognition accuracy drops sharply, and recognition may even fail.
Disclosure of Invention
The present invention provides a multi-speech recognition method, apparatus, terminal, and non-transitory computer-readable storage medium, so that speech can be recognized accurately whenever recognition is required.
In one aspect, the present invention provides a multi-speech recognition method, including:
acquiring recording data through a recording component built in the terminal;
copying and storing the recording data acquired by the recording component;
according to the function to be realized currently, resampling the copied recording data by adopting an audio data sampling rate corresponding to the function to be realized currently;
and transmitting the resampled data obtained after resampling to a voice recognition library corresponding to the function to be realized currently for voice recognition.
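The four claimed steps can be sketched as a small dispatch pipeline. This is an illustrative sketch only: the function names, the rate table, and the stand-in recognizer callables are assumptions rather than the patent's actual API, and the integer-stride decimation stands in for a real resampler.

```python
# Map each function to the sample rate its recognition library expects.
RATE_BY_FUNCTION = {"voice_command": 16000, "humming": 8000}
SOURCE_RATE = 48000  # assumed native rate of the recording component

def resample_for(function, recording):
    """Copy the recording, then decimate it to the rate the chosen
    function's library expects (naive decimation: keep every Nth
    sample; no anti-alias filtering)."""
    copied = list(recording)                        # step 2: copy first
    factor = SOURCE_RATE // RATE_BY_FUNCTION[function]
    return copied[::factor]                         # step 3: resample

def recognize(function, recording, libraries):
    """Step 4: route the resampled data to the matching library."""
    return libraries[function](resample_for(function, recording))

# Stand-in 'recognition libraries' that just report what they received.
libs = {
    "voice_command": lambda pcm: "command audio: %d samples" % len(pcm),
    "humming": lambda pcm: "humming audio: %d samples" % len(pcm),
}
```

A one-second 48 kHz recording routed to the humming library thus arrives as 8000 samples, while the same recording routed to the voice-command library arrives as 16000 samples.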
Specifically, the copying and storing the recording data acquired by the recording component includes:
and copying and storing the recording data acquired by the recording component according to the principle of optimizing the resources of the memory and the central processing unit.
Specifically, the copying and storing the recording data collected by the recording component according to the principle of optimizing the resources of the memory and the central processing unit includes:
identifying the function to be realized currently, and only copying the recording data corresponding to the function to be realized currently in the recording data;
and only the recording data corresponding to the function to be realized currently is stored.
Specifically, after or while the recording data is acquired by the recording component built in the terminal, the method further includes:
and preprocessing the recording data acquired by the recording component, and copying and storing the preprocessed recording data.
Specifically, after the resample data obtained after the resampling is transmitted to the speech recognition library corresponding to the function to be currently implemented for speech recognition, the method further includes:
and displaying the result after the voice recognition.
In another aspect, the present invention provides a multi-speech recognition method, comprising:
recording data are collected through a recording component arranged in the jukebox;
copying and storing the recording data acquired by the recording component;
if the function to be realized at present is the intelligent voice recognition function of the jukebox, resampling the copied recording data by adopting the audio data sampling rate corresponding to the intelligent voice recognition function to obtain first resampled data;
if the function to be realized currently is the humming song-ordering function of the jukebox, resampling the copied recording data by adopting the audio data sampling rate corresponding to the humming song-ordering function to obtain second resampled data;
and transmitting the first resampling data to a voice instruction recognition base corresponding to the intelligent voice recognition function for voice recognition, and transmitting the second resampling data to a humming voice recognition base corresponding to the humming song-ordering function for voice recognition.
In a third aspect, the present invention provides a multi-speech recognition apparatus, comprising:
the data acquisition module is used for acquiring recording data through a recording component arranged in the terminal;
the data copying module is used for copying and storing the recording data acquired by the recording component;
the resampling module is used for resampling the copied recording data by adopting an audio data sampling rate corresponding to the currently realized function according to the currently realized function;
and the voice recognition module is used for transmitting the resampled data obtained after resampling to a voice recognition library corresponding to the current function to be realized for voice recognition.
In a fourth aspect, the present invention provides a multi-speech recognition apparatus, comprising:
the jukebox recording module is used for acquiring recording data through a recording component arranged in the jukebox;
the copying module is used for copying and storing the recording data acquired by the recording component;
a first resampling module, configured to, if a function to be currently implemented is an intelligent voice recognition function of the jukebox, resample the copied recording data at an audio data sampling rate corresponding to the intelligent voice recognition function to obtain first resampled data;
a second resampling module, configured to, if the currently implemented function is the humming song-ordering function of the jukebox, resample the copied recording data by using an audio data sampling rate corresponding to the humming song-ordering function to obtain second resampled data;
the first recognition module is used for transmitting the first resample data to a voice instruction recognition library corresponding to the intelligent voice recognition function for voice recognition;
and the second recognition module is used for transmitting the second resampling data to a humming voice recognition library corresponding to the humming song-ordering function for voice recognition.
In a fifth aspect, the present invention provides a terminal, which comprises a memory, a processor and a computer program stored in the memory and operable on the processor, wherein the processor executes the computer program to implement the steps of the method according to the above technical solution.
In a sixth aspect, the invention provides a non-transitory computer-readable storage medium, storing a computer program which, when executed by a processor, implements the steps of the method according to the above technical solution.
In the technical solution of the invention, the recorded data is resampled, according to the function to be currently implemented, at the audio data sampling rate corresponding to that function, and the resampled data is then recognized by the speech recognition library corresponding to that function. This avoids the drawback in the prior art of using a single speech recognition library to recognize many different kinds of speech: different speech data are recognized by different speech recognition libraries, which improves the accuracy of speech recognition.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art are briefly described below. Obviously, the drawings in the following description show only some embodiments of the present invention; those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a flow chart of a multi-speech recognition method according to an embodiment of the present invention;
FIG. 2 is a flow chart of a multi-speech recognition method according to another embodiment of the present invention;
fig. 3 is a flowchart of an application scenario in which the method provided by the embodiment of the present invention is applied to a jukebox for song ordering;
FIG. 4 is a schematic structural diagram of a multi-speech recognition apparatus according to an embodiment of the present invention;
FIG. 5 is a schematic structural diagram of a multi-speech recognition apparatus according to another embodiment of the present invention;
fig. 6 is a schematic structural diagram of a terminal according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
In this specification, adjectives such as "first" and "second" are used only to distinguish one element or action from another, and do not necessarily require or imply any actual relationship or order between them. Where the context permits, a reference to an element, component, or step should be construed as covering one or more of that element, component, or step.
In the present specification, the sizes of the respective portions shown in the drawings are not drawn in an actual proportional relationship for the convenience of description.
The invention provides a multi-voice recognition method, which is shown in fig. 1 and mainly comprises steps S101 to S104, detailed as follows:
step S101: and recording data is acquired through a recording component arranged in the terminal.
Take a jukebox terminal as an example: a jukebox running the Android system has a built-in recording component, so recording data can be collected through it. It should be noted that, in the embodiment of the present invention, the recorded data has a wide range of sources: it may be an instruction issued by the current user or content hummed by the current user, and in entertainment scenes such as KTV, the recording also captures applause, cheers, and conversations of other users. Since the target data processed by the terminal are the user's instructions and the content hummed by the user, the applause, cheers, and speech of other users are classified as noise.
Step S102: and copying and storing the recording data acquired by the recording component.
In the embodiment of the invention, the recording data collected by the recording component is copied and stored for subsequent processing such as resampling and speech recognition. As described above, the recorded data includes target data, such as the user's instructions and hummed content, but also noise such as applause, cheers, and other users' speech. To make subsequent speech recognition more accurate, in the embodiment of the present invention, after or while the recording data is collected by the terminal's built-in recording component, the following may also be performed: preprocess the recording data collected by the recording component, for example by echo removal and noise reduction, and then copy and store the preprocessed recording data.
Since overall performance needs to be considered, in the embodiment of the present invention the recording data collected by the recording component may be copied and stored according to the principle of optimizing memory and central-processing-unit resources. Specifically, this means: identify the function to be currently implemented, copy only the recording data corresponding to that function, and save only that recording data. Because only the recording data needed by the current function is copied and stored, rather than all recording data, this approach saves memory resources and reduces the computational overhead on the central processing unit.
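The copy-only-what-is-needed rule described above can be sketched as follows; the class, tag names, and callback are hypothetical illustrations of the principle, not the patent's implementation.

```python
class SelectiveRecorder:
    """Retain copies of only the audio chunks that belong to the
    currently active function; drop everything else to save memory
    and CPU."""

    def __init__(self, active_function):
        self.active_function = active_function
        self.saved = []  # copies of chunks for the active function only

    def on_audio(self, tag, chunk):
        """Copy and store a chunk only if it is tagged for the
        currently active function."""
        if tag == self.active_function:
            self.saved.append(list(chunk))  # explicit copy before storing
        # chunks for inactive functions are silently discarded

recorder = SelectiveRecorder("voice_command")
recorder.on_audio("voice_command", [1, 2, 3])  # kept: function is active
recorder.on_audio("humming", [4, 5, 6])        # dropped: function inactive
```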
Step S103: and according to the function to be realized currently, resampling the copied recording data by adopting the audio data sampling rate corresponding to the function to be realized currently.
In the prior art, the same speech recognition library is used for all functions during speech recognition. However, different functions impose different requirements on the sampling rate of the data, so if the same sampling rate and the same speech recognition library are used for everything, the recognition results will inevitably be inaccurate or even wrong. Unlike the prior art, in the embodiment of the present invention the copied recording data is first resampled, according to the function to be currently implemented, at the audio data sampling rate corresponding to that function; in other words, the resampled data meets the current function's requirement on the data sampling rate. Here, from the viewpoint of the sampling rate, resampling may be either upsampling (the new sampling rate is greater than the original signal's sampling rate) or downsampling (the new sampling rate is less than the original signal's sampling rate); from another perspective, resampling methods include the nearest-neighbor method, bilinear interpolation, and cubic convolution interpolation.
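As a concrete illustration of resampling to an arbitrary target rate, here is a minimal linear-interpolation resampler. It handles both upsampling and downsampling; a production resampler would also low-pass filter before downsampling to prevent aliasing, a step omitted here for brevity.

```python
def linear_resample(samples, src_rate, dst_rate):
    """Resample a 1-D sequence by linear interpolation. Works for
    both upsampling (dst_rate > src_rate) and downsampling
    (dst_rate < src_rate)."""
    if not samples or src_rate == dst_rate:
        return list(samples)
    n_out = max(1, round(len(samples) * dst_rate / src_rate))
    # Spread n_out output points evenly across the input span.
    step = (len(samples) - 1) / max(1, n_out - 1)
    out = []
    for i in range(n_out):
        pos = i * step
        lo = int(pos)
        hi = min(lo + 1, len(samples) - 1)
        frac = pos - lo
        # Blend the two neighboring input samples.
        out.append(samples[lo] * (1 - frac) + samples[hi] * frac)
    return out
```

Replacing the interpolation step with `samples[round(pos)]` would give the nearest-neighbor method mentioned above; cubic methods blend four neighbors instead of two.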
Step S104: and transmitting the resampled data obtained after resampling to a voice recognition library corresponding to the function to be realized currently for voice recognition.
As mentioned above, speech for different functions must be recognized by different speech recognition libraries so that recognition accuracy can be guaranteed. Therefore, after the copied recording data has been resampled at the audio data sampling rate corresponding to the function to be currently implemented, the resampled data is transmitted to the speech recognition library corresponding to that function for speech recognition. The result of the speech recognition is then presented; in the case of a jukebox, for example, the result may be the list of songs ordered by the user.
As can be seen from the multi-speech recognition method illustrated in fig. 1, unlike the prior art, in which only a single speech recognition library is used to recognize many different kinds of speech, the technical solution of the present invention resamples the recorded data at the audio data sampling rate corresponding to the function to be currently implemented, and then performs speech recognition on the resampled data with the speech recognition library corresponding to that function. This avoids the drawback of using a single speech recognition library for many different kinds of speech: different speech data are recognized by different speech recognition libraries, which improves the accuracy of speech recognition.
Referring to fig. 2, a flow chart of a multi-speech recognition method according to another embodiment of the invention is shown. The method is explained taking a KTV jukebox as an example and comprises steps S201 to S206:
step S201: recording data is acquired through a recording component arranged in the jukebox.
Generally, a jukebox running the Android system has a built-in recording component, so recording data can be collected with it. In the embodiment of the invention, the jukebox has at least a Xiao Ai intelligent voice recognition function (hereinafter referred to as the intelligent voice recognition function) and a humming song-ordering function. The recorded data may be an instruction issued by the current user through the jukebox's intelligent voice recognition function, or the content hummed by the user and recorded through the humming song-ordering function; in an entertainment scene such as KTV, the recording also captures applause, cheers, and other users' conversations. Because the target data processed by the terminal are the instruction issued through the intelligent voice recognition function and the hummed content recorded through the humming song-ordering function, the applause, cheers, and other users' speech are classified as noise. In the embodiment of the invention, the recording data is stereo (two-channel) audio data with a sampling rate of 48000 Hz and 16-bit quantization precision.
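A quick back-of-the-envelope calculation shows why this recording format matters for the memory concerns discussed below. Assuming the stated "16 k" quantization precision denotes 16-bit samples, one second of 48000 Hz stereo recording occupies:

```python
SAMPLE_RATE = 48000      # samples per second, per channel
CHANNELS = 2             # stereo recording
BYTES_PER_SAMPLE = 2     # 16-bit quantization

bytes_per_second = SAMPLE_RATE * CHANNELS * BYTES_PER_SAMPLE
one_minute = 60 * bytes_per_second  # roughly 11 MB of raw PCM per minute
```

At roughly 11 MB per minute of raw PCM, copying every stream for every function would be wasteful, which motivates copying only the data the current function needs.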
Step S202: and copying and storing the recording data acquired by the recording component.
The implementation of step S202 is the same as that of step S102 in the foregoing embodiment; for the related concepts and terms, refer to that embodiment. For example, since resampling consumes system memory and CPU resources, when copying the recording data, only the recording data corresponding to the function to be currently implemented (for example, the jukebox's intelligent voice recognition function) is copied and saved; recording data corresponding to a function that is not currently in use, such as the humming song-ordering function, is neither copied nor saved.
Step S203: if the function to be realized at present is the intelligent voice recognition function of the jukebox, the copied recording data is resampled by adopting the audio data sampling rate corresponding to the intelligent voice recognition function of the jukebox to obtain first resampled data.
In the prior art, the same speech recognition library is used for all functions during speech recognition. However, different functions impose different requirements on parameters such as the data's sampling rate, channel count, quantization precision, bit rate, and frame rate, and if the same parameters and the same speech recognition library are used for everything, the recognition results will inevitably be inaccurate or even wrong. Taking the jukebox's intelligent voice recognition function as an example, speech recognition is subsequently performed on data acquired through this function, and its speech recognition library requires recording data with a sampling rate of 16000 Hz, 16-bit quantization precision, and a mono (left) channel. Therefore, if the function to be currently implemented is the jukebox's intelligent voice recognition function, the copied recording data is resampled at 16000 Hz, the audio data sampling rate corresponding to this function, to obtain the first resampled data (since the original recording data is sampled at 48000 Hz, the resampling here is downsampling).
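Because 48000 Hz is an exact multiple of 16000 Hz, the downsampling described above can be done by integer-factor decimation. The sketch below averages each group of samples as a crude stand-in for a proper anti-aliasing filter; it illustrates the idea and is not the patent's actual resampler.

```python
def decimate(samples, src_rate=48000, dst_rate=16000):
    """Integer-factor decimation: average each group of `factor`
    consecutive samples into one output sample. For 48 kHz -> 16 kHz
    the factor is 3; for 48 kHz -> 8 kHz it is 6."""
    assert src_rate % dst_rate == 0, "integer-factor decimation only"
    factor = src_rate // dst_rate
    return [
        sum(samples[i:i + factor]) / factor  # crude anti-alias averaging
        for i in range(0, len(samples) - factor + 1, factor)
    ]
```

The same helper covers both target rates used in this embodiment: 16000 Hz for the intelligent voice recognition function and 8000 Hz for the humming song-ordering function.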
Step S204: if the function to be realized at present is the humming and song-ordering function of the jukebox, the copied recording data is resampled by adopting the audio data sampling rate corresponding to the humming and song-ordering function of the jukebox, and second resampled data is obtained.
Speech recognition is likewise performed on data obtained through the humming song-ordering function, and its speech recognition library requires recording data with a sampling rate of 8000 Hz, 16-bit quantization precision, and a mono (left) channel. Therefore, if the function to be currently implemented is the jukebox's humming song-ordering function, the copied recording data is resampled at 8000 Hz, the audio data sampling rate corresponding to this function, to obtain the second resampled data (again downsampling, since the original recording data is sampled at 48000 Hz).
Step S205: and transmitting the first resample data obtained in the step S203 to a voice instruction recognition library corresponding to the intelligent voice recognition function for voice recognition.
Step S206: the second resampled data obtained in step S204 is transmitted to the humming speech recognition library corresponding to the humming song ordering function for speech recognition.
Because the recording data that the voice instruction recognition library can recognize is data with a sampling rate of 16000 Hz, 16-bit quantization precision, and a mono (left) channel, while the recording data that the humming speech recognition library can recognize is data with a sampling rate of 8000 Hz, 16-bit quantization precision, and a mono (left) channel, the first resampled data obtained in step S203 is transmitted to the voice instruction recognition library corresponding to the intelligent voice recognition function for speech recognition, and the second resampled data obtained in step S204 is transmitted to the humming speech recognition library corresponding to the humming song-ordering function for speech recognition.
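The mono (left-channel) requirement can be illustrated with a small helper: interleaved 16-bit stereo PCM stores frames as [left, right, left, right, ...], so the left channel is every even-indexed sample. This helper is a hypothetical sketch, not part of the patent.

```python
import struct

def left_channel(stereo_pcm):
    """Extract the left channel from interleaved 16-bit little-endian
    stereo PCM bytes, returning a list of signed sample values."""
    n = len(stereo_pcm) // 2                  # number of 16-bit samples
    samples = struct.unpack("<%dh" % n, stereo_pcm[: n * 2])
    return list(samples[0::2])                # even indices: left channel

# Two stereo frames: (L=100, R=-100) and (L=200, R=-200).
pcm = struct.pack("<4h", 100, -100, 200, -200)
```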
Referring to fig. 3, a flow chart of applying the method provided by the embodiment of the present invention, i.e., multi-speech recognition using the voice instruction recognition library and the humming speech recognition library, to the jukebox is shown. In this application scenario, a voice assistant called Xiao Ai runs on the jukebox, and a series of Xiao Ai skills help the user order songs by humming. The flow is illustrated as follows:
step S301: arouse the love.
In the embodiment of the invention, waking up Xiao Ai requires mono (left-channel) dry (unprocessed) vocal recording data with a sampling rate of 16000 Hz and 16-bit quantization precision. The user speaks a wake command similar to "Xiao Ai tongxue" (小爱同学), which the system resamples at a sampling rate of 16000 Hz.
Step S302: and invoking a humming song-ordering function.
In the embodiment of the invention, the user can invoke the humming song-ordering function through the Xiao Ai skill of humming song ordering. In the above steps, both waking up Xiao Ai and invoking the humming song-ordering skill fall under the jukebox's intelligent voice recognition function. To reduce memory and CPU consumption, the Xiao Ai skill can be turned off after the humming song-ordering function has been invoked.
Step S303: and collecting the recording data.
As previously mentioned, the recording data can be collected by the jukebox's built-in recording component. At this point the humming song-ordering function has been invoked, so the recording data from the left channel is resampled at a sampling rate of 8000 Hz with 16-bit quantization precision, and the resampled data is transmitted to the humming speech recognition library.
Step S304: The humming speech recognition library recognizes the resampled recording data.
The humming speech recognition library recognizes the resampled recording data through a series of recognition algorithms, matches it against the song information in the cloud, and returns the result. One outcome is recognition failure, for which a prompt of "not heard" or "recognition failed" is returned (step S305); the other is successful recognition, whose result is displayed on the television (step S306).
Step S305: Prompt that the input was not heard or recognition failed.
Step S306: Display the successful recognition result on the television.
The displayed result may be, for example, the song the user ordered, the singer's name, and so on.
Step S307: Enable Xiao Ai's continuous skills.
In the embodiment of the invention, Xiao Ai's continuous skills include page turning, song ordering, and the like. The user only needs to speak a voice command such as "change a batch" or "next page"; Xiao Ai then executes the page-turning skill (step S308) and the television displays a new batch of songs, and if no songs remain after several page turns, the television displays "no more songs" (step S309). Alternatively, the user speaks a voice command such as "the xth", and Xiao Ai executes the skill of ordering the xth song (step S310).
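The continuous-skill routing just described can be sketched as a tiny command dispatcher. The English command phrases and skill identifiers below are assumptions made for illustration, not the patent's actual command grammar.

```python
import re

def dispatch_skill(command):
    """Route a recognized voice command to a continuous skill:
    page turning (step S308) or ordering the Nth song (step S310)."""
    if command in ("change a batch", "next page"):
        return ("turn_page", None)
    m = re.match(r"the (\d+)(st|nd|rd|th)$", command)
    if m:
        return ("order_song", int(m.group(1)))
    return ("unrecognized", None)
```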
Step S308: the skill of turning the page is performed.
Step S309: the television side displays "no more songs".
Step S310: a skill to order the xth song is performed.
Step S311: and searching the music library.
After the skill of ordering the xth song is performed, the system searches the song library using the song name and singer name selected by the user and obtains information such as the song's ID (step S313). If no song is found, the prompt "song selection failed; song not found" is displayed on the television (step S312).
Step S312: and displaying 'song selection fails and the song is not found' on the television.
Step S313: the ID of the song is acquired.
If the song is a local song, it is added to the ordered list and played in sequence (step S314); if it is a cloud song, it is added to the ordered list and downloaded (step S315).
Step S314: and ordering songs to the ordered list and playing the songs in sequence.
Step S315: add the ordered list and download is complete.
Step S316: the song is played.
Referring to fig. 4, a multi-speech recognition apparatus according to an embodiment of the present invention includes a data collection module 401, a data replication module 402, a resampling module 403, and a speech recognition module 404, which are detailed as follows:
the data acquisition module 401 is used for acquiring recording data through a recording component built in the terminal;
the data copying module 402 is configured to copy and store the recording data acquired by the recording component;
a resampling module 403, configured to resample, according to a function to be currently implemented, the copied recording data at an audio data sampling rate corresponding to the function to be currently implemented;
and the voice recognition module 404 is configured to transmit the resampled data obtained after resampling to a voice recognition library corresponding to the function to be currently implemented for voice recognition.
Referring to fig. 5, a multi-voice recognition apparatus according to another embodiment of the present invention includes a jukebox recording module 501, a copying module 502, a first resampling module 503, a second resampling module 504, a first recognition module 505, and a second recognition module 506, detailed as follows:
a jukebox recording module 501, configured to collect recording data through a recording component built into the jukebox;
a copying module 502, configured to copy and store the recording data collected by the recording component;
a first resampling module 503, configured to, if the function to be currently implemented is the intelligent voice recognition function of the jukebox, resample the copied recording data at the audio data sampling rate corresponding to the intelligent voice recognition function to obtain first resampled data;
a second resampling module 504, configured to, if the function to be currently implemented is the humming song-ordering function of the jukebox, resample the copied recording data at the audio data sampling rate corresponding to the humming song-ordering function to obtain second resampled data;
a first recognition module 505, configured to transmit the first resampled data to the voice instruction recognition library corresponding to the intelligent voice recognition function for voice recognition;
a second recognition module 506, configured to transmit the second resampled data to the humming voice recognition library corresponding to the humming song-ordering function for voice recognition.
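The two branches (modules 503/505 vs. 504/506) amount to a dispatch on the active function. The sketch below illustrates this; the 16 kHz and 8 kHz rates are illustrative assumptions only, since the patent does not fix concrete sampling rates, and the recognizers are passed in as callables:

```python
# Illustrative rates only -- the patent leaves the concrete values open.
COMMAND_RATE = 16000   # assumed rate for the voice-command library
HUMMING_RATE = 8000    # assumed rate for the humming library

def naive_resample(samples, src_rate, dst_rate):
    # Nearest-neighbour rate conversion, sufficient for a sketch.
    n = int(len(samples) * dst_rate / src_rate)
    return [samples[int(i * src_rate / dst_rate)] for i in range(n)]

def dispatch(copied_recording, source_rate, function,
             command_recognizer, humming_recognizer):
    """Modules 503-506: resample per function, then route to the
    matching recognition library (hypothetical names)."""
    if function == "voice_command":
        first = naive_resample(copied_recording, source_rate, COMMAND_RATE)
        return command_recognizer(first)     # module 505
    elif function == "humming":
        second = naive_resample(copied_recording, source_rate, HUMMING_RATE)
        return humming_recognizer(second)    # module 506
    raise ValueError("unknown function: " + function)
```

The same copied recording thus reaches whichever library matches the active function, already at that library's expected rate.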
As can be seen from the above description, the prior art uses a single speech recognition library to recognize several different kinds of speech. The technical solution of the present invention instead resamples the recorded data at the audio data sampling rate corresponding to the function to be currently implemented, and then performs speech recognition on the resampled data with the speech recognition library corresponding to that function. This avoids the defect of recognizing different kinds of speech with a single library: each kind of speech data is recognized by its own recognition library, which improves the accuracy of speech recognition.
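The resampling step itself, converting the copied recording to the rate a given library expects, can be done with simple linear interpolation. This is a sketch only; a production implementation would low-pass filter before downsampling to avoid aliasing (e.g. a polyphase filter):

```python
def resample_linear(samples, src_rate, dst_rate):
    """Linearly interpolate `samples` from src_rate to dst_rate (Hz).

    Sketch only: no anti-aliasing filter is applied, so downsampling
    real audio this way would alias high frequencies.
    """
    if src_rate == dst_rate:
        return list(samples)
    n_out = int(len(samples) * dst_rate / src_rate)
    out = []
    for i in range(n_out):
        pos = i * src_rate / dst_rate        # fractional source index
        j = int(pos)
        frac = pos - j
        nxt = samples[j + 1] if j + 1 < len(samples) else samples[j]
        out.append(samples[j] * (1 - frac) + nxt * frac)
    return out
```

Halving the rate of a four-sample ramp, for instance, keeps every other interpolated value.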
Fig. 6 is a schematic structural diagram of a terminal according to an embodiment of the present invention. As shown in fig. 6, the terminal 6 of this embodiment may be a jukebox. The terminal illustrated in fig. 6 mainly includes: a processor 60, a memory 61, and a computer program 62, such as a program implementing the multi-voice recognition method, stored in the memory 61 and executable on the processor 60. When executing the computer program 62, the processor 60 implements the steps of the above-described embodiments of the multi-voice recognition method, such as steps S101 to S104 shown in fig. 1 or steps S201 to S206 shown in fig. 2. Alternatively, the processor 60 implements the functions of the modules/units in the above-described apparatus embodiments, such as the data collection module 401, the data copying module 402, the resampling module 403, and the voice recognition module 404 shown in fig. 4, or the jukebox recording module 501, the copying module 502, the first resampling module 503, the second resampling module 504, the first recognition module 505, and the second recognition module 506 shown in fig. 5.
Illustratively, the computer program 62 of the multi-voice recognition method mainly comprises: collecting recording data through a recording component built into the terminal; copying and then storing the recording data collected by the recording component; resampling, according to the function to be currently implemented, the copied recording data at the audio data sampling rate corresponding to that function; and transmitting the resampled data obtained after resampling to the voice recognition library corresponding to the function to be currently implemented for voice recognition. Alternatively, the computer program 62 of the multi-voice recognition method mainly comprises: collecting recording data through a recording component built into the jukebox; copying and then storing the recording data collected by the recording component; if the function to be currently implemented is the intelligent voice recognition function of the jukebox, resampling the copied recording data at the audio data sampling rate corresponding to that function to obtain first resampled data; if the function to be currently implemented is the humming song-ordering function of the jukebox, resampling the copied recording data at the audio data sampling rate corresponding to that function to obtain second resampled data; transmitting the first resampled data to the voice instruction recognition library corresponding to the intelligent voice recognition function for voice recognition; and transmitting the second resampled data to the humming voice recognition library corresponding to the humming song-ordering function for voice recognition. The computer program 62 may be divided into one or more modules/units, which are stored in the memory 61 and executed by the processor 60 to implement the present invention.
The one or more modules/units may be a series of computer program instruction segments capable of performing specific functions, which describe the execution of the computer program 62 in the terminal 6. For example, the computer program 62 may be divided into the data collection module 401, the data copying module 402, the resampling module 403, and the voice recognition module 404 (modules in a virtual device), or into the jukebox recording module 501, the copying module 502, the first resampling module 503, the second resampling module 504, the first recognition module 505, and the second recognition module 506 (modules in a virtual device). The specific functions of each module are as follows: the data collection module 401 collects recording data through a recording component built into the terminal; the data copying module 402 copies and stores the recording data collected by the recording component; the resampling module 403 resamples, according to the function to be currently implemented, the copied recording data at the audio data sampling rate corresponding to that function; the voice recognition module 404 transmits the resampled data obtained after resampling to the voice recognition library corresponding to the function to be currently implemented for voice recognition; the jukebox recording module 501 collects recording data through a recording component built into the jukebox; the copying module 502 copies and stores the recording data collected by the recording component; the first resampling module 503, if the function to be currently implemented is the intelligent voice recognition function of the jukebox, resamples the copied recording data at the audio data sampling rate corresponding to the intelligent voice recognition function to obtain first resampled data; the second resampling module 504, if the function to be currently implemented is the humming song-ordering function of the jukebox, resamples the copied recording data at the audio data sampling rate corresponding to the humming song-ordering function to obtain second resampled data; the first recognition module 505 transmits the first resampled data to the voice instruction recognition library corresponding to the intelligent voice recognition function for voice recognition; and the second recognition module 506 transmits the second resampled data to the humming voice recognition library corresponding to the humming song-ordering function for voice recognition.
The terminal 6 may include, but is not limited to, the processor 60 and the memory 61. Those skilled in the art will appreciate that fig. 6 is merely an example of the terminal 6 and does not constitute a limitation on it; the terminal may include more or fewer components than shown, combine certain components, or have different components. For example, the computing device may also include input/output devices, a network access device, a bus, and so on.
The processor 60 may be a Central Processing Unit (CPU), another general-purpose processor, a Digital Signal Processor (DSP), an Application-Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, discrete hardware components, etc. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like.
The memory 61 may be an internal storage unit of the terminal 6, such as a hard disk or memory of the terminal 6. The memory 61 may also be an external storage device of the terminal 6, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a flash card with which the terminal 6 is equipped. Further, the memory 61 may include both an internal storage unit and an external storage device of the terminal 6. The memory 61 stores the computer program as well as other programs and data required by the terminal, and may also temporarily store data that has been output or is to be output.
It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-mentioned division of the functional units and modules is illustrated, and in practical applications, the above-mentioned functions may be distributed as required to different functional units and modules, that is, the internal structure of the apparatus may be divided into different functional units or modules to implement all or part of the functions described above. Each functional unit and module in the embodiments may be integrated in one processing unit, or each unit may exist alone physically, or two or more units are integrated in one unit, and the integrated unit may be implemented in a form of hardware, or in a form of software functional unit. In addition, specific names of the functional units and modules are only for convenience of distinguishing from each other, and are not used for limiting the protection scope of the present application. The specific working processes of the units and modules in the above-mentioned apparatus may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the above embodiments, the descriptions of the respective embodiments have respective emphasis, and reference may be made to the related descriptions of other embodiments for parts that are not described or illustrated in a certain embodiment.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
In the embodiments provided by the present invention, it should be understood that the disclosed apparatus/terminal and method may be implemented in other ways. For example, the apparatus/terminal embodiments described above are merely illustrative: the division into modules or units is only a division by logical function, and other divisions are possible in actual implementation; for instance, multiple units or components may be combined or integrated into another device, or some features may be omitted or not executed. In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, devices, or units, and may be electrical, mechanical, or in other forms.
Units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated modules/units, if implemented in the form of software functional units and sold or used as separate products, may be stored in a non-transitory computer readable storage medium. Based on such understanding, all or part of the processes in the method of the embodiments of the present invention may also be implemented by instructing related hardware through a computer program, where the computer program of the multiple voice recognition method may be stored in a non-transitory computer-readable storage medium, and when being executed by a processor, the computer program may implement the steps of the embodiments of the methods, that is, the recording data is collected through a recording component built in the terminal; recording data collected by the recording component is copied and then stored; according to the function to be realized at present, resampling the copied recording data by adopting the audio data sampling rate corresponding to the function to be realized at present; transmitting the resampled data obtained after resampling to a voice recognition library corresponding to the function to be realized currently for voice recognition; or, recording data is collected through a recording component arranged in the jukebox; recording data collected by the recording component is copied and then stored; if the function to be realized at present is the intelligent voice recognition function of the jukebox, resampling the copied recording data by adopting the audio data sampling rate corresponding to the intelligent voice recognition function of the jukebox to obtain first resampled data; if the function to be realized currently is the humming and song-ordering function of the jukebox, resampling the copied recording data by adopting the audio data sampling rate corresponding to the humming and song-ordering function of the jukebox to obtain second resampled data; transmitting the obtained first resampling data to a voice instruction recognition 
library corresponding to the intelligent voice recognition function for voice recognition; and transmitting the obtained second resampled data to the humming voice recognition library corresponding to the humming song-ordering function for voice recognition. The computer program comprises computer program code, which may be in the form of source code, object code, an executable file, some intermediate form, or the like. The non-transitory computer-readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a Read-Only Memory (ROM), a Random Access Memory (RAM), an electrical carrier signal, a telecommunications signal, a software distribution medium, and the like. It should be noted that the content contained in the non-transitory computer-readable medium may be increased or decreased as appropriate according to the requirements of legislation and patent practice in a given jurisdiction; for example, in some jurisdictions, the non-transitory computer-readable medium excludes electrical carrier signals and telecommunications signals. The above examples are only intended to illustrate the technical solution of the present invention, not to limit it. Although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art will understand that the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; such modifications and substitutions do not substantially depart from the spirit and scope of the embodiments of the present invention and are intended to be included within the scope of the present invention.
The above-mentioned embodiments are intended to illustrate the objects, technical solutions and advantages of the present invention in further detail, and it should be understood that the above-mentioned embodiments are merely exemplary embodiments of the present invention, and are not intended to limit the scope of the present invention, and any modifications, equivalent substitutions, improvements and the like made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims (10)

1. A multi-speech recognition method, comprising:
acquiring recording data through a recording component built in the terminal;
copying and storing the recording data acquired by the recording component;
according to the function to be realized currently, resampling the copied recording data by adopting an audio data sampling rate corresponding to the function to be realized currently;
and transmitting the resampled data obtained after resampling to a voice recognition library corresponding to the function to be realized currently for voice recognition.
2. The multi-speech recognition method of claim 1, wherein copying and storing the recording data collected by the recording component comprises:
and copying and storing the recording data acquired by the recording component according to the principle of optimizing the resources of the memory and the central processing unit.
3. The multi-speech recognition method of claim 2, wherein copying and storing the recording data collected by the recording component according to the principle of optimizing the resources of the memory and the central processing unit comprises:
identifying the function to be realized currently, and only copying the recording data corresponding to the function to be realized currently in the recording data;
and only the recording data corresponding to the function to be realized currently is stored.
4. The multi-speech recognition method of claim 1, wherein, after or while the recording data is collected by the recording component built in the terminal, the method further comprises:
and preprocessing the recording data acquired by the recording component, and copying and storing the preprocessed recording data.
5. The multi-speech recognition method according to any one of claims 1 to 4, wherein after transmitting the resampled data obtained by the resampling to a speech recognition library corresponding to the function to be currently implemented for speech recognition, the method further comprises:
and displaying the result after the voice recognition.
6. A multi-speech recognition method, comprising:
recording data are collected through a recording component arranged in the jukebox;
copying and storing the recording data acquired by the recording component;
if the function to be realized at present is the intelligent voice recognition function of the jukebox, resampling the copied recording data by adopting the audio data sampling rate corresponding to the intelligent voice recognition function to obtain first resampled data;
if the function to be realized currently is the humming song-ordering function of the jukebox, resampling the copied recording data by adopting the audio data sampling rate corresponding to the humming song-ordering function to obtain second resampled data;
and transmitting the first resampled data to the voice instruction recognition library corresponding to the intelligent voice recognition function for voice recognition, and transmitting the second resampled data to the humming voice recognition library corresponding to the humming song-ordering function for voice recognition.
7. A multi-speech recognition apparatus, comprising:
the data acquisition module is used for acquiring recording data through a recording component arranged in the terminal;
the data copying module is used for copying and storing the recording data acquired by the recording component;
the resampling module is used for resampling, according to the function to be realized currently, the copied recording data by adopting an audio data sampling rate corresponding to the function to be realized currently;
and the voice recognition module is used for transmitting the resampled data obtained after resampling to a voice recognition library corresponding to the current function to be realized for voice recognition.
8. A multi-speech recognition apparatus, comprising:
the jukebox recording module is used for acquiring recording data through a recording component arranged in the jukebox;
the copying module is used for copying and storing the recording data acquired by the recording component;
a first resampling module, configured to, if a function to be currently implemented is an intelligent voice recognition function of the jukebox, resample the copied recording data at an audio data sampling rate corresponding to the intelligent voice recognition function to obtain first resampled data;
a second resampling module, configured to, if the function to be currently implemented is the humming song-ordering function of the jukebox, resample the copied recording data by using an audio data sampling rate corresponding to the humming song-ordering function to obtain second resampled data;
the first recognition module is used for transmitting the first resampled data to a voice instruction recognition library corresponding to the intelligent voice recognition function for voice recognition;
and the second recognition module is used for transmitting the second resampling data to a humming voice recognition library corresponding to the humming song-ordering function for voice recognition.
9. A terminal comprising a memory, a processor and a computer program stored in the memory and executable on the processor, characterized in that the processor implements the steps of the method according to any one of claims 1 to 6 when executing the computer program.
10. A non-transitory computer-readable storage medium, storing a computer program, wherein the computer program, when executed by a processor, implements the steps of the method according to any one of claims 1 to 6.
CN202010302149.0A 2020-04-16 2020-04-16 Multi-voice recognition method, device, terminal and non-transitory computer-readable storage medium Pending CN111477218A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010302149.0A CN111477218A (en) 2020-04-16 2020-04-16 Multi-voice recognition method, device, terminal and non-transitory computer-readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010302149.0A CN111477218A (en) 2020-04-16 2020-04-16 Multi-voice recognition method, device, terminal and non-transitory computer-readable storage medium

Publications (1)

Publication Number Publication Date
CN111477218A true CN111477218A (en) 2020-07-31

Family

ID=71753764

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010302149.0A Pending CN111477218A (en) 2020-04-16 2020-04-16 Multi-voice recognition method, device, terminal and non-transitory computer-readable storage medium

Country Status (1)

Country Link
CN (1) CN111477218A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114242105A (en) * 2022-02-24 2022-03-25 麒麟软件有限公司 Method and system for implementing recording and noise reduction on Android application

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2007309979A (en) * 2006-05-16 2007-11-29 Advanced Telecommunication Research Institute International Voice processing apparatus and program
CN103886860A (en) * 2014-02-21 2014-06-25 联想(北京)有限公司 Information processing method and electronic device
CN104811777A (en) * 2014-01-23 2015-07-29 阿里巴巴集团控股有限公司 Smart television voice processing method, smart television voice processing system and smart television
CN105513590A (en) * 2015-11-23 2016-04-20 百度在线网络技术(北京)有限公司 Voice recognition method and device
CN207199291U (en) * 2017-06-19 2018-04-06 张君莉 Program request apparatus
CN109461429A (en) * 2018-10-20 2019-03-12 深圳市创成微电子有限公司 A kind of AI K song microphone speaker integrated equipment
CN110136744A (en) * 2019-05-24 2019-08-16 腾讯音乐娱乐科技(深圳)有限公司 A kind of audio-frequency fingerprint generation method, equipment and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
鞠源 (Ju Yuan): "Research and Exploration of a Humming-Based Music Retrieval System" (基于哼唱的音乐检索系统的研究与探索), 《情报杂志》 (Journal of Intelligence) *


Similar Documents

Publication Publication Date Title
CN107463700B (en) Method, device and equipment for acquiring information
TWI711967B (en) Method, device and equipment for determining broadcast voice
CN112037792B (en) Voice recognition method and device, electronic equipment and storage medium
CN109474843A (en) The method of speech control terminal, client, server
US11587560B2 (en) Voice interaction method, device, apparatus and server
EP4171018A1 (en) Subtitle generation method and apparatus, and device and storage medium
CN105930485A (en) Audio media playing method, communication device and network system
CN111556353A (en) Video playing method, video playing management device and terminal equipment
CN109710799B (en) Voice interaction method, medium, device and computing equipment
CN111192594B (en) Method for separating voice and accompaniment and related product
JP2020003774A (en) Method and apparatus for processing speech
CN111309857A (en) Processing method and processing device
CN112687286A (en) Method and device for adjusting noise reduction model of audio equipment
CN111477218A (en) Multi-voice recognition method, device, terminal and non-transitory computer-readable storage medium
CN110889009A (en) Voiceprint clustering method, voiceprint clustering device, processing equipment and computer storage medium
KR20210001082A (en) Electornic device for processing user utterance and method for operating thereof
CN111161734A (en) Voice interaction method and device based on designated scene
WO2023005193A1 (en) Subtitle display method and device
JP6944920B2 (en) Smart interactive processing methods, equipment, equipment and computer storage media
CN111145741B (en) Method and device for providing multimedia content, electronic equipment and storage medium
CN103136277A (en) Multimedia file playing method and electronic device
CN103915094A (en) Shared voice control method and device based on target name recognition
CN109495786B (en) Pre-configuration method and device of video processing parameter information and electronic equipment
CN112596846A (en) Method and device for determining interface display content, terminal equipment and storage medium
CN117608506A (en) Information display method, information display device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination