CN111081080A - Voice detection method and learning device - Google Patents

Voice detection method and learning device

Info

Publication number
CN111081080A
CN111081080A (application CN201910459781.3A; granted as CN111081080B)
Authority
CN
China
Prior art keywords
voice
pronunciation
user
standard
target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910459781.3A
Other languages
Chinese (zh)
Other versions
CN111081080B (en)
Inventor
彭婕
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
TCL China Star Optoelectronics Technology Co Ltd
Original Assignee
Shenzhen China Star Optoelectronics Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen China Star Optoelectronics Technology Co Ltd
Priority to CN201910459781.3A
Publication of CN111081080A
Application granted
Publication of CN111081080B
Legal status: Active
Anticipated expiration

Classifications

    • G — PHYSICS
    • G09 — EDUCATION; CRYPTOGRAPHY; DISPLAY; ADVERTISING; SEALS
    • G09B — EDUCATIONAL OR DEMONSTRATION APPLIANCES; APPLIANCES FOR TEACHING, OR COMMUNICATING WITH, THE BLIND, DEAF OR MUTE; MODELS; PLANETARIA; GLOBES; MAPS; DIAGRAMS
    • G09B5/00 — Electrically-operated educational appliances
    • G09B5/04 — Electrically-operated educational appliances with audible presentation of the material to be studied
    • G — PHYSICS
    • G10 — MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L — SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 — Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 — Speech or voice analysis techniques specially adapted for particular use
    • G10L25/51 — Speech or voice analysis techniques specially adapted for comparison or discrimination

Abstract

The invention relates to the technical field of education and discloses a voice detection method and a learning device. The method includes the following steps: when it is detected that a user's finger is in contact with a point-and-read page, collecting a target sound from the environment in which the learning device is located; acquiring the text information corresponding to the position where the user's finger contacts the point-and-read page, and determining the standard pronunciation of that text information; selecting, from the target sound, speech segments that match the standard pronunciation; computing the similarity between each speech segment and the standard pronunciation, and detecting whether there is a target speech segment whose similarity exceeds a preset similarity threshold; and, if so, determining that the user's pronunciation of the text information is standard. By implementing the embodiments of the invention, speech segments matching the standard pronunciation can be selected from the complete recording, so that the learning device detects only the selected segments and can thereby judge whether the speech input by the user is standard.

Description

Voice detection method and learning device
Technical Field
The invention relates to the technical field of education, and in particular to a voice detection method and a learning device.
Background
At present, most learning devices, such as home tutoring machines and learning tablets, can detect the speech of words spelled out by students. The current detection approach is generally as follows: the learning device collects the student's spelled speech within a fixed time window, analyzes the collected speech, and corrects the student's pronunciation if it is wrong. In practice, however, a student may attempt to spell an unfamiliar word several times. Because the capture window of current learning devices is limited, the device may fail to capture the student's eventual correct pronunciation and wrongly correct it anyway. This reduces the accuracy with which the device judges whether a student's spelled speech is standard, and also reduces the student's learning efficiency.
Disclosure of Invention
Embodiments of the invention disclose a voice detection method and a learning device, which allow the learning device to judge more accurately whether speech is standard.
A first aspect of the embodiments of the invention discloses a voice detection method, including:
when it is detected that a user's finger is in contact with a point-and-read page, collecting a target sound from the environment in which the learning device is located;
acquiring the text information corresponding to the position where the user's finger contacts the point-and-read page, and determining the standard pronunciation of the text information;
selecting, from the target sound, speech segments that match the standard pronunciation;
computing the similarity between each speech segment and the standard pronunciation, and detecting whether there is a target speech segment whose similarity exceeds a preset similarity threshold;
and, if so, determining that the user's pronunciation of the text information is standard.
As an optional implementation, in the first aspect of the embodiments of the invention, after detecting that no target speech segment exists, the method further includes:
determining that the user's pronunciation of the text information is wrong, and acquiring pronunciation guidance information corresponding to the text information;
outputting the pronunciation guidance information through the display screen of the learning device, and outputting the standard pronunciation through the loudspeaker of the learning device.
The method further includes:
when follow-read speech input by the user is received, detecting the similarity between the follow-read speech and the standard pronunciation;
and, when that similarity is detected to exceed the preset similarity threshold, determining that the follow-read pronunciation is standard.
As an optional implementation, in the first aspect of the embodiments of the invention, collecting the target sound from the environment in which the learning device is located when it is detected that the user's finger is in contact with the point-and-read page includes:
when it is detected that the user's finger is in contact with the point-and-read page, starting the audio capture device of the learning device so that it records the sound in the environment in which the learning device is located;
and, when it is detected that the user's finger is no longer in contact with the point-and-read page, stopping the audio capture device and taking the recorded sound as the target sound.
As an optional implementation, in the first aspect of the embodiments of the invention, acquiring the text information corresponding to the position where the user's finger contacts the point-and-read page and determining its standard pronunciation includes:
acquiring the contact trajectory of the user's finger on the point-and-read page;
identifying the contact coordinates in the point-and-read page that correspond to the contact trajectory;
acquiring the text information corresponding to those contact coordinates in the point-and-read page;
and determining, from a preset speech library, the standard pronunciation that matches the text information.
As an optional implementation, in the first aspect of the embodiments of the invention, selecting from the target sound the speech segments that match the standard pronunciation includes:
generating a first waveform spectrogram corresponding to the target sound, and a second waveform spectrogram corresponding to the standard pronunciation;
selecting, from the first waveform spectrogram, the target spectrogram segments that match the second waveform spectrogram;
and determining, from the target sound, the speech segments that correspond to the target spectrogram segments.
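As an illustration only (not part of the claims), the spectrogram-matching step above can be sketched as a sliding-window comparison of spectral envelopes. The function below is a hypothetical minimal sketch: plain lists of numbers stand in for spectrogram columns, and a normalized dot-product similarity stands in for a real spectral-matching measure (a production system would more likely use STFT frames with dynamic time warping).

```python
# Hypothetical sketch: slide the standard pronunciation's (shorter)
# spectral envelope across the target sound's envelope and keep the
# best-correlated window.

def best_matching_window(target_env, standard_env):
    """Return (start_index, score) of the window in target_env that best
    matches standard_env, using normalized dot-product similarity."""
    n, m = len(target_env), len(standard_env)
    if m == 0 or m > n:
        return None
    def norm(v):
        s = sum(x * x for x in v) ** 0.5
        return s if s > 0 else 1.0
    best = (0, -1.0)
    sn = norm(standard_env)
    for start in range(n - m + 1):
        window = target_env[start:start + m]
        score = sum(a * b for a, b in zip(window, standard_env)) / (norm(window) * sn)
        if score > best[1]:
            best = (start, score)
    return best
```

The window with the highest score locates the candidate speech segment inside the full recording; the score itself can then be compared against the preset similarity threshold.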
A second aspect of the embodiments of the invention discloses a learning device, including:
a collection unit, configured to collect a target sound from the environment in which the learning device is located when it is detected that a user's finger is in contact with a point-and-read page;
a first acquisition unit, configured to acquire the text information corresponding to the position where the user's finger contacts the point-and-read page and to determine the standard pronunciation of the text information;
a selection unit, configured to select, from the target sound, speech segments that match the standard pronunciation;
a first detection unit, configured to compute the similarity between each speech segment and the standard pronunciation and to detect whether there is a target speech segment whose similarity exceeds a preset similarity threshold;
and a first determination unit, configured to determine that the user's pronunciation of the text information is standard when the detection result of the first detection unit is positive.
As an optional implementation, in the second aspect of the embodiments of the invention, the learning device further includes:
a second acquisition unit, configured to determine that the user's pronunciation of the text information is wrong and to acquire the pronunciation guidance information corresponding to the text information when the detection result of the first detection unit is negative;
an output unit, configured to output the pronunciation guidance information through the display screen of the learning device and to output the standard pronunciation through the loudspeaker of the learning device.
The learning device further includes:
a second detection unit, configured to detect the similarity between follow-read speech input by the user and the standard pronunciation when the follow-read speech is received;
and a second determination unit, configured to determine that the follow-read pronunciation is standard when that similarity is detected to exceed the preset similarity threshold.
As an optional implementation, in the second aspect of the embodiments of the invention, the collection unit includes:
a starting subunit, configured to start the audio capture device of the learning device when it is detected that the user's finger is in contact with the point-and-read page, so that the audio capture device records the sound in the environment in which the learning device is located;
and a stopping subunit, configured to stop the audio capture device and take the recorded sound as the target sound when it is detected that the user's finger is no longer in contact with the point-and-read page.
As an optional implementation, in the second aspect of the embodiments of the invention, the first acquisition unit includes:
a first acquisition subunit, configured to acquire the contact trajectory of the user's finger on the point-and-read page;
an identification subunit, configured to identify the contact coordinates in the point-and-read page that correspond to the contact trajectory;
a second acquisition subunit, configured to acquire the text information corresponding to the contact coordinates in the point-and-read page;
and a first determination subunit, configured to determine, from a preset speech library, the standard pronunciation that matches the text information.
As an optional implementation, in the second aspect of the embodiments of the invention, the selection unit includes:
a generation subunit, configured to generate a first waveform spectrogram corresponding to the target sound and a second waveform spectrogram corresponding to the standard pronunciation;
a selection subunit, configured to select, from the first waveform spectrogram, the target spectrogram segments that match the second waveform spectrogram;
and a second determination subunit, configured to determine, from the target sound, the speech segments that correspond to the target spectrogram segments.
A third aspect of the embodiments of the invention discloses an electronic device, including:
a memory storing executable program code; and
a processor coupled to the memory;
wherein the processor calls the executable program code stored in the memory to perform some or all of the steps of any method of the first aspect.
A fourth aspect of the embodiments of the invention discloses a computer-readable storage medium storing program code, where the program code includes instructions for performing some or all of the steps of any method of the first aspect.
A fifth aspect of the embodiments of the invention discloses a computer program product which, when run on a computer, causes the computer to perform some or all of the steps of any method of the first aspect.
A sixth aspect of the embodiments of the invention discloses an application publishing platform configured to publish a computer program product which, when run on a computer, causes the computer to perform some or all of the steps of any method of the first aspect.
Compared with the prior art, the embodiments of the invention have the following beneficial effects:
In the embodiments of the invention, when it is detected that the user's finger is in contact with the point-and-read page, the target sound in the environment of the learning device is collected; the text information corresponding to the contact position is acquired and its standard pronunciation determined; speech segments matching the standard pronunciation are selected from the target sound; the similarity between each segment and the standard pronunciation is computed, and it is detected whether there is a target segment whose similarity exceeds the preset threshold; if so, the user's pronunciation of the text information is determined to be standard. The user's speech is thus collected for as long as the user keeps pointing at the word being spelled, which guarantees that the collected speech is complete; the matching segments are then selected from this complete speech, so that the learning device detects only the selected segments and can judge whether the speech input by the user is standard.
Drawings
In order to explain the technical solutions in the embodiments of the present invention more clearly, the drawings needed in the embodiments are briefly described below. Obviously, the drawings described below show only some embodiments of the present invention, and those skilled in the art can derive other drawings from them without creative effort.
FIG. 1 is a schematic flow chart of a voice detection method disclosed in an embodiment of the present invention;
FIG. 2 is a schematic flow chart of another voice detection method disclosed in an embodiment of the present invention;
FIG. 3 is a schematic flow chart of another voice detection method disclosed in an embodiment of the present invention;
FIG. 4 is a schematic structural diagram of a learning device disclosed in an embodiment of the present invention;
FIG. 5 is a schematic structural diagram of another learning device disclosed in an embodiment of the present invention;
FIG. 6 is a schematic structural diagram of another learning device disclosed in an embodiment of the present invention;
FIG. 7 is a schematic structural diagram of an electronic device disclosed in an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention are described below clearly and completely with reference to the drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those skilled in the art from these embodiments without creative effort fall within the protection scope of the present invention.
It should be noted that the terms "comprises", "comprising", and any variations thereof in the embodiments and drawings of the present invention are intended to cover non-exclusive inclusion. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not limited to those steps or elements, but may include other steps or elements not listed or inherent to such a process, method, system, article, or apparatus.
The embodiments of the invention disclose a voice detection method and a learning device that can select, from the complete speech, the segments matching the standard pronunciation, so that the learning device detects only those segments and can thereby judge more accurately whether the speech is standard. Detailed descriptions follow.
Example one
Referring to fig. 1, fig. 1 is a schematic flow chart of a voice detection method disclosed in an embodiment of the present invention. As shown in fig. 1, the voice detection method may include the following steps:
101. When it is detected that the user's finger is in contact with the point-and-read page, the learning device collects the target sound from the environment in which it is located.
In the embodiment of the invention, the learning device may be a home tutoring machine, a learning tablet, or the like. The finger being "in contact" with the point-and-read page means the finger is touching the page (for example, pressing, moving, or swiping on it). The target sound is collected only while the user's finger remains in contact with the page; once the finger leaves the page, the learning device stops collecting. The point-and-read page may be a book page, a sheet of paper, or the like, or it may be the display screen of the learning device; the embodiments of the invention do not limit this. The learning device may collect the environmental sound through an audio capture device such as a microphone.
102. The learning device acquires the text information corresponding to the position where the user's finger contacts the point-and-read page, and determines the standard pronunciation of the text information.
In the embodiment of the invention, the user's finger may tap the point-and-read page, in which case the contact position is identified as a single point in the page; or the finger may move across the page, in which case the contact position is identified as a line, which may be a line segment or a curve; the embodiments of the invention do not limit this. If the contact position is a point, the learning device acquires the text information corresponding to that point. If the contact position is a line, the learning device determines the area covered by the line and acquires the text information contained in that area of the page.
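As an illustration only (not part of the claims), resolving a tap or a swipe into text can be sketched as a point-in-rectangle test against per-word bounding boxes. The function name and box layout below are hypothetical:

```python
# Hypothetical sketch: resolve a touch point or swipe into the text on
# the page. Each word occupies a rectangular region; a tap selects the
# word under the point, a swipe selects every word whose box the path
# crosses.

def words_at_contact(path, word_boxes):
    """path: list of (x, y) touch samples; word_boxes: list of
    (word, (x0, y0, x1, y1)). Returns the touched words in page order."""
    hit = []
    for word, (x0, y0, x1, y1) in word_boxes:
        if any(x0 <= x <= x1 and y0 <= y <= y1 for x, y in path):
            hit.append(word)
    return hit
```

A single-sample path models the tap case; a multi-sample path models the moving-finger case, where every word crossed by the trajectory is collected.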
As an optional implementation, the way the learning device acquires the text information at the contact position and determines its standard pronunciation may include the following steps:
the learning device acquires the text information corresponding to the position where the user's finger contacts the point-and-read page;
the learning device identifies whether the text information consists of English letters;
if it does not consist of English letters, the learning device identifies the target language of the text information and determines its standard pronunciation from the pronunciation library for that language;
if it does consist of English letters, the learning device identifies the pronunciation rule of the target sound;
when the pronunciation rule is identified as a pinyin rule, the learning device treats the letters in the text information as pinyin and determines the standard pronunciation from a preset pinyin pronunciation library;
when the pronunciation rule is identified as an English rule, the learning device treats the letters as English and determines the standard pronunciation from a preset English pronunciation library.
By implementing this implementation, the standard pronunciation can be determined from different pronunciation libraries according to the language of the text information. When the text information is found to be a combination of Latin letters, it can be classified as English or pinyin according to the pronunciation rule of the sound the user actually produced, and a different standard pronunciation is then selected for each rule, making the determination of the standard pronunciation more accurate.
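As an illustration only (not part of the claims), the library-dispatch logic above can be sketched with a toy pinyin-syllable inventory. The syllable set, function name, and return labels below are all hypothetical; a real system would use a full pinyin syllable table and acoustic evidence from the user's speech:

```python
# Hypothetical sketch of the pronunciation-library dispatch: Latin-letter
# text whose every token parses as a pinyin syllable is routed to the
# pinyin library; other Latin-letter text is routed to the English
# library; non-Latin text uses its own language's library.

PINYIN_SYLLABLES = {"ma", "mao", "zhong", "guo", "xue", "sheng"}  # toy inventory

def choose_library(tokens):
    if not all(t.isalpha() and t.isascii() for t in tokens):
        return "language-specific"   # non-Latin text: use that language's library
    if all(t.lower() in PINYIN_SYLLABLES for t in tokens):
        return "pinyin"
    return "english"
```
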
103. The learning device selects, from the target sound, the speech segments that match the standard pronunciation.
As an optional implementation, the way the learning device selects the matching speech segments from the target sound may include the following steps:
the learning device applies noise reduction to the target sound to obtain a denoised target sound;
the learning device identifies the human voice in the denoised target sound through human-voice recognition;
the learning device acquires the voiceprint information matching the user's identity;
the learning device extracts, from the identified voice, the target speech that matches that voiceprint;
the learning device matches the target speech against the standard pronunciation so as to split, from the target speech, the speech segments that match the standard pronunciation.
In this implementation, the environment of the learning device may contain the voices of people other than the user as well as non-voice sounds. The learning device therefore first denoises the collected sound to obtain a cleaner signal, and then extracts from it the target speech matching the voiceprint of the device's user, so that the segments split from the target speech are indeed speech produced by the user, which safeguards the accuracy of the detection result.
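As an illustration only (not part of the claims), the denoise → human-voice → voiceprint pipeline of step 103 can be sketched as three successive filters over audio frames. The dict-based frame representation, field names, and energy gate below are hypothetical stand-ins for real signal processing:

```python
# Hypothetical sketch of the extraction pipeline: drop low-energy noise
# frames, keep only frames classified as human speech, then keep only
# frames whose speaker matches the enrolled user's voiceprint.

def extract_user_speech(frames, user_id, gate=0.1):
    """frames: list of dicts {'energy', 'is_speech', 'speaker'}."""
    denoised = [f for f in frames if f["energy"] > gate]    # crude noise gate
    voiced = [f for f in denoised if f["is_speech"]]        # human voice only
    return [f for f in voiced if f["speaker"] == user_id]   # voiceprint match
```

The surviving frames are the "target speech" from which matching segments are then split.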
104. The learning device computes the similarity between each speech segment and the standard pronunciation, and detects whether there is a target speech segment whose similarity exceeds the preset similarity threshold; if so, step 105 is executed; if not, the flow ends.
In the embodiment of the invention, the preset similarity threshold may be derived from the user's past pronunciations. The learning device may retrieve a large amount of pre-stored historical records of the user's previously correct spellings, where each record contains a historical standard pronunciation and the corresponding historical spelled speech. The learning device computes the similarity between the standard pronunciation and the spelled speech in each record, averages all of these historical similarities, and uses the average as the preset threshold. The threshold is thus tied to how the user has spelled in the past, which improves the accuracy with which the learning device evaluates the user's spelled speech.
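As an illustration only (not part of the claims), deriving the preset threshold from history reduces to a simple average over stored similarity scores. The record layout and default value below are hypothetical:

```python
# Hypothetical sketch: the preset similarity threshold is the average of
# the similarities stored with the user's past correct spellings.

def preset_similarity(history, default=0.8):
    """history: list of (standard_pron, spelled_speech, similarity).
    Falls back to a default threshold when there is no history."""
    sims = [s for _, _, s in history]
    return sum(sims) / len(sims) if sims else default
```
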
105. The learning device determines that the user's pronunciation of the text information is standard.
With the method described in fig. 1, speech segments matching the standard pronunciation can be selected from the complete speech, so that the learning device detects only the selected segments and can judge more accurately whether the speech is standard. The method also makes the determination of the standard pronunciation itself more accurate, and safeguards the accuracy of the detection result.
Example two
Referring to fig. 2, fig. 2 is a schematic flow chart of another voice detection method disclosed in an embodiment of the present invention. As shown in fig. 2, the voice detection method may include the following steps:
201. When it is detected that the user's finger is in contact with the point-and-read page, the learning device starts its audio capture device so that the audio capture device records the sound in the environment in which the learning device is located.
In the embodiment of the invention, the audio capture device may be a microphone, a sound pickup, or the like, through which all the sound present in the environment of the learning device can be recorded.
202. When it is detected that the user's finger is no longer in contact with the point-and-read page, the learning device stops the audio capture device and takes the recorded sound as the target sound.
In the embodiment of the invention, by implementing steps 201 to 202, sound is collected while the user's finger is touching the point-and-read page, and collection stops as soon as the finger leaves the page. The duration of the recording is therefore controlled by the user of the learning device, which improves the interactivity between the learning device and the user.
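As an illustration only (not part of the claims), steps 201 to 202 amount to a small touch-gated state machine around the recorder. The class name and tick-based interface below are hypothetical:

```python
# Hypothetical sketch of steps 201-202: recording is gated by the touch
# state, so the user controls the capture window by how long the finger
# stays on the page.

class TouchGatedRecorder:
    def __init__(self):
        self.recording = False
        self.samples = []

    def on_touch(self, touching, audio_chunk=None):
        """Call once per tick with the current touch state. Returns the
        captured target sound on the tick the finger leaves the page."""
        if touching:
            self.recording = True
            if audio_chunk is not None:
                self.samples.append(audio_chunk)
        elif self.recording:
            self.recording = False
            captured, self.samples = self.samples, []
            return captured      # target sound: everything while touching
        return None
```
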
203. The learning device acquires the text information corresponding to the position where the user's finger contacts the point-and-read page, and determines the standard pronunciation of the text information.
204. The learning device selects, from the target sound, the speech segments that match the standard pronunciation.
205. The learning device computes the similarity between each speech segment and the standard pronunciation, and detects whether there is a target speech segment whose similarity exceeds the preset similarity threshold; if so, step 206 is executed; if not, steps 207 to 210 are executed.
206. The learning device determines that the user's pronunciation of the text information is standard.
207. The learning device determines that the user's pronunciation of the text information is wrong, and acquires the pronunciation guidance information corresponding to the text information.
In the embodiment of the invention, the pronunciation guidance information is material that guides the user toward the standard pronunciation. For example, when the text information is a Chinese character or word, the guidance information may be its pinyin; by outputting the pinyin, the learning device guides the user in spelling out the text information.
208. The learning device outputs the pronunciation guidance information through its display screen and outputs the standard pronunciation through its loudspeaker.
In the embodiment of the invention, the learning device may drive the display screen and the loudspeaker simultaneously, so that the user hears the standard pronunciation while viewing the guidance information and thereby forms a deeper impression of the standard pronunciation of the text information. Alternatively, the display screen and loudspeaker need not output at the same time, and the content to be output can be selected through an instruction entered by the user.
209. When follow-read speech input by the user is received, the learning device detects the similarity between the follow-read speech and the standard pronunciation.
In the embodiment of the invention, the learning device may also start the audio capture device while outputting the guidance information and/or the standard pronunciation, so that the device records the user's follow-read pronunciation and can check its accuracy, further guide the user's pronunciation, and improve how accurately the user learns the standard pronunciation.
210. When that similarity is detected to exceed the preset similarity threshold, the learning device determines that the follow-read pronunciation is standard.
In the embodiment of the invention, by implementing steps 207 to 210, the standard pronunciation is output whenever the user's spelled speech is detected to be wrong, so that the user can read along with it; this corrects the user's wrong pronunciation and improves the accuracy of the user's spelling of the text information.
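As an illustration only (not part of the claims), the branching of steps 205 to 210 can be condensed into one decision function. The function name and verdict labels below are hypothetical:

```python
# Hypothetical sketch of steps 205-210: check the spelled speech against
# the threshold; on failure, guidance is issued and the follow-read
# speech is re-checked the same way.

def evaluate(similarities, followup_similarity, threshold):
    """similarities: scores of candidate segments vs. the standard
    pronunciation; followup_similarity: score of the follow-read speech,
    or None if no follow-read occurred. Returns a verdict string."""
    if any(s > threshold for s in similarities):
        return "standard"                     # step 206
    if followup_similarity is not None and followup_similarity > threshold:
        return "standard-after-guidance"      # step 210
    return "needs-guidance"                   # steps 207-208
```
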
In the method described in fig. 2, a speech segment matching the standard pronunciation can be selected from the complete speech, so that the learning device detects only the selected speech segment and can therefore judge more accurately whether the pronunciation is standard. In addition, implementing the method described in fig. 2 improves the interactivity between the learning device and the user, and improves the accuracy with which the user spells and reads the text information.
Example Three
Referring to fig. 3, fig. 3 is a schematic flow chart of another speech detection method according to an embodiment of the present invention. As shown in fig. 3, the voice detection method may include the steps of:
301. When detecting that the user's finger is in contact with the tap-to-read page, the learning device collects a target sound in the environment where the learning device is located.
302. The learning device acquires the contact trajectory of the user's finger on the tap-to-read page.
303. The learning device identifies the contact coordinates corresponding to the contact trajectory in the tap-to-read page.
In the embodiment of the present invention, the learning device may establish a planar rectangular coordinate system on the basis of the tap-to-read page, so that any point in the page has a unique pair of coordinates. Each character in the page may correspond to multiple coordinates, while any single coordinate corresponds to at most one character.
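The coordinate lookup above can be sketched as follows. This is a hypothetical illustration, not the patent's implementation: the `CharBox` type, the box geometry, and the first-match lookup are all assumptions; the only property taken from the text is that one character spans many coordinates while any single coordinate maps to at most one character.

```python
from dataclasses import dataclass


@dataclass
class CharBox:
    """Bounding box of one printed character on the tap-to-read page (hypothetical)."""
    char: str
    x0: float  # lower-left corner
    y0: float
    x1: float  # upper-right corner
    y1: float


def chars_on_trajectory(boxes, trajectory):
    """Return the distinct characters whose boxes the contact trajectory
    passes through, in first-touch order."""
    seen, result = set(), []
    for x, y in trajectory:
        for box in boxes:
            if box.x0 <= x <= box.x1 and box.y0 <= y <= box.y1:
                if box.char not in seen:
                    seen.add(box.char)
                    result.append(box.char)
                break  # a single coordinate corresponds to a unique character
    return result
```

Dragging a finger across several adjacent character boxes would thus yield the characters in reading order, matching the "more comprehensive text information" described for the trajectory-based acquisition.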
304. The learning device acquires the text information corresponding to the contact coordinates in the tap-to-read page.
305. The learning device determines the standard pronunciation matching the text information from a preset voice library.
In the embodiment of the present invention, the preset voice library may be a local voice library pre-established on the learning device, or a network voice library that the learning device accesses externally; the embodiment of the present invention is not limited in this respect.
In the embodiment of the present invention, by implementing steps 302 to 305, the movement trajectory of the user's finger on the tap-to-read page can be detected and all the text information corresponding to that trajectory can be acquired, so that the text information acquired by the learning device is more comprehensive.
306. The learning device generates a first waveform spectrogram corresponding to the target sound and generates a second waveform spectrogram corresponding to the standard pronunciation.
In the embodiment of the present invention, a waveform spectrogram can show the waveform of a sound at each moment, the frequencies it contains, and other information. Identical pronunciations produce identical waveform spectrograms, so whether the target sound matches the standard pronunciation can be determined by comparing the waveform spectrograms of the target sound and the standard pronunciation.
307. The learning device selects a target waveform spectrum segment matching the second waveform spectrogram from the first waveform spectrogram.
In the embodiment of the present invention, the learning device may determine multiple target waveform spectrum segments similar to the second waveform spectrogram from the first waveform spectrogram. If the learning device does not find any spectrum segment in the first waveform spectrogram similar to the second waveform spectrogram, it may directly determine that the user failed to spell the text information correctly.
308. The learning device determines, from the target sound, the speech segment corresponding to the target waveform spectrum segment.
In the embodiment of the present invention, by implementing steps 306 to 308, the waveform spectrograms corresponding to the target sound and to the standard pronunciation are obtained first, and the two spectrograms are then compared to find similar spectrum segments, so that the speech segment determined from a similar spectrum segment is more accurate.
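One way to realize steps 306 to 308 is a sliding-window comparison of the two spectrograms. The sketch below is a simplified assumption, not the device's actual method: spectrograms are modeled as 2-D NumPy arrays (frequency bins by time frames), and the match score is normalized cross-correlation.

```python
import numpy as np


def best_matching_segment(target_spec, standard_spec):
    """Slide the standard spectrogram along the time axis of the target
    spectrogram and return (start_frame, score) of the best window.
    The score is normalized cross-correlation, so an exact embedded
    copy of the standard scores 1.0."""
    n = standard_spec.shape[1]  # number of time frames in the standard
    best_start, best_score = -1, -1.0
    for start in range(target_spec.shape[1] - n + 1):
        window = target_spec[:, start:start + n]
        denom = float(np.linalg.norm(window)) * float(np.linalg.norm(standard_spec))
        score = float(np.sum(window * standard_spec)) / denom if denom else 0.0
        if score > best_score:
            best_start, best_score = start, score
    return best_start, best_score
```

The frames of the target sound covered by the best window would then be cut out as the speech segment of step 308.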
309. The learning device calculates the target similarity between the speech segment and the standard pronunciation and detects whether a target speech segment whose target similarity is greater than the preset similarity exists; if so, step 310 is executed; if not, the flow ends.
As an alternative embodiment, the way in which the learning device calculates the target similarity between the speech segment and the standard pronunciation may include the following steps:
the learning device determines the waveform spectrum segment of the speech segment;
the learning device acquires the standard abscissa and the standard ordinate of the second waveform spectrogram corresponding to the standard pronunciation;
the learning device adjusts the target abscissa of the waveform spectrum segment to the standard abscissa and adjusts the target ordinate to the standard ordinate, thereby obtaining an adjusted target waveform spectrum segment;
the learning device calculates the coincidence rate between the target waveform spectrum segment and the second waveform spectrogram;
the learning device determines the coincidence rate as the target similarity between the speech segment and the standard pronunciation.
By implementing this embodiment, the abscissa and ordinate of the waveform spectrum segment can be adjusted to match those of the second waveform spectrogram, so that the two are compared under the same standard and the resulting target similarity between the speech segment and the standard pronunciation is more reliable.
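A toy version of this coincidence-rate computation, under stated assumptions: the "waveform spectrum segment" is represented as a 1-D amplitude array, adjusting the abscissa is modeled as resampling onto the reference's time axis, adjusting the ordinate as rescaling to the same peak amplitude, and the coincidence rate as the fraction of points that then agree within a tolerance. All of these modeling choices are assumptions for illustration.

```python
import numpy as np


def coincidence_rate(segment, reference, tolerance=0.1):
    """Fraction of sample points at which the segment and the reference
    agree within `tolerance`, after putting both on the same axes."""
    seg = np.asarray(segment, dtype=float)
    ref = np.asarray(reference, dtype=float)
    # 'Adjust the abscissa': resample the segment onto the reference's time axis.
    ref_axis = np.linspace(0.0, 1.0, len(ref))
    seg_axis = np.linspace(0.0, 1.0, len(seg))
    seg = np.interp(ref_axis, seg_axis, seg)

    # 'Adjust the ordinate': normalize both curves to unit peak amplitude.
    def unit_peak(a):
        peak = np.max(np.abs(a))
        return a / peak if peak > 0 else a

    seg, ref = unit_peak(seg), unit_peak(ref)
    # Coincidence rate: share of points whose amplitudes now agree.
    return float(np.mean(np.abs(seg - ref) <= tolerance))
```

A segment with the same shape as the reference but a different sampling rate or amplitude scale then still scores near 1.0, which is the point of comparing "under the same standard".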
310. The learning device determines that the user's pronunciation when spelling the text information is standard.
In the method described in fig. 3, a speech segment matching the standard pronunciation can be selected from the complete speech, so that the learning device detects only the selected speech segment and can therefore judge more accurately whether the pronunciation is standard. In addition, implementing the method described in fig. 3 makes the text information acquired by the learning device more comprehensive, makes the speech segment obtained by comparing the two waveform spectrograms more accurate, and makes the target similarity between that speech segment and the standard pronunciation more reliable.
Example Four
Referring to fig. 4, fig. 4 is a schematic structural diagram of a learning device according to an embodiment of the present invention. As shown in fig. 4, the learning device may include:
the acquisition unit 401 is configured to acquire a target sound in an environment where the learning device is located when it is detected that the finger of the user is in a contact state with the click-to-read page.
The first obtaining unit 402 is configured to obtain text information corresponding to a contact position of a finger of a user and a touch page, and determine a standard pronunciation of the text information.
As an optional implementation, the way in which the first obtaining unit 402 acquires the text information corresponding to the contact position of the user's finger on the tap-to-read page and determines its standard pronunciation may specifically be:
acquiring the text information corresponding to the contact position of the user's finger on the tap-to-read page;
identifying whether the text information is composed of English letters;
if it is not composed of English letters, identifying the target language corresponding to the text information, and determining the standard pronunciation of the text information from the language pronunciation library corresponding to the target language;
if it is composed of English letters, identifying the pronunciation rule of the target sound;
when the pronunciation rule is identified as a pinyin rule, determining that the English letters contained in the text information are pinyin, and determining the standard pronunciation of the text information from a preset pinyin pronunciation library;
and when the pronunciation rule is identified as an English rule, determining that the English letters contained in the text information are English, and determining the standard pronunciation of the text information from a preset English pronunciation library.
By implementing this implementation, the standard pronunciation of the text information can be determined from different language pronunciation libraries according to the language of the text information. When the text information is detected to be a combination of English letters, whether it is English or pinyin can be determined according to the pronunciation rule of the target sound input by the user, and different standard pronunciations are then generated according to the different pronunciation rules, so that the determination of the standard pronunciation is more accurate.
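The pinyin-versus-English decision can be illustrated with a greedy syllable segmentation over the letter string. The syllable set below is a tiny illustrative subset of pinyin, and the longest-match rule is an assumption; a real device would presumably use a complete syllable inventory together with the pronunciation rule of the user's target sound, as the text describes.

```python
# A small illustrative subset of legal pinyin syllables (not the full inventory).
PINYIN_SYLLABLES = {
    "ni", "hao", "ma", "zhong", "guo", "xue", "sheng", "shi", "de",
}


def classify_letters(text):
    """Return 'pinyin' if the letter string segments cleanly into known
    pinyin syllables (greedy longest-match), otherwise 'english'."""
    s = text.lower()
    i = 0
    while i < len(s):
        # Try the longest candidate syllable first (pinyin syllables
        # have at most 6 letters).
        for length in range(min(6, len(s) - i), 0, -1):
            if s[i:i + length] in PINYIN_SYLLABLES:
                i += length
                break
        else:
            return "english"  # no syllable matches at this position
    return "pinyin"
```

The classification result would then select the pinyin pronunciation library or the English pronunciation library for the lookup of the standard pronunciation.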
A selecting unit 403, configured to select a speech segment matching the standard pronunciation acquired by the first obtaining unit 402 from the target sound collected by the acquisition unit 401.
As an alternative implementation, the way in which the selecting unit 403 selects the speech segment matching the standard pronunciation from the target sound may specifically be:
processing the target sound with a noise reduction technique to obtain a noise-reduced target sound;
recognizing human voice in the noise-reduced target sound with a voice recognition technique;
acquiring the voiceprint information matching the identity information of the user;
extracting, from the recognized voice, the target voice matching the voiceprint information;
and matching the target voice against the standard pronunciation so as to segment, from the target voice, the speech segments that match the standard pronunciation.
In this embodiment, the environment where the learning device is located may contain the voices of people other than the user as well as non-human sounds. The learning device may therefore first apply noise reduction to the collected sound to obtain a noise-reduced target sound with less interference, and then extract from it the target voice that matches the voiceprint information of the device's user. The speech segments cut from that target voice are thus the voice input by the user, ensuring the accuracy of the voice detection result.
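A heavily simplified stand-in for this pipeline: noise reduction is approximated by an RMS noise gate, and the voiceprint match by cosine similarity of magnitude spectra. A real device would use proper denoising and speaker-verification models; every threshold and function here is an assumption for illustration.

```python
import numpy as np


def noise_gate(signal, threshold=0.05, frame=160):
    """Crude noise reduction: zero every frame whose RMS falls below the
    threshold, keeping only the louder (presumably voiced) frames."""
    out = np.asarray(signal, dtype=float).copy()
    for start in range(0, len(out), frame):
        chunk = out[start:start + frame]
        if np.sqrt(np.mean(chunk ** 2)) < threshold:
            out[start:start + frame] = 0.0
    return out


def voiceprint_similarity(a, b, n_fft=256):
    """Cosine similarity of magnitude spectra: a toy stand-in for matching
    a recovered voice against the user's stored voiceprint."""
    sa = np.abs(np.fft.rfft(np.asarray(a, dtype=float), n_fft))
    sb = np.abs(np.fft.rfft(np.asarray(b, dtype=float), n_fft))
    return float(np.dot(sa, sb) / (np.linalg.norm(sa) * np.linalg.norm(sb)))
```

Segments surviving the gate whose similarity to the stored voiceprint exceeds some threshold would be treated as the user's own voice and passed on to the matching step.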
The first detecting unit 404 is configured to calculate the target similarity between the speech segment selected by the selecting unit 403 and the standard pronunciation acquired by the first obtaining unit 402, and to detect whether a target speech segment whose target similarity is greater than the preset similarity exists.
As an optional implementation, the way in which the first detecting unit 404 calculates the target similarity between the speech segment and the standard pronunciation may specifically be:
determining the waveform spectrum segment of the speech segment;
acquiring the standard abscissa and the standard ordinate of the second waveform spectrogram corresponding to the standard pronunciation;
adjusting the target abscissa of the waveform spectrum segment to the standard abscissa and the target ordinate to the standard ordinate, thereby obtaining an adjusted target waveform spectrum segment;
calculating the coincidence rate between the target waveform spectrum segment and the second waveform spectrogram;
and determining the coincidence rate as the target similarity between the speech segment and the standard pronunciation.
By implementing this implementation, the abscissa and ordinate of the waveform spectrum segment can be adjusted to match those of the second waveform spectrogram, so that the two are compared under the same standard and the resulting target similarity between the speech segment and the standard pronunciation is more reliable.
A first determining unit 405, configured to determine that the user's pronunciation when spelling the text information is standard when the detection result of the first detecting unit 404 is positive.
It can be seen that, by implementing the learning device described in fig. 4, a speech segment matching the standard pronunciation can be selected from the complete speech, so that the learning device detects only the selected speech segment and can judge more accurately whether the pronunciation is standard. In addition, the learning device described in fig. 4 makes the determination of the standard pronunciation more accurate, ensures the accuracy of the voice detection result, and makes the target similarity between the obtained speech segment and the standard pronunciation more reliable.
Example Five
Referring to fig. 5, fig. 5 is a schematic structural diagram of another learning device according to an embodiment of the present invention. The learning device shown in fig. 5 is obtained by optimizing the learning device shown in fig. 4. The learning device shown in fig. 5 may further include:
a second obtaining unit 406, configured to, when the detection result of the first detecting unit 404 is negative, determine that the user's pronunciation when spelling the text information is incorrect and acquire the pronunciation guide information corresponding to the text information.
An output unit 407, configured to output the pronunciation guide information acquired by the second obtaining unit 406 through the display screen of the learning device, and to output the standard pronunciation acquired by the first obtaining unit 402 through the loudspeaker of the learning device.
The second detecting unit 408 is configured to detect, when a read-after voice input by the user is received, the current similarity between the read-after voice and the standard pronunciation acquired by the first obtaining unit 402.
The second determining unit 409 is configured to determine that the pronunciation of the read-after voice is standard when the second detecting unit 408 detects that the current similarity is greater than the preset similarity.
In the embodiment of the present invention, when it is detected that the user's voice when spelling the text information is incorrect, the standard pronunciation can be output so that the user can read after it, thereby correcting the user's incorrect pronunciation and improving the accuracy with which the user spells and reads the text information.
As an alternative embodiment, the acquisition unit 401 of the learning device shown in fig. 5 may include:
an opening subunit 4011, configured to, when it is detected that the user's finger comes into contact with the tap-to-read page, turn on the audio acquisition device of the learning device so that the audio acquisition device collects the sound in the environment where the learning device is located;
and a closing subunit 4012, configured to, when it is detected that the user's finger is no longer in contact with the tap-to-read page, turn off the audio acquisition device turned on by the opening subunit 4011 and obtain the target sound collected by the audio acquisition device.
By implementing this implementation, sound in the environment where the learning device is located is collected while the user's finger is in contact with the tap-to-read page, and collection stops when the finger is detected to leave the page. The duration of the sound collected by the learning device can therefore be controlled by its user, improving the interactivity between the learning device and the user.
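The open/close behaviour of the two subunits can be sketched as a small state machine. The class and method names below are hypothetical; they only mirror the touch-down/touch-up logic described above.

```python
class TouchTriggeredRecorder:
    """Sketch of the opening/closing subunits: capture audio only while
    the user's finger stays in contact with the tap-to-read page."""

    def __init__(self):
        self.recording = False
        self.buffer = []
        self.captured = None  # target sound from the most recent touch

    def on_touch_down(self):
        # Opening subunit: the finger touched the page, start collecting.
        self.recording = True
        self.buffer = []

    def on_audio(self, samples):
        # Audio frames are kept only while the finger is in contact.
        if self.recording:
            self.buffer.extend(samples)

    def on_touch_up(self):
        # Closing subunit: the finger left the page; stop and hand back
        # the collected target sound.
        self.recording = False
        self.captured = list(self.buffer)
        return self.captured
```

Because recording starts on touch-down and ends on touch-up, the duration of the captured target sound is exactly the duration of the contact, which is the user-controlled behaviour the text describes.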
It can be seen that, by implementing the learning device described in fig. 5, a speech segment matching the standard pronunciation can be selected from the complete speech, so that the learning device detects only the selected speech segment and can judge more accurately whether the pronunciation is standard. In addition, the learning device described in fig. 5 improves the accuracy with which the user spells and reads the text information, and improves the interactivity between the learning device and the user.
Example Six
Referring to fig. 6, fig. 6 is a schematic structural diagram of another learning device according to an embodiment of the present invention. The learning device shown in fig. 6 is obtained by optimizing the learning device shown in fig. 5. The first obtaining unit 402 of the learning device shown in fig. 6 may include:
the first obtaining subunit 4021 is configured to obtain a contact trajectory of a finger of a user contacting the page to be read.
The identifying subunit 4022 is configured to identify, in the page to be read, a contact coordinate corresponding to the contact trajectory acquired by the first acquiring subunit 4021.
The second obtaining subunit 4023 is configured to obtain text information corresponding to the touch coordinate identified by the identifying subunit 4022 in the page of touch reading.
The first determining subunit 4024 is configured to determine, from a preset voice library, a standard pronunciation matched with the text information acquired by the second acquiring subunit 4023.
In the embodiment of the invention, the moving track of the finger of the user on the click-to-read page can be detected, and all the character information corresponding to the moving track can be acquired, so that the character information acquired by the learning equipment is more comprehensive.
As an alternative embodiment, the selecting unit 403 of the learning device shown in fig. 6 may include:
a generating subunit 4031, configured to generate the first waveform spectrogram corresponding to the target sound collected by the acquisition unit 401 and the second waveform spectrogram corresponding to the standard pronunciation acquired by the first obtaining unit 402;
a selecting subunit 4032, configured to select, from the first waveform spectrogram generated by the generating subunit 4031, a target waveform spectrum segment matching the second waveform spectrogram;
and a second determining subunit 4033, configured to determine, from the target sound collected by the acquisition unit 401, the speech segment corresponding to the target waveform spectrum segment selected by the selecting subunit 4032.
By implementing this implementation, the waveform spectrograms corresponding to the target sound and to the standard pronunciation are obtained first, and the two spectrograms are then compared to find similar spectrum segments, so that the speech segment determined from a similar spectrum segment is more accurate.
It can be seen that, by implementing the learning device described in fig. 6, a speech segment matching the standard pronunciation can be selected from the complete speech, so that the learning device detects only the selected speech segment and can judge more accurately whether the pronunciation is standard. In addition, the learning device described in fig. 6 makes the text information acquired by the learning device more comprehensive, and makes the speech segment obtained by comparing the two waveform spectrograms more accurate.
Example Seven
Referring to fig. 7, fig. 7 is a schematic structural diagram of an electronic device according to an embodiment of the disclosure. As shown in fig. 7, the electronic device may include:
a memory 701 in which executable program code is stored;
a processor 702 coupled to the memory 701;
wherein, the processor 702 calls the executable program code stored in the memory 701 to execute part or all of the steps of the method in the above method embodiments.
The embodiment of the present invention also discloses a computer-readable storage medium storing program code, wherein the program code includes instructions for executing part or all of the steps of the methods in the above method embodiments.
Embodiments of the present invention also disclose a computer program product, wherein, when the computer program product is run on a computer, the computer is caused to execute part or all of the steps of the method as in the above method embodiments.
The embodiment of the present invention also discloses an application publishing platform, wherein the application publishing platform is used for publishing a computer program product, and when the computer program product runs on a computer, the computer is caused to execute part or all of the steps of the method in the above method embodiments.
It should be appreciated that reference throughout this specification to "an embodiment of the present invention" means that a particular feature, structure or characteristic described in connection with the embodiment is included in at least one embodiment of the present invention. Thus, the appearances of the phrase "in embodiments of the invention" appearing in various places throughout the specification are not necessarily all referring to the same embodiments. Furthermore, the particular features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. Those skilled in the art should also appreciate that the embodiments described in this specification are exemplary and alternative embodiments, and that the acts and modules illustrated are not required in order to practice the invention.
In various embodiments of the present invention, it should be understood that the sequence numbers of the above-mentioned processes do not imply an inevitable order of execution, and the execution order of the processes should be determined by their functions and inherent logic, and should not constitute any limitation on the implementation process of the embodiments of the present invention.
In addition, the terms "system" and "network" are often used interchangeably herein. It should be understood that the term "and/or" herein is merely one type of association relationship describing an associated object, meaning that three relationships may exist, for example, a and/or B, may mean: a exists alone, A and B exist simultaneously, and B exists alone. In addition, the character "/" herein generally indicates that the former and latter related objects are in an "or" relationship.
In the embodiments provided herein, it should be understood that "B corresponding to a" means that B is associated with a from which B can be determined. It should also be understood, however, that determining B from a does not mean determining B from a alone, but may also be determined from a and/or other information.
It will be understood by those skilled in the art that all or part of the steps in the methods of the embodiments described above may be implemented by a program instructing associated hardware, and the program may be stored in a computer-readable storage medium. The storage medium includes a Read-Only Memory (ROM), a Random Access Memory (RAM), a Programmable Read-Only Memory (PROM), an Erasable Programmable Read-Only Memory (EPROM), a One-time Programmable Read-Only Memory (OTPROM), an Electrically Erasable Programmable Read-Only Memory (EEPROM), a Compact Disc Read-Only Memory (CD-ROM) or other optical disc storage, a magnetic disk, a magnetic tape, or any other medium that can be used to carry or store data and that can be read by a computer.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated units, if implemented as software functional units and sold or used as stand-alone products, may be stored in a computer-accessible memory. Based on such understanding, the technical solution of the present invention, in essence the part that contributes over the prior art, or all or part of the technical solution, can be embodied in the form of a software product, which is stored in a memory and includes several requests for causing a computer device (which may be a personal computer, a server, or a network device, and may specifically be a processor in the computer device) to execute part or all of the steps of the above method of each embodiment of the present invention.
The voice detection method and the learning device disclosed by the embodiments of the present invention are described in detail above. Specific examples are used herein to explain the principle and implementation of the invention, and the description of the embodiments is only intended to help understand the method and its core idea. Meanwhile, for a person skilled in the art, there may be variations in the specific embodiments and the application scope according to the idea of the present invention. In summary, the content of this specification should not be construed as a limitation of the present invention.

Claims (10)

1. A method for speech detection, the method comprising:
collecting a target sound in the environment where a learning device is located when it is detected that a user's finger is in contact with a tap-to-read page;
acquiring text information corresponding to the contact position of the user's finger on the tap-to-read page, and determining a standard pronunciation of the text information;
selecting a speech segment matching the standard pronunciation from the target sound;
calculating a target similarity between the speech segment and the standard pronunciation, and detecting whether a target speech segment whose target similarity is greater than a preset similarity exists;
and if so, determining that the user's pronunciation when spelling the text information is standard.
2. The method of claim 1, wherein, after detecting that the target speech segment does not exist, the method further comprises:
determining that the user's pronunciation when spelling the text information is incorrect, and acquiring pronunciation guide information corresponding to the text information;
outputting the pronunciation guide information through a display screen of the learning device, and outputting the standard pronunciation through a loudspeaker of the learning device;
and the method further comprises:
when a read-after voice input by the user is received, detecting a current similarity between the read-after voice and the standard pronunciation;
and when it is detected that the current similarity is greater than the preset similarity, determining that the pronunciation of the read-after voice is standard.
3. The method according to claim 1 or 2, wherein the collecting the target sound in the environment where the learning device is located when it is detected that the user's finger is in contact with the tap-to-read page comprises:
when it is detected that the user's finger comes into contact with the tap-to-read page, turning on an audio acquisition device of the learning device so that the audio acquisition device collects sound in the environment where the learning device is located;
and when it is detected that the user's finger is no longer in contact with the tap-to-read page, turning off the audio acquisition device to obtain the target sound collected by the audio acquisition device.
4. The method according to any one of claims 1 to 3, wherein the acquiring the text information corresponding to the contact position of the user's finger on the tap-to-read page and determining the standard pronunciation of the text information comprises:
acquiring a contact trajectory of the user's finger on the tap-to-read page;
identifying, in the tap-to-read page, contact coordinates corresponding to the contact trajectory;
acquiring the text information corresponding to the contact coordinates in the tap-to-read page;
and determining the standard pronunciation matching the text information from a preset voice library.
5. The method according to any one of claims 1 to 4, wherein the selecting the speech segment matching the standard pronunciation from the target sound comprises:
generating a first waveform spectrogram corresponding to the target sound, and generating a second waveform spectrogram corresponding to the standard pronunciation;
selecting a target waveform spectrum segment matching the second waveform spectrogram from the first waveform spectrogram;
and determining, from the target sound, the speech segment corresponding to the target waveform spectrum segment.
6. A learning device, comprising:
an acquisition unit, configured to collect a target sound in the environment where the learning device is located when it is detected that a user's finger is in contact with a tap-to-read page;
a first acquisition unit, configured to acquire text information corresponding to the contact position of the user's finger on the tap-to-read page and to determine a standard pronunciation of the text information;
a selecting unit, configured to select a speech segment matching the standard pronunciation from the target sound;
a first detecting unit, configured to calculate a target similarity between the speech segment and the standard pronunciation and to detect whether a target speech segment whose target similarity is greater than a preset similarity exists;
and a first determining unit, configured to determine that the user's pronunciation when spelling the text information is standard when the detection result of the first detecting unit is positive.
7. The learning device according to claim 6, further comprising:
a second acquisition unit, configured to, when the detection result of the first detecting unit is negative, determine that the user's pronunciation when spelling the text information is incorrect and acquire pronunciation guide information corresponding to the text information;
an output unit, configured to output the pronunciation guide information through a display screen of the learning device and to output the standard pronunciation through a loudspeaker of the learning device;
and the learning device further comprises:
a second detecting unit, configured to detect, when a read-after voice input by the user is received, a current similarity between the read-after voice and the standard pronunciation;
and a second determining unit, configured to determine that the pronunciation of the read-after voice is standard when it is detected that the current similarity is greater than the preset similarity.
8. The learning device according to claim 6 or 7, wherein the collecting unit comprises:
a starting subunit, configured to start an audio collecting device of the learning device when it is detected that the user's finger is in contact with the touch-and-read page, so that the audio collecting device collects the sound in the environment where the learning device is located;
and a stopping subunit, configured to stop the audio collecting device when it is detected that the user's finger is no longer in contact with the touch-and-read page, so as to obtain the target sound collected by the audio collecting device.
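The touch-gated capture of claim 8 amounts to a small state machine: touch-down starts the recorder, audio accumulates while the finger stays on the page, and touch-up stops capture and yields the target sound. The class below simulates the audio hardware with a plain buffer; the class and method names are assumptions for illustration only.

```python
class TouchTriggeredRecorder:
    """Sketch of claim 8's collecting unit: recording runs only between a
    touch-down and the matching touch-up on the touch-and-read page."""

    def __init__(self):
        self.recording = False
        self._buffer = []

    def on_touch_down(self):
        """Finger contacts the page: start the audio collecting device."""
        self.recording = True
        self._buffer = []

    def feed(self, samples):
        """Audio frames arrive continuously; keep them only while recording."""
        if self.recording:
            self._buffer.extend(samples)

    def on_touch_up(self):
        """Finger leaves the page: stop capture and return the target sound."""
        self.recording = False
        return list(self._buffer)
```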
9. The learning device according to any one of claims 6 to 8, wherein the first acquiring unit comprises:
a first acquiring subunit, configured to acquire a contact track of the user's finger on the touch-and-read page;
an identifying subunit, configured to identify the contact coordinates corresponding to the contact track on the touch-and-read page;
a second acquiring subunit, configured to acquire the text information corresponding to the contact coordinates on the touch-and-read page;
and a first determining subunit, configured to determine, from a preset voice library, the standard pronunciation matching the text information.
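The coordinate-to-text lookup in claim 9 can be sketched as a hit test of the contact track against per-word bounding boxes on the page, picking the word that most of the track falls inside. The region layout and function name here are hypothetical; a real touch-and-read device would carry this mapping in its page data.

```python
from collections import Counter

def text_at(touch_points, regions):
    """Map a contact track (list of (x, y) points) to the word whose page
    bounding box contains the most track points.

    `regions` maps word -> (x0, y0, x1, y1); returns None on no hit."""
    hits = Counter()
    for x, y in touch_points:
        for word, (x0, y0, x1, y1) in regions.items():
            if x0 <= x <= x1 and y0 <= y <= y1:
                hits[word] += 1
    return hits.most_common(1)[0][0] if hits else None
```

The determined word would then key into the preset voice library (e.g. a dict of word to recorded standard pronunciation) to fetch the reference audio.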
10. The learning device according to any one of claims 6 to 9, wherein the selecting unit comprises:
a generating subunit, configured to generate a first waveform spectrogram corresponding to the target sound and a second waveform spectrogram corresponding to the standard pronunciation;
a selecting subunit, configured to select, from the first waveform spectrogram, a target waveform spectrogram segment that matches the second waveform spectrogram;
and a second determining subunit, configured to determine, from the target sound, a voice segment corresponding to the target waveform spectrogram segment.
CN201910459781.3A 2019-05-29 2019-05-29 Voice detection method and learning device Active CN111081080B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910459781.3A CN111081080B (en) 2019-05-29 2019-05-29 Voice detection method and learning device

Publications (2)

Publication Number Publication Date
CN111081080A true CN111081080A (en) 2020-04-28
CN111081080B CN111081080B (en) 2022-05-03

Family

ID=70310365

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910459781.3A Active CN111081080B (en) 2019-05-29 2019-05-29 Voice detection method and learning device

Country Status (1)

Country Link
CN (1) CN111081080B (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111710340A (en) * 2020-06-05 2020-09-25 深圳市卡牛科技有限公司 Method, device, server and storage medium for identifying user identity based on voice
CN111968424A (en) * 2020-08-27 2020-11-20 北京大米科技有限公司 Interactive learning method, device, system and computer storage medium
CN112885168A (en) * 2021-01-21 2021-06-01 绍兴市人民医院 Immersive speech feedback training system based on AI
CN114935988A (en) * 2022-06-06 2022-08-23 深圳创维-Rgb电子有限公司 Auxiliary learning method, device and equipment based on screen saver and readable storage medium

Citations (21)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120164613A1 * 2007-11-07 2012-06-28 Jung Edward K Y Determining a demographic characteristic based on computational user-health testing of a user interaction with advertiser-specified content
CN103680497A * 2012-08-31 2014-03-26 Baidu Online Network Technology (Beijing) Co., Ltd. Video-based voice recognition system and method
CN204423660U * 2015-01-29 2015-06-24 Shanxi College of Traditional Chinese Medicine English point-and-read device
CN105374248A * 2015-11-30 2016-03-02 Guangdong Genius Technology Co., Ltd. Pronunciation correction method, device and system
CN105892786A * 2015-01-16 2016-08-24 Zhang Kai Method for achieving text selection on a touchscreen interface
CN106531182A * 2016-12-16 2017-03-22 Shanghai Phicomm Communication Co., Ltd. Language learning system
CN106603801A * 2015-10-15 2017-04-26 ZTE Corporation Conversation recording method and device
CN106648436A * 2016-12-30 2017-05-10 Vivo Mobile Communication Co., Ltd. Text message processing method and mobile terminal
CN107067834A * 2017-03-17 2017-08-18 Maipian Technology (Shenzhen) Co., Ltd. Point-and-read system with oral evaluation function
CN107481720A * 2017-06-30 2017-12-15 Baidu Online Network Technology (Beijing) Co., Ltd. Explicit voiceprint recognition method and device
CN107680419A * 2017-10-09 2018-02-09 Jilin Engineering Normal University Intelligent Mandarin training device
CN108053724A * 2017-12-07 2018-05-18 Wang Peng Speech recognition system for automatically judging a learner's pronunciation of new English words and sentences
CN207977012U * 2017-10-30 2018-10-16 Huaying Hongjun Primary School Intelligent English teaching aid
US20180366111A1 * 2017-06-16 2018-12-20 Hankuk University Of Foreign Studies Research & Business Foundation Method for automatic evaluation of non-native pronunciation
CN109271480A * 2018-08-30 2019-01-25 Guangdong Genius Technology Co., Ltd. Voice-based question search method and electronic device
CN109410934A * 2018-10-19 2019-03-01 Shenzhen Moting Culture Technology Co., Ltd. Voiceprint-feature-based multi-speaker voice separation method, system and intelligent terminal
CN109410673A * 2018-11-01 2019-03-01 Wen Zhijie Interactive learning method and system
CN109545015A * 2019-01-23 2019-03-29 Guangdong Genius Technology Co., Ltd. Subject-type recognition method and tutoring device
CN109637286A * 2019-01-16 2019-04-16 Guangdong Genius Technology Co., Ltd. Image-recognition-based oral training method and tutoring device
CN109671309A * 2018-12-12 2019-04-23 Guangdong Genius Technology Co., Ltd. Mispronunciation recognition method and electronic device
CN109697988A * 2017-10-20 2019-04-30 Shenzhen Yingshuo Audio Technology Co., Ltd. Speech evaluation method and device

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Li Yuqin et al., "Research on the Application of Voiceprint Recognition Technology in Dispatch Recording Analysis", Electronic World *

Also Published As

Publication number Publication date
CN111081080B (en) 2022-05-03

Similar Documents

Publication Publication Date Title
CN111081080B (en) Voice detection method and learning device
CN110085261B (en) Pronunciation correction method, device, equipment and computer readable storage medium
CN109346059B (en) Dialect voice recognition method and electronic equipment
CN109410664B (en) Pronunciation correction method and electronic equipment
JP6394709B2 (en) SPEAKER IDENTIFYING DEVICE AND FEATURE REGISTRATION METHOD FOR REGISTERED SPEECH
CN109446315B (en) Question solving auxiliary method and question solving auxiliary client
CN109545184B (en) Recitation detection method based on voice calibration and electronic equipment
CN111078083A (en) Method for determining click-to-read content and electronic equipment
US9691389B2 (en) Spoken word generation method and system for speech recognition and computer readable medium thereof
US20170076626A1 (en) System and Method for Dynamic Response to User Interaction
CN109086455B (en) Method for constructing voice recognition library and learning equipment
EP3791388A1 (en) Computer implemented method and apparatus for recognition of speech patterns and feedback
CN110503941B (en) Language ability evaluation method, device, system, computer equipment and storage medium
KR101242182B1 (en) Apparatus for voice recognition and method for the same
CN111026786A (en) Dictation list generation method and family education equipment
CN111081092B (en) Learning content output method and learning equipment
CN108877773B (en) Voice recognition method and electronic equipment
CN111079501A (en) Character recognition method and electronic equipment
CN109035896B (en) Oral training method and learning equipment
CN111091821B (en) Control method based on voice recognition and terminal equipment
CN111077990B (en) Method for determining content to be read on spot and learning equipment
CN111951827B (en) Continuous reading identification correction method, device, equipment and readable storage medium
CN111081227B (en) Recognition method of dictation content and electronic equipment
CN111091008A (en) Method for determining dictation prompt content and electronic equipment
CN111755026B (en) Voice recognition method and system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant