CN114067787B - Voice speech speed self-adaptive recognition system - Google Patents

Voice speech speed self-adaptive recognition system

Info

Publication number
CN114067787B
Authority
CN
China
Prior art keywords
voice
character
sentence
interval
time
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202111547185.4A
Other languages
Chinese (zh)
Other versions
CN114067787A (en)
Inventor
邹月荣
李权
汪张龙
郭清霞
李艳
许东生
杜平
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangdong Xunfei Qiming Technology Development Co ltd
Original Assignee
Guangdong Xunfei Qiming Technology Development Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangdong Xunfei Qiming Technology Development Co ltd filed Critical Guangdong Xunfei Qiming Technology Development Co ltd
Priority to CN202111547185.4A
Publication of CN114067787A
Application granted
Publication of CN114067787B
Legal status: Active (current)
Anticipated expiration

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/06 - Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/065 - Adaptation
    • G10L15/07 - Adaptation to the speaker
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/04 - Segmentation; Word boundary detection
    • G10L15/05 - Word boundary detection
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/26 - Speech to text systems
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Artificial Intelligence (AREA)
  • Machine Translation (AREA)

Abstract

The invention provides a voice speech-rate adaptive recognition system comprising a user input module and an adaptive processing module. The user input module receives the voice information input by a user. The adaptive processing module comprises a voice conversion unit, a character dividing unit, an analysis unit, and an adaptive processing unit: the voice conversion unit converts the voice information input by the user into text information; the character dividing unit divides the converted text information into individual characters; and the analysis unit analyzes the divided characters to obtain the parameters of the divided text information. The invention adapts recognition to the speech rates of different users, improving the overall effectiveness of voice conversion for different users and addressing the poor speech-rate adaptation of existing voice recognition.

Description

Voice speech speed self-adaptive recognition system
Technical Field
The invention relates to the technical field of speech rate recognition, and in particular to a voice speech-rate adaptive recognition system.
Background
Speech recognition is a cross-disciplinary field. Over the last two decades, speech recognition technology has made significant progress and has begun to move from the laboratory to the market. Over the next ten years, it is expected to enter fields such as industry, home appliances, communications, automotive electronics, medical treatment, home services, and consumer electronics. Speech recognition has been likened to "the auditory system of the machine": it is a technology that lets machines convert speech signals into corresponding text or commands through recognition and understanding. Speech recognition technology mainly comprises three aspects: feature extraction, pattern matching criteria, and model training.
In the prior art, because each person's speaking habits differ, the pause points within sentences and the speaking rate vary from person to person. Existing voice recognition systems cannot perform accurate recognition and conversion based on these characteristics, so characters may be dropped during recognition and conversion, and the final semantics of the text produced from the voice may be wrong.
Disclosure of Invention
In view of the deficiencies of the prior art, the invention aims to provide a voice speech-rate adaptive recognition system that can adapt recognition to the speech rates of different users, thereby improving the overall effectiveness of voice conversion for different users and solving the problem that existing voice recognition adapts poorly to speech rate.
In order to achieve this purpose, the invention is realized by the following technical scheme: a voice speech-rate adaptive recognition system comprises a user input module and an adaptive processing module; the user input module is used for a user to input voice information;
the adaptive processing module comprises a voice conversion unit, a character dividing unit, an analysis unit, and an adaptive processing unit;
the voice conversion unit is used to convert the voice information input by the user into text information;
the character dividing unit is used to divide the converted text information into individual characters;
the analysis unit analyzes the divided characters to obtain the parameters of the divided text information;
and the adaptive processing unit performs adaptive recognition processing on the user's speech rate according to the parameters of the divided text information.
Further, the character dividing unit is configured with a character division strategy, which comprises: labeling the characters in the text information in sequence; demarcating the duration of the voice input from the input start time to the input end time; mapping the labeled characters in sequence onto this duration; taking each character's mapped time as its time mark; and using the time marks as the boundaries between characters.
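The following Python sketch illustrates this character division strategy. It is a minimal illustration only: the function name is invented, and it assumes characters are spread evenly across the input span, whereas a real system would take per-character timestamps from the voice conversion unit.

# Minimal sketch of the character division strategy described above.
# Assumption: characters are spread evenly across the input time span; a
# real system would use per-character timestamps from the recognizer.
def time_mark_characters(text, input_start, input_end):
    """Assign each character a time mark inside [input_start, input_end]."""
    step = (input_end - input_start) / len(text)
    # a character's time mark also serves as its boundary with the next one
    return [(ch, input_start + (i + 1) * step) for i, ch in enumerate(text)]

if __name__ == "__main__":
    for ch, mark in time_mark_characters("今天天气很好", 0.0, 3.0):
        print(f"{ch}  time mark = {mark:.2f} s")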
Further, the analysis unit is configured with a speech rate analysis strategy, which comprises: obtaining the interval duration between every two adjacent characters, where the interval duration is the time mark of the later character minus the time mark of the earlier character, and placing the interval durations into a time interval set;
placing interval durations less than or equal to the interval threshold into a first duration set, interval durations greater than the interval threshold and less than or equal to twice the interval threshold into a second duration set, and interval durations greater than twice the interval threshold into a third duration set;
selecting whichever of the first duration set, the second duration set, and the third duration set contains the most interval durations as the speech rate identification set;
and substituting the interval durations in the speech rate identification set into a speech rate formula to obtain a speech rate value.
Further, the speech rate formula is configured as:
Vys = a1 × (T1 + T2 + … + Tn) / n;
where Vys is the speech rate value, T1 to Tn are the interval durations in the speech rate identification set, n is the number of interval durations in the speech rate identification set, a1 is the conversion ratio of the speech rate value, and a1 is greater than zero.
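A minimal Python sketch of the speech rate analysis strategy follows. It assumes the formula reconstruction above (the original formula image is not reproduced in this text); the bucketing follows the one-times/two-times threshold rule of the strategy, and the function name and example time marks are invented for illustration.

# Sketch of the speech rate analysis strategy, assuming the reconstructed
# formula Vys = a1 * (T1 + ... + Tn) / n. The interval threshold yjg is an
# input here; its own formula appears further below.
def speech_rate_value(time_marks, yjg, a1=1.0):
    # interval duration = later character's time mark minus the earlier one's
    intervals = [b - a for a, b in zip(time_marks, time_marks[1:])]
    first, second, third = [], [], []
    for t in intervals:
        if t <= yjg:
            first.append(t)
        elif t <= 2 * yjg:
            second.append(t)
        else:
            third.append(t)
    # the set holding the most interval durations is the identification set
    ident = max((first, second, third), key=len)
    vys = a1 * sum(ident) / len(ident)
    return vys, third  # the third set later marks sentence breaks

if __name__ == "__main__":
    marks = [0.3, 0.6, 0.9, 1.2, 2.6, 2.9, 3.2]
    vys, breaks = speech_rate_value(marks, yjg=0.4)
    print(f"Vys = {vys:.2f}")  # five 0.3 s intervals -> Vys = 0.30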
Further, the interval threshold is calculated by an interval threshold formula, which is configured as:
Yjg = b1 × Tz / Sz;
where Yjg is the interval threshold, Sz is the number of characters in the text information, Tz is the duration of the voice information, b1 is the interval threshold coefficient, and b1 is greater than zero.
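As a worked example of this reconstruction: a 10-second voice input that converts into 20 characters, with b1 = 1, gives an interval threshold of Yjg = 1 × 10 / 20 = 0.5 seconds, so character intervals of up to 0.5 s fall into the first duration set.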
Further, the analysis unit is also configured with a sentence habit analysis strategy, which comprises: obtaining the interval durations in the third duration set, using them as the separation points between sentences, dividing the text information into sentences at those points, and counting the number of characters in each sentence;
placing sentences whose character count is less than or equal to the character threshold into a first character count set;
placing sentences whose character count is greater than the character threshold and less than or equal to twice the character threshold into a second character count set;
placing sentences whose character count is greater than twice the character threshold into a third character count set;
selecting whichever of the first character count set, the second character count set, and the third character count set contains the most data as the sentence habit reference set; and substituting the character counts in the sentence habit reference set into a sentence habit formula to obtain a sentence habit value.
Further, the sentence habit formula is configured as:
Pyx = c1 × (Sl1 + Sl2 + … + Slm) / m;
where Pyx is the sentence habit value, Sl1 to Slm are the character counts in the sentence habit reference set, m is the number of data in the sentence habit reference set, c1 is the sentence conversion reference value, and c1 is greater than zero.
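The sketch below illustrates the sentence habit analysis strategy under the same assumptions, taking the sentence habit value as c1 times the mean character count of the fullest set (a plausible reading, since the formula image is not reproduced here); the example sentence lengths and threshold are invented.

# Sketch of the sentence habit analysis strategy, assuming the reconstructed
# formula Pyx = c1 * (Sl1 + ... + Slm) / m. Sentence lengths are character
# counts obtained after breaking the text at third-duration-set intervals.
def sentence_habit_value(sentence_lengths, ywz, c1=1.0):
    first, second, third = [], [], []
    for count in sentence_lengths:
        if count <= ywz:
            first.append(count)
        elif count <= 2 * ywz:
            second.append(count)
        else:
            third.append(count)
    # the set with the most data is the sentence habit reference set
    reference = max((first, second, third), key=len)
    return c1 * sum(reference) / len(reference)

if __name__ == "__main__":
    print(sentence_habit_value([5, 7, 6, 14, 6], ywz=7.6))  # -> 6.0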
Further, the character threshold is calculated by a character threshold formula, which is configured as:
Ywz = d1 × Wz / Yz;
where Ywz is the character threshold, Wz is the total number of characters in the text information, Yz is the total number of sentences in the text information after sentence breaking, d1 is the character threshold conversion ratio, and d1 is greater than zero.
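For example, under this reconstruction, text information of 38 characters that breaks into 5 sentences, with d1 = 1, gives a character threshold of Ywz = 1 × 38 / 5 = 7.6 characters per sentence.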
Further, the adaptive processing unit is configured with an adaptive processing strategy, which comprises: substituting the user's speech rate value and sentence habit value into an adaptive processing formula to obtain a voice input similarity value;
marking the input voice as fast speech when its voice input similarity value is greater than twice the voice threshold;
marking the input voice as slow speech when its voice input similarity value is less than the voice threshold;
and marking the input voice as normal recognition speech when its voice input similarity value is greater than or equal to the voice threshold and less than or equal to twice the voice threshold.
Further, the adaptive processing formula is configured as: Pxs = k1 × Vys + k2 × Pyx; where Pxs is the voice input similarity value, k1 is the speech rate conversion value, k2 is the sentence habit conversion value, and k1 and k2 are both greater than zero.
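A short Python sketch of the adaptive processing strategy follows. The linear formula is stated in the text, but the weights k1 and k2 and the voice threshold used in the example are placeholder assumptions; the patent does not specify how they are chosen.

# Sketch of the adaptive processing strategy. Pxs = k1*Vys + k2*Pyx is from
# the text; the weights and voice threshold below are placeholder values.
def classify_speech(vys, pyx, k1, k2, voice_threshold):
    pxs = k1 * vys + k2 * pyx  # voice input similarity value
    if pxs > 2 * voice_threshold:
        label = "fast speech"
    elif pxs < voice_threshold:
        label = "slow speech"
    else:
        label = "normal recognition speech"
    return pxs, label

if __name__ == "__main__":
    pxs, label = classify_speech(vys=0.3, pyx=6.0, k1=0.5, k2=0.2,
                                 voice_threshold=1.0)
    print(f"Pxs = {pxs:.2f} -> {label}")  # 1.35 -> normal recognition speech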
The invention has the following beneficial effects: the adaptive processing module comprises a voice conversion unit, a character dividing unit, an analysis unit, and an adaptive processing unit. The voice conversion unit first converts the voice information input by the user into text information; the character dividing unit divides the converted text information into individual characters; the analysis unit analyzes the divided characters to obtain the parameters of the divided text information; and finally the adaptive processing unit performs adaptive recognition processing on the user's speech rate according to those parameters. Voice recognition can therefore follow each user's speech rate and sentence habits, improving recognition accuracy and ensuring the accuracy of voice-to-semantics conversion.
Drawings
Other features, objects and advantages of the invention will become more apparent upon reading of the detailed description of non-limiting embodiments with reference to the following drawings:
FIG. 1 is a schematic diagram of the connection between the system of the present invention and the user end;
FIG. 2 is a block diagram of the modules of the present invention.
In the figure: 1. an identification system; 11. a user input module; 12. an adaptive processing module; 121. a voice conversion unit; 122. a character dividing unit; 123. an analysis unit; 124. an adaptive processing unit.
Detailed Description
To make the technical means, creative features, objectives, and effects of the invention easy to understand, the invention is further described below with reference to specific embodiments.
Referring to FIG. 1 and FIG. 2, in a voice speech-rate adaptive recognition system, the recognition system 1 comprises a user input module 11 and an adaptive processing module 12; the user input module 11 is used for a user to input voice information; the user input module 11 is communicatively connected to the user end 2, and the user inputs the voice information through the user end 2;
the adaptive processing module 12 includes a voice conversion unit 121, a text division unit 122, an analysis unit 123, and an adaptive processing unit 124;
the voice conversion unit 121 is configured to convert voice information input by a user into text information;
the text dividing unit 122 is configured to divide the converted text information into independent texts; the word segmentation unit 122 is configured with a word segmentation policy, which includes: the method comprises the steps of calibrating characters in character information in sequence, carrying out time length demarcation on voice input from input starting time to input ending time, sequentially corresponding the calibrated characters to the time length of the input voice, taking the time length as a time mark of the characters, and taking the time mark as a demarcation limit of each character. By corresponding the characters to the time tracks, the subsequent character duration division processing can be facilitated.
The analysis unit 123 analyzes the divided characters to obtain the parameters of the divided text information. The analysis unit 123 is configured with a speech rate analysis strategy, which comprises: obtaining the interval duration between every two adjacent characters, where the interval duration is the time mark of the later character minus the time mark of the earlier character, and placing the interval durations into a time interval set;
placing interval durations less than or equal to the interval threshold into a first duration set, interval durations greater than the interval threshold and less than or equal to twice the interval threshold into a second duration set, and interval durations greater than twice the interval threshold into a third duration set;
selecting whichever of the first duration set, the second duration set, and the third duration set contains the most interval durations as the speech rate identification set. Because this set contains the most interval durations, it best represents the interval between two characters that the user connects normally, so calculating the speech rate value from its data is the most reasonable choice.
The interval durations in the speech rate identification set are then substituted into a speech rate formula to obtain a speech rate value.
The speech rate formula is configured as:
Vys = a1 × (T1 + T2 + … + Tn) / n;
where Vys is the speech rate value, T1 to Tn are the interval durations in the speech rate identification set, n is the number of interval durations in the speech rate identification set, a1 is the conversion ratio of the speech rate value, and a1 is greater than zero. A speech rate value calculated from the interval durations in the speech rate identification set more accurately represents the user's speech rate, namely the typical duration between two characters the user connects normally.
The interval threshold is calculated by an interval threshold formula, which is configured as:
Yjg = b1 × Tz / Sz;
where Yjg is the interval threshold, Sz is the number of characters in the text information, Tz is the duration of the voice information, b1 is the interval threshold coefficient, and b1 is greater than zero. Because the interval threshold is calculated from the number of characters in the user's text information and the duration of the voice information, it is not a fixed value but is derived from each user's own characteristics, so dividing with this threshold better highlights each user's speech rate characteristics.
The analysis unit 123 is further configured with a sentence habit analysis strategy, which comprises: obtaining the interval durations in the third duration set, using them as the separation points between sentences, dividing the text information into sentences at those points, and counting the number of characters in each sentence;
placing sentences whose character count is less than or equal to the character threshold into a first character count set;
placing sentences whose character count is greater than the character threshold and less than or equal to twice the character threshold into a second character count set;
placing sentences whose character count is greater than twice the character threshold into a third character count set;
and selecting whichever of the first character count set, the second character count set, and the third character count set contains the most data as the sentence habit reference set. The data in the sentence habit reference set more accurately represent the user's speaking habits: some users pause only after a long sentence, while others habitually pause after just a few characters, so this feature is important reference data for recognizing the user's speech. The character counts in the sentence habit reference set are then substituted into a sentence habit formula to obtain a sentence habit value.
The sentence habit formula is configured as:
Pyx = c1 × (Sl1 + Sl2 + … + Slm) / m;
where Pyx is the sentence habit value, Sl1 to Slm are the character counts in the sentence habit reference set, m is the number of data in the sentence habit reference set, c1 is the sentence conversion reference value, and c1 is greater than zero. Calculating from the character counts in the sentence habit reference set more accurately represents the user's sentence habits.
The character threshold is calculated by a character threshold formula, which is configured as:
Ywz = d1 × Wz / Yz;
where Ywz is the character threshold, Wz is the total number of characters in the text information, Yz is the total number of sentences in the text information after sentence breaking, d1 is the character threshold conversion ratio, and d1 is greater than zero. Because the character threshold is calculated from the total number of characters in the text information and the total number of sentences after sentence breaking, it is not a fixed value but is derived from each user's sentence habits, so dividing with this threshold better ensures the accuracy of the division.
The adaptive processing unit 124 performs adaptive recognition processing on the user's speech rate according to the parameters of the divided text information. The adaptive processing unit 124 is configured with an adaptive processing strategy, which comprises: substituting the user's speech rate value and sentence habit value into an adaptive processing formula to obtain a voice input similarity value;
marking the input voice as fast speech when its voice input similarity value is greater than twice the voice threshold;
marking the input voice as slow speech when its voice input similarity value is less than the voice threshold;
and marking the input voice as normal recognition speech when its voice input similarity value is greater than or equal to the voice threshold and less than or equal to twice the voice threshold.
The adaptive processing formula is configured as: Pxs = k1 × Vys + k2 × Pyx; where Pxs is the voice input similarity value, k1 is the speech rate conversion value, k2 is the sentence habit conversion value, and k1 and k2 are both greater than zero. The voice input similarity value is calculated from the user's speech rate value and sentence habit value, where k1 is the weight of the speech rate value and k2 is the weight of the sentence habit value in the similarity value.
The working principle is as follows: the user inputs voice through the user input module 11, and the input voice is transmitted to the adaptive processing module 12 for processing. The voice conversion unit 121 first converts the input voice information into text information; the character dividing unit 122 divides the converted text information into individual characters; the analysis unit 123 analyzes the divided characters to obtain the parameters of the divided text information; and finally the adaptive processing unit 124 performs adaptive recognition processing on the user's speech rate according to those parameters, so that voice recognition follows each user's speech rate and sentence habits and recognition accuracy is improved.
Finally, it should be noted that the above embodiments are only specific embodiments of the invention, used to illustrate rather than limit its technical solutions, and the protection scope of the invention is not limited to them. Although the invention has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that anyone familiar with the art may still modify the technical solutions described in the foregoing embodiments, or substitute equivalents for some of their technical features, within the technical scope of the present disclosure; such modifications, changes, or substitutions do not depart from the spirit and scope of the embodiments of the invention and should be construed as falling within it. Therefore, the protection scope of the invention shall be subject to the protection scope of the appended claims.

Claims (5)

1. A voice speech-rate adaptive recognition system, characterized in that said recognition system (1) comprises a user input module (11) and an adaptive processing module (12); the user input module (11) is used for a user to input voice information;
the adaptive processing module (12) comprises a voice conversion unit (121), a character dividing unit (122), an analysis unit (123), and an adaptive processing unit (124);
the voice conversion unit (121) is used to convert the voice information input by the user into text information;
the character dividing unit (122) is used to divide the converted text information into individual characters;
the analysis unit (123) analyzes the divided characters to obtain the parameters of the divided text information;
the adaptive processing unit (124) performs adaptive recognition processing on the user's speech rate according to the parameters of the divided text information;
the character dividing unit (122) is configured with a character division strategy, which comprises: labeling the characters in the text information in sequence; demarcating the duration of the voice input from the input start time to the input end time; mapping the labeled characters in sequence onto this duration; taking each character's mapped time as its time mark; and using the time marks as the boundaries between characters;
the analysis unit (123) is configured with a speech rate analysis strategy, which comprises: obtaining the interval duration between every two adjacent characters, where the interval duration is the time mark of the later character minus the time mark of the earlier character, and placing the interval durations into a time interval set;
placing interval durations less than or equal to the interval threshold into a first duration set, interval durations greater than the interval threshold and less than or equal to twice the interval threshold into a second duration set, and interval durations greater than twice the interval threshold into a third duration set;
selecting whichever of the first duration set, the second duration set, and the third duration set contains the most interval durations as the speech rate identification set;
substituting the interval durations in the speech rate identification set into a speech rate formula to obtain a speech rate value;
the analysis unit (123) is further configured with a sentence habit analysis strategy, which comprises: obtaining the interval durations in the third duration set, using them as the separation points between sentences, dividing the text information into sentences at those points, and counting the number of characters in each sentence;
placing sentences whose character count is less than or equal to the character threshold into a first character count set;
placing sentences whose character count is greater than the character threshold and less than or equal to twice the character threshold into a second character count set;
placing sentences whose character count is greater than twice the character threshold into a third character count set;
selecting whichever of the first character count set, the second character count set, and the third character count set contains the most data as the sentence habit reference set; and substituting the character counts in the sentence habit reference set into a sentence habit formula to obtain a sentence habit value;
the character threshold is calculated by a character threshold formula, which is configured as:
Ywz = d1 × Wz / Yz;
where Ywz is the character threshold, Wz is the total number of characters in the text information, Yz is the total number of sentences in the text information after sentence breaking, d1 is the character threshold conversion ratio, and d1 is greater than zero;
the adaptive processing unit (124) is configured with an adaptive processing strategy, which comprises: substituting the user's speech rate value and sentence habit value into an adaptive processing formula to obtain a voice input similarity value;
marking the input voice as fast speech when its voice input similarity value is greater than twice the voice threshold;
marking the input voice as slow speech when its voice input similarity value is less than the voice threshold;
and marking the input voice as normal recognition speech when its voice input similarity value is greater than or equal to the voice threshold and less than or equal to twice the voice threshold.
2. The voice speech-rate adaptive recognition system according to claim 1, wherein the speech rate formula is configured as:
Vys = a1 × (T1 + T2 + … + Tn) / n;
where Vys is the speech rate value, T1 to Tn are the interval durations in the speech rate identification set, n is the number of interval durations in the speech rate identification set, a1 is the conversion ratio of the speech rate value, and a1 is greater than zero.
3. The voice speech-rate adaptive recognition system according to claim 1, wherein the interval threshold is calculated by an interval threshold formula, which is configured as:
Yjg = b1 × Tz / Sz;
where Yjg is the interval threshold, Sz is the number of characters in the text information, Tz is the duration of the voice information, b1 is the interval threshold coefficient, and b1 is greater than zero.
4. The voice speech-rate adaptive recognition system according to claim 1, wherein the sentence habit formula is configured as:
Pyx = c1 × (Sl1 + Sl2 + … + Slm) / m;
where Pyx is the sentence habit value, Sl1 to Slm are the character counts in the sentence habit reference set, m is the number of data in the sentence habit reference set, c1 is the sentence conversion reference value, and c1 is greater than zero.
5. The voice speech-rate adaptive recognition system according to claim 1, wherein the adaptive processing formula is configured as: Pxs = k1 × Vys + k2 × Pyx; where Pxs is the voice input similarity value, k1 is the speech rate conversion value, k2 is the sentence habit conversion value, and k1 and k2 are both greater than zero.
CN202111547185.4A 2021-12-17 2021-12-17 Voice speech speed self-adaptive recognition system Active CN114067787B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111547185.4A CN114067787B (en) 2021-12-17 2021-12-17 Voice speech speed self-adaptive recognition system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111547185.4A CN114067787B (en) 2021-12-17 2021-12-17 Voice speech speed self-adaptive recognition system

Publications (2)

Publication Number Publication Date
CN114067787A CN114067787A (en) 2022-02-18
CN114067787B (en) 2022-07-05

Family

ID=80229756

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111547185.4A Active CN114067787B (en) 2021-12-17 2021-12-17 Voice speech speed self-adaptive recognition system

Country Status (1)

Country Link
CN (1) CN114067787B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110060665A (en) * 2019-03-15 2019-07-26 上海拍拍贷金融信息服务有限公司 Word speed detection method and device, and readable storage medium
CN110164420A (en) * 2018-08-02 2019-08-23 腾讯科技(深圳)有限公司 A kind of method and device of the method for speech recognition, voice punctuate

Family Cites Families (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2001184100A (en) * 1999-12-24 2001-07-06 Anritsu Corp Speaking speed converting device
JP2003066991A (en) * 2001-08-22 2003-03-05 Seiko Epson Corp Method and apparatus for outputting voice recognition result and recording medium with program for outputting and processing voice recognition result recorded thereon
US7412378B2 (en) * 2004-04-01 2008-08-12 International Business Machines Corporation Method and system of dynamically adjusting a speech output rate to match a speech input rate
CN1841496A (en) * 2005-03-31 2006-10-04 株式会社东芝 Method and apparatus for measuring speech speed and recording apparatus therefor
CN102543063B (en) * 2011-12-07 2013-07-24 华南理工大学 Method for estimating speech speed of multiple speakers based on segmentation and clustering of speakers
CN103400580A (en) * 2013-07-23 2013-11-20 华南理工大学 Method for estimating importance degree of speaker in multiuser session voice
US9311932B2 (en) * 2014-01-23 2016-04-12 International Business Machines Corporation Adaptive pause detection in speech recognition
KR102072235B1 (en) * 2016-12-08 2020-02-03 한국전자통신연구원 Automatic speaking rate classification method and speech recognition system using thereof
CN111179910A (en) * 2019-12-17 2020-05-19 深圳追一科技有限公司 Speech rate recognition method and apparatus, server, and computer-readable storage medium
CN112466332A (en) * 2020-11-13 2021-03-09 阳光保险集团股份有限公司 Method and device for scoring speed, electronic equipment and storage medium
CN112599152B (en) * 2021-03-05 2021-06-08 北京智慧星光信息技术有限公司 Voice data labeling method, system, electronic equipment and storage medium

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110164420A (en) * 2018-08-02 2019-08-23 腾讯科技(深圳)有限公司 A kind of method and device of the method for speech recognition, voice punctuate
CN110060665A (en) * 2019-03-15 2019-07-26 上海拍拍贷金融信息服务有限公司 Word speed detection method and device, and readable storage medium

Also Published As

Publication number Publication date
CN114067787A (en) 2022-02-18


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant