Disclosure of Invention
In order to solve the above technical problems, the present invention provides a method and a system for speech recognition and conversion, which are used to automatically recognize the language of speech data and convert the speech data into text data having the same language as the speech data.
The embodiment of the invention provides a voice recognition and conversion method, which comprises the following steps:
s101, acquiring voice data to be recognized;
s102, identifying a language family corresponding to the voice data according to a plurality of language family databases;
s103, acquiring the language family database corresponding to the voice data from a plurality of language family databases according to the language family; the language family database comprises a plurality of language category databases;
s104, obtaining the language corresponding to the voice data from the plurality of language category databases;
s105, converting the voice data into text data corresponding to the language according to a text conversion database;
s106, extracting keyword data of the text data;
s107, acquiring keyword voice data corresponding to the keyword data in the voice data, and storing the keyword data and the keyword voice data into the text conversion database.
In one embodiment, the plurality of language family databases include an Indo-European language family database, a Semito-Hamitic language family database, an Altaic language family database, a Uralic language family database, a Caucasian language family database, a Sino-Tibetan language family database, and a Dravidian language family database.
In one embodiment, after step S101 of acquiring the voice data to be recognized, the method further includes preprocessing the voice data; the specific steps are as follows:
detecting and acquiring silent intervals in the voice data;
and filtering the voice data according to the silent intervals to obtain filtered voice data.
In one embodiment, step S102 of identifying a language family corresponding to the voice data according to a plurality of language family databases specifically includes:
obtaining language family data of the voice data; the method specifically comprises the following steps:
dividing the voice data into two segments of sub-voice data of equal duration, and respectively extracting the audio features of the two segments to form two audio feature matrices; and obtaining the language family data through the following formula (1):
wherein F is the language family data, (Y1, Y2, …, Yn) is the first-segment audio feature matrix, and (y1, y2, …, yn) is the second-segment audio feature matrix;
comparing the language family data with preset language family threshold data in a plurality of language family databases to obtain a language family corresponding to the voice data;
the language family threshold data comprises Indo-European threshold data corresponding to the Indo-European language family database, Semito-Hamitic threshold data corresponding to the Semito-Hamitic language family database, Altaic threshold data corresponding to the Altaic language family database, Uralic threshold data corresponding to the Uralic language family database, Caucasian threshold data corresponding to the Caucasian language family database, Sino-Tibetan threshold data corresponding to the Sino-Tibetan language family database, and Dravidian threshold data corresponding to the Dravidian language family database.
In one embodiment, after the step S102, the method further includes:
judging whether the language family identification of the voice data is successful;
if the identification is successful, executing the step S103;
if the recognition fails, calculating inter-family distance data between the language family data of the voice data and each set of language family threshold data;
acquiring the minimum value among the inter-family distances, and taking the language family corresponding to that minimum value as the language family of the voice data;
the inter-family distances include the Indo-European distance between the language family data and the Indo-European threshold data, the Semito-Hamitic distance between the language family data and the Semito-Hamitic threshold data, the Altaic distance between the language family data and the Altaic threshold data, the Uralic distance between the language family data and the Uralic threshold data, the Caucasian distance between the language family data and the Caucasian threshold data, the Sino-Tibetan distance between the language family data and the Sino-Tibetan threshold data, and the Dravidian distance between the language family data and the Dravidian threshold data.
In one embodiment, step S106 of extracting keyword data of the text data specifically includes:
performing word segmentation processing on the text data to obtain a plurality of word groups; the method specifically comprises the following steps:
establishing a word segmentation model; the specific steps are as follows S201-S203:
s201, labeling the first character in the text data as B;
s202, extracting the next character after the character labeled B in the text data and labeling it C; simultaneously extracting all characters preceding the character corresponding to C in the text data and removing duplicates to form a set D; and judging, using formula (2), whether the character labeled B is the end field of a word;
wherein P1 and P2 are intermediate functions; length(D) is the number of words in the set D; P(B) is the probability that the character labeled B occurs; P(C) is the probability that the character labeled C occurs; length(all) is the total length of the text; and P(BC) is the probability that the character labeled B and the character labeled C occur together; if b = B is finally obtained, the label B is unchanged, and if b = E, the label B is changed to label E;
s203, judging whether the character labeled C is the last character; if so, changing label C to label E and ending word segmentation; if not, changing label C to label B and repeating steps S202 and S203;
the step of segmenting the text data comprises the following steps:
adding cutting lines after the start of the text data and after every field labeled E; taking the phrase between each pair of adjacent cutting lines; extracting all phrases to form a phrase vector F1; and removing repeated values from the phrase vector F1 to form a corresponding phrase set F2, the phrases in the set F2 being the phrases obtained after word segmentation, the number of phrases contained in the set F2 being N;
extracting keyword data in the phrases; the method comprises the following specific steps:
firstly, calculating the key score of each phrase in a set F2 by using a formula (3);
wherein Qi is the key score of the i-th phrase in F2; e is the natural constant; length(F2i) is the length of the i-th phrase in F2; P(F2i) is the number of times the i-th phrase in F2 appears in the vector F1; and i = 1, 2, 3, …, N;
determining keyword data using formula (4);
gjc = find(max(Q1, Q2, Q3, …, QN))    (4)
wherein gjc is the finally obtained keyword; find(A) returns the phrase corresponding to the score value A; and max() takes the maximum value; the phrase corresponding to gjc is the determined keyword data.
A speech recognition and conversion system comprises an acquisition module, a language family recognition module, a database selection module, a language recognition module, a text conversion module, a keyword extraction module and a database updating module. The acquisition module is used for acquiring voice data to be recognized;
the language family identification module is used for identifying a language family corresponding to the voice data according to a plurality of language family databases;
the database selection module is used for acquiring the language family database corresponding to the voice data from a plurality of language family databases according to the language family; the language family database comprises a plurality of language category databases;
the language recognition module is used for acquiring the language corresponding to the voice data from the plurality of language category databases;
the text conversion module is used for converting the voice data into text data corresponding to the language according to a text conversion database;
the keyword extraction module is used for extracting keyword data of the text data;
and the database updating module is used for acquiring keyword voice data corresponding to the keyword data in the voice data and storing the keyword data and the keyword voice data into the text conversion database.
In one embodiment, the text conversion database comprises an information category identification unit, a first storage area and a second storage area;
the information category identification unit is used for transmitting the keyword voice data to the first storage area and transmitting the keyword data to the second storage area; the first storage area is used for storing the keyword voice data after it has been encrypted by a first encryption algorithm; the second storage area is used for storing the keyword data after it has been encrypted by a second encryption algorithm; and the first storage area further stores the storage address of the keyword data corresponding to the keyword voice data;
the first encryption algorithm and the second encryption algorithm each comprise one or more of an equal-value encryption algorithm and a symmetric encryption algorithm.
Additional features and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention. The objectives and other advantages of the invention will be realized and attained by the structure particularly pointed out in the written description and claims hereof as well as the appended drawings.
The technical solution of the present invention is further described in detail by the accompanying drawings and embodiments.
Detailed Description
The preferred embodiments of the present invention will be described in conjunction with the accompanying drawings, and it will be understood that they are described herein for the purpose of illustration and explanation and not limitation.
The embodiment of the invention provides a voice recognition and conversion method, as shown in fig. 1, the method comprises the following steps:
s101, acquiring voice data to be recognized;
s102, recognizing a language family corresponding to the voice data according to a plurality of language family databases;
s103, acquiring a language family database corresponding to the voice data from a plurality of language family databases according to the language family; the language family database comprises a plurality of language category databases;
s104, acquiring the language corresponding to the voice data from the plurality of language category databases;
s105, converting the voice data into text data corresponding to languages according to the text conversion database;
s106, extracting keyword data of the text data;
s107, keyword voice data corresponding to the keyword data in the voice data are obtained, and the keyword data and the keyword voice data are stored in a text conversion database.
The working principle of the method is as follows: acquiring a language family corresponding to the voice data to be recognized through a plurality of language family databases; selecting a language family database corresponding to the voice data according to the language family, wherein a plurality of language category databases are stored in the language family database; acquiring the language of the voice data to be recognized through a plurality of language data sub-databases; converting the voice data into text data corresponding to the language according to the text conversion database;
and extracting keyword data in the text data, and acquiring keyword voice data corresponding to the keyword data from the voice data, and transmitting the keyword voice data to a text conversion database for storage.
The beneficial effects of the method are as follows: the language family of the voice data is acquired through the plurality of language family databases; the language of the voice data is acquired through the plurality of language category databases in the language family database; and the voice data is converted into text data in the corresponding language according to the text conversion database, thereby realizing the function of speech recognition and conversion. The method converts the acquired voice data, through language identification, into text data in the same language as the voice data; and conversion of voice data in different languages is realized through the plurality of language family databases and the language category databases within them. Keyword data in the generated text data is extracted, the keyword voice data corresponding to the keyword data is acquired from the voice data, and the keyword voice data and the keyword data are transmitted to the text conversion database for storage, so that the text conversion database is updated and the efficiency of later speech recognition and conversion is further improved. This solves the inconvenience in the traditional technology that the target language must be manually set during voice conversion: the method can automatically identify the language of the voice data and convert the voice data into text data in the same language.
In one embodiment, the plurality of language family databases include an Indo-European language family database, a Semito-Hamitic language family database, an Altaic language family database, a Uralic language family database, a Caucasian language family database, a Sino-Tibetan language family database, and a Dravidian language family database. In this technical solution, the seven language family databases are arranged according to the seven major language families of the world, so that the language family of the voice data can be identified.
In one embodiment, after acquiring the voice data to be recognized in step S101, the method further includes preprocessing the voice data; the specific steps are as follows:
detecting and acquiring silent intervals in the voice data;
and filtering the voice data according to the silent intervals to obtain filtered voice data. In this technical solution, the silent intervals in the voice data are detected and filtered out, which reduces the time required by the subsequent steps and improves working efficiency.
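The silence-filtering preprocessing above can be sketched as follows. This is a minimal energy-based sketch; the 20 ms frame size and the energy threshold are illustrative assumptions, since the specification does not fix a particular detection method.

```python
import numpy as np

def filter_silence(samples, sample_rate, frame_ms=20, energy_threshold=1e-4):
    """Detect silent intervals frame-by-frame and drop them from the signal.

    Frames whose mean energy falls below `energy_threshold` are treated as
    silence and removed; the remaining frames are concatenated to form the
    filtered voice data.
    """
    frame_len = max(1, int(sample_rate * frame_ms / 1000))
    kept = []
    for start in range(0, len(samples), frame_len):
        frame = samples[start:start + frame_len]
        if np.mean(frame.astype(float) ** 2) >= energy_threshold:
            kept.append(frame)
    if not kept:
        return samples[:0]          # everything was silence
    return np.concatenate(kept)
```

As the text notes, removing silence shortens the signal that the subsequent family and language identification steps must process.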
In one embodiment, step S102 of recognizing a language family corresponding to the voice data according to a plurality of language family databases specifically includes:
obtaining language family data of the voice data, specifically: dividing the voice data into two segments of sub-voice data of equal duration, respectively extracting the audio features of the two segments to form two audio feature matrices, and obtaining the language family data through the following formula (1):
wherein F is the language family data, (Y1, Y2, …, Yn) is the first-segment audio feature matrix, and (y1, y2, …, yn) is the second-segment audio feature matrix;
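The feature-extraction step can be sketched as below. Formula (1) itself, which combines the two matrices into the language family data F, is not reproduced in the text, so only the splitting and matrix construction are shown; the per-band FFT magnitudes used as "audio features" here are an illustrative assumption.

```python
import numpy as np

def family_feature_matrices(samples, n_features=8):
    """Split the speech into two equal-duration segments and build one
    audio feature matrix per segment, as the step above describes.

    Features: mean magnitude of the FFT spectrum in n_features bands
    (an assumed feature set, not one fixed by the specification).
    Returns the pair (Y, y) of feature vectors for the two segments.
    """
    half = len(samples) // 2

    def features(segment):
        spectrum = np.abs(np.fft.rfft(segment.astype(float)))
        bands = np.array_split(spectrum, n_features)
        return np.array([band.mean() for band in bands])

    return features(samples[:half]), features(samples[half:2 * half])
```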
comparing the language family data with preset language family threshold data in a plurality of language family databases to obtain a language family corresponding to the voice data;
the language family threshold data comprises Indo-European threshold data corresponding to the Indo-European language family database, Semito-Hamitic threshold data corresponding to the Semito-Hamitic language family database, Altaic threshold data corresponding to the Altaic language family database, Uralic threshold data corresponding to the Uralic language family database, Caucasian threshold data corresponding to the Caucasian language family database, Sino-Tibetan threshold data corresponding to the Sino-Tibetan language family database, and Dravidian threshold data corresponding to the Dravidian language family database. In the above technical solution, the language family data of the voice data is obtained and compared with the language family threshold data corresponding to the plurality of preset language family databases; when the language family data falls within the threshold data range corresponding to a certain language family database, the voice data is determined to belong to the language family corresponding to that database, thereby realizing language family identification of the voice data.
For example: the language family data of the obtained voice data is 3.45; the Indo-European threshold data corresponding to the Indo-European language family database is 1-2, the Semito-Hamitic threshold data corresponding to the Semito-Hamitic language family database is 3-4, the Altaic threshold data corresponding to the Altaic language family database is 5-6, the Uralic threshold data corresponding to the Uralic language family database is 7-8, the Caucasian threshold data corresponding to the Caucasian language family database is 9-10, the Sino-Tibetan threshold data corresponding to the Sino-Tibetan language family database is 11-12, and the Dravidian threshold data corresponding to the Dravidian language family database is 13-14; it is then determined that the language family of the voice data is the Semito-Hamitic language family.
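The threshold comparison in this example can be sketched as follows. The family names are rendered in their standard English forms, and the threshold ranges mirror the worked example above; both are illustrative values, not values fixed by the specification.

```python
# Hypothetical threshold ranges mirroring the worked example in the text.
FAMILY_THRESHOLDS = {
    "Indo-European": (1, 2),
    "Semito-Hamitic": (3, 4),
    "Altaic": (5, 6),
    "Uralic": (7, 8),
    "Caucasian": (9, 10),
    "Sino-Tibetan": (11, 12),
    "Dravidian": (13, 14),
}

def match_family(family_data, thresholds=None):
    """Return the family whose threshold range contains family_data.

    Returns None when no range contains the value, i.e. identification
    fails and the fallback of the next embodiment applies.
    """
    thresholds = thresholds or FAMILY_THRESHOLDS
    for family, (low, high) in thresholds.items():
        if low <= family_data <= high:
            return family
    return None
```

With the example value 3.45, the function returns "Semito-Hamitic", matching the determination in the text.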
In one embodiment, after step S102, the method further comprises:
judging whether language family identification on the voice data is successful or not;
if the identification is successful, executing step S103;
if the recognition fails, calculating inter-family distance data between the language family data of the voice data and each set of language family threshold data;
acquiring the minimum value among the inter-family distances, and taking the language family corresponding to that minimum value as the language family of the voice data;
the inter-family distances include the Indo-European distance between the language family data and the Indo-European threshold data, the Semito-Hamitic distance between the language family data and the Semito-Hamitic threshold data, the Altaic distance between the language family data and the Altaic threshold data, the Uralic distance between the language family data and the Uralic threshold data, the Caucasian distance between the language family data and the Caucasian threshold data, the Sino-Tibetan distance between the language family data and the Sino-Tibetan threshold data, and the Dravidian distance between the language family data and the Dravidian threshold data. In this technical scheme, whether the language family identification of the voice data is successful is judged; after the identification succeeds, the subsequent steps are executed; after the identification fails, the inter-family distance data between the language family data and the plurality of language family threshold data are calculated, and the language family corresponding to the minimum distance is taken as the language family of the voice data, thereby realizing accurate language family identification for all voice data.
For example: the language family data of the obtained voice data is 4.65; the Indo-European threshold data is 1-2, the Semito-Hamitic threshold data is 3-4, the Altaic threshold data is 5-6, the Uralic threshold data is 7-8, the Caucasian threshold data is 9-10, the Sino-Tibetan threshold data is 11-12, and the Dravidian threshold data is 13-14; since the language family data 4.65 does not fall within any threshold data range, the identification fails;
the Indo-European distance between the language family data 4.65 and the Indo-European threshold data 1-2 is calculated to be 2.65; the Semito-Hamitic distance from the threshold data 3-4 is 0.65; the Altaic distance from the threshold data 5-6 is 0.35; the Uralic distance from the threshold data 7-8 is 2.35; the Caucasian distance from the threshold data 9-10 is 4.35; the Sino-Tibetan distance from the threshold data 11-12 is 6.35; and the Dravidian distance from the threshold data 13-14 is 8.35; the minimum value among the inter-family distances is 0.35, so the language family of the voice data is determined to be the Altaic language family.
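The fallback step can be sketched as follows: when no threshold interval contains the language family data, the family whose interval lies closest is chosen. The family names and threshold ranges mirror the worked example and are illustrative.

```python
# Hypothetical threshold ranges mirroring the worked example in the text.
FAMILY_THRESHOLDS = {
    "Indo-European": (1, 2),
    "Semito-Hamitic": (3, 4),
    "Altaic": (5, 6),
    "Uralic": (7, 8),
    "Caucasian": (9, 10),
    "Sino-Tibetan": (11, 12),
    "Dravidian": (13, 14),
}

def nearest_family(family_data, thresholds=None):
    """Fallback for failed identification: pick the family whose threshold
    interval is closest to the language family data (distance 0 if the
    value lies inside the interval)."""
    thresholds = thresholds or FAMILY_THRESHOLDS

    def interval_distance(bounds):
        low, high = bounds
        if family_data < low:
            return low - family_data
        if family_data > high:
            return family_data - high
        return 0.0

    return min(thresholds, key=lambda fam: interval_distance(thresholds[fam]))
```

For the example value 4.65, the smallest distance is 0.35 to the Altaic interval, matching the determination in the text.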
In one embodiment, step S106 of extracting keyword data of the text data specifically includes:
performing word segmentation processing on the text data to obtain a plurality of word groups; the method specifically comprises the following steps:
establishing a word segmentation model; the specific steps are as follows S201-S203:
s201, labeling the first character in the text data as B;
s202, extracting the next character after the character labeled B in the text data and labeling it C; simultaneously extracting all characters preceding the character corresponding to C in the text data and removing duplicates to form a set D; and judging, using formula (2), whether the character labeled B is the end field of a word;
wherein P1 and P2 are intermediate functions; length(D) is the number of words in the set D; P(B) is the probability that the character labeled B occurs; P(C) is the probability that the character labeled C occurs; length(all) is the total length of the text; and P(BC) is the probability that the character labeled B and the character labeled C occur together; if b = B is finally obtained, the label B is unchanged, and if b = E, the label B is changed to label E. By using formula (2), the text data can be segmented without an additional sample database; moreover, when processing the segmentation, only the (j+1)-th character needs to be judged when considering the j-th character, which greatly reduces the amount of judgment computation.
s203, judging whether C is the last character; if so, changing label C to label E and ending word segmentation; if not, changing label C to label B and repeating steps S202 and S203;
the text data word segmentation step comprises the following steps:
adding cutting lines after the start of the text data and after every field labeled E; taking the phrase between each pair of adjacent cutting lines; extracting all phrases to form a phrase vector F1; and removing repeated values from the phrase vector F1 to form a corresponding phrase set F2, the phrases in the set F2 being the phrases obtained after word segmentation, the number of phrases contained in the set F2 being N;
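Steps S201 to S203 and the cutting-line step can be sketched together as below. Since formula (2) itself is not reproduced in the text, the end-of-word decision is supplied by the caller as a stand-in predicate; everything else follows the labeling and cutting procedure described above.

```python
def segment(text, is_word_end):
    """Label characters B/E per steps S201-S203 and cut after each E label.

    `is_word_end(char, next_char, text)` is a stand-in for formula (2),
    which is not reproduced in the source; it must return True when `char`
    is the end field of a word. Returns (F1, F2): the phrase vector with
    repeats and the de-duplicated phrase set.
    """
    labels = []
    for i, ch in enumerate(text):
        if i == len(text) - 1:
            labels.append("E")                  # s203: last character closes a word
        elif is_word_end(ch, text[i + 1], text):
            labels.append("E")
        else:
            labels.append("B")

    # Cut after every E-labeled field; each slice between cuts is a phrase.
    phrases, current = [], ""
    for ch, lab in zip(text, labels):
        current += ch
        if lab == "E":
            phrases.append(current)
            current = ""

    f1 = phrases                       # phrase vector F1 (with repeats)
    f2 = list(dict.fromkeys(f1))       # phrase set F2 (duplicates removed, order kept)
    return f1, f2
```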
extracting keyword data in the phrases; the method comprises the following specific steps:
firstly, calculating the key score of each phrase in a set F2 by using a formula (3);
wherein Qi is the key score of the i-th phrase in F2; e is the natural constant; length(F2i) is the length of the i-th phrase in F2; P(F2i) is the number of times the i-th phrase in F2 appears in the vector F1; and i = 1, 2, 3, …, N. When the keyword data is determined using formula (3), it is not decided solely by which phrase occurs most frequently; the phrase length is also fully considered, which prevents standalone modal particles from being selected as keyword data.
Determining keyword data using formula (4);
gjc = find(max(Q1, Q2, Q3, …, QN))    (4)
wherein gjc is the finally obtained keyword; find(A) returns the phrase corresponding to the score value A; and max() takes the maximum value; the phrase corresponding to gjc is the determined keyword data. With the keyword data determined by the above technical solution, the keyword data can be acquired with a small amount of computation and without any external sample database for the text data, which effectively improves the efficiency of acquiring the keyword data. In the above technical solution, the keyword data in the text data is obtained through formulas (2), (3) and (4), and the keyword data and the keyword voice data are transmitted to the text conversion database through step S107, so that the text conversion database is automatically updated, further improving the text conversion efficiency of step S105.
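The keyword selection of formula (4) can be sketched as follows. Formula (3) is not reproduced in the text, so the default score below is an illustrative stand-in that rewards both frequency in F1 and phrase length, as the description requires; a caller may substitute the real scoring function.

```python
import math

def keyword(f1, f2, score=None):
    """Pick the keyword per formula (4): the phrase in F2 with the maximum
    key score.

    `score` stands in for formula (3); the default combines the phrase's
    occurrence count in F1 with a length factor so that frequent one-character
    particles do not automatically win (an assumed form, not the source's).
    """
    if score is None:
        score = lambda phrase: f1.count(phrase) * (1 - math.exp(-len(phrase)))
    q = [score(p) for p in f2]             # Q1 ... QN
    return f2[q.index(max(q))]             # gjc = find(max(Q1 ... QN))
```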
A speech recognition and conversion system, as shown in FIG. 2, includes an acquisition module 21, a language family recognition module 22, a database selection module 23, a language recognition module 24, a text conversion module 25, a keyword extraction module 26 and a database updating module 27; wherein:
the acquisition module 21 is configured to acquire voice data to be recognized;
a language family recognition module 22, configured to recognize a language family corresponding to the voice data according to a plurality of language family databases;
a database selection module 23, configured to obtain, according to a language family, a language family database corresponding to the voice data from the multiple language family databases; the language family database comprises a plurality of language category databases;
a language recognition module 24, configured to acquire the language corresponding to the voice data from the plurality of language category databases;
a text conversion module 25, configured to convert the voice data into text data corresponding to the language according to the text conversion database;
a keyword extraction module 26 for extracting keyword data of the text data;
and a database updating module 27, configured to obtain keyword voice data corresponding to the keyword data in the voice data, and store the keyword data and the keyword voice data in the text conversion database.
The working principle of the system is as follows: the acquisition module 21 transmits the acquired voice data to the language family recognition module 22; the language family recognition module 22 obtains the language family corresponding to the voice data according to the plurality of language family databases and transmits it to the database selection module 23; the database selection module 23 acquires, according to the language family, the language family database corresponding to the voice data from the plurality of language family databases; the language recognition module 24 obtains the language corresponding to the voice data according to the plurality of language category databases in the language family database; and the text conversion module 25 converts the voice data into text data in the obtained language according to the text conversion database;
the keyword extraction module 26 extracts the keyword data in the text data; and the database updating module 27 acquires, according to the keyword data, the keyword voice data corresponding to the keyword data from the voice data, and transmits the keyword data and the keyword voice data to the text conversion database for storage.
The beneficial effects of the above system are as follows: the language family recognition module acquires the language family of the voice data; the database selection module and the language recognition module acquire the language of the voice data; and the text conversion module converts the voice data into text data in the corresponding language according to the text conversion database, thereby realizing the function of speech recognition and conversion. The system converts the acquired voice data, through language identification, into text data in the same language as the voice data; and conversion of voice data in different languages is realized through the plurality of language family databases and the plurality of language category databases within them. The keyword extraction module extracts keyword data from the generated text data; the database updating module acquires the keyword voice data corresponding to the keyword data in the voice data and transmits the keyword voice data and the keyword data to the text conversion database for storage, so that the text conversion database is updated and the speech recognition and conversion efficiency of the system is further improved. This solves the inconvenience in the traditional technology that the target language must be manually set during voice conversion: the system can automatically identify the language of the voice data and convert the voice data into text data in the same language.
In one embodiment, the text conversion database comprises an information category identification unit, a first storage area and a second storage area;
the information category identification unit is used for transmitting the keyword voice data to the first storage area and transmitting the keyword data to the second storage area; the first storage area is used for storing the keyword voice data after it has been encrypted by a first encryption algorithm; the second storage area is used for storing the keyword data after it has been encrypted by a second encryption algorithm; and the first storage area further stores the storage address of the keyword data corresponding to the keyword voice data;
the first encryption algorithm and the second encryption algorithm each comprise one or more of an equal-value encryption algorithm and a symmetric encryption algorithm. In this technical scheme, the information category identification unit transmits the keyword voice data and the keyword data to the first storage area and the second storage area respectively for storage, and the two storage areas encrypt the stored data with the first and second encryption algorithms respectively, which effectively improves the security of the data stored in the text conversion database.
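The dual-storage-area layout above can be sketched as follows. The XOR cipher and the keys are toy stand-ins for the unspecified first and second encryption algorithms, and all names are hypothetical; the sketch only illustrates the routing, the per-area encryption, and the cross-reference address kept in the first storage area.

```python
def xor_cipher(data: bytes, key: bytes) -> bytes:
    """Toy symmetric cipher (repeating-key XOR) standing in for the
    unspecified first/second encryption algorithms."""
    return bytes(b ^ key[i % len(key)] for i, b in enumerate(data))

class TextConversionDatabase:
    """Sketch of the text conversion database's two storage areas.

    The information-category identification step routes keyword voice data
    to the first storage area and keyword text data to the second; each
    area encrypts with its own key, and the first area also records the
    storage address of the matching keyword text (hypothetical layout).
    """
    def __init__(self, key1=b"k1", key2=b"k2"):
        self.key1, self.key2 = key1, key2
        self.first_area = {}    # address -> (encrypted voice, text address)
        self.second_area = {}   # address -> encrypted keyword text

    def store(self, keyword_text: bytes, keyword_voice: bytes):
        text_addr = len(self.second_area)
        self.second_area[text_addr] = xor_cipher(keyword_text, self.key2)
        voice_addr = len(self.first_area)
        self.first_area[voice_addr] = (xor_cipher(keyword_voice, self.key1),
                                       text_addr)
        return voice_addr, text_addr

    def load_pair(self, voice_addr):
        """Decrypt a stored voice entry and its cross-referenced text."""
        enc_voice, text_addr = self.first_area[voice_addr]
        return (xor_cipher(enc_voice, self.key1),
                xor_cipher(self.second_area[text_addr], self.key2))
```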
It will be apparent to those skilled in the art that various changes and modifications may be made in the present invention without departing from the spirit and scope of the invention. Thus, if such modifications and variations of the present invention fall within the scope of the claims of the present invention and their equivalents, the present invention is also intended to include such modifications and variations.