CN115019777A - Online learning voice recognition response device and method


Publication number
CN115019777A
Authority
CN
China
Prior art keywords
vocabulary, word, words, voice, library
Legal status
Granted
Application number
CN202210695667.2A
Other languages
Chinese (zh)
Other versions
CN115019777B (en)
Inventor
胡劲松 (Hu Jinsong)
冯思铭 (Feng Siming)
贺映玲 (He Yingling)
Current Assignee
South China University of Technology SCUT
Original Assignee
South China University of Technology SCUT
Application filed by South China University of Technology (SCUT)
Priority: CN202210695667.2A
Publication of CN115019777A
Application granted; publication of CN115019777B
Legal status: Active


Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00: Speech recognition
    • G10L 15/02: Feature extraction for speech recognition; Selection of recognition unit
    • G10L 15/08: Speech classification or search
    • G10L 15/22: Procedures used during a speech recognition process, e.g. man-machine dialogue


Abstract

The invention discloses an online learning voice recognition response device and method, which recognize the speech of a telephone conversation as text and give a relevant machine voice response according to that text. In particular, the online learning function of the automatic telephone response device allows it to replace manual telephone customer service, telephone consultation systems, telephone command and decision-making systems, and the like. The invention realizes two-channel analog speech recognition with the sound card of an ordinary computer and, by applying the difference-frequency principle, recognizes and extracts special vocabulary in the conversation speech, thereby improving the speech recognition rate and the accuracy of the answers.

Description

Online learning voice recognition response device and method
Technical Field
The invention relates to the technical field of voice recognition, in particular to a voice recognition response device and method for online learning.
Background
In order to handle the questions raised by customers, many companies operate manual telephone customer-service systems. These require a large number of customer-service staff, which wastes time and labour, increases cost, and makes it difficult to provide consultation services 24 hours a day. With the development of artificial intelligence, some automatic response systems or devices now exist, but most of them can only answer simple questions mechanically, and manual customer-service intervention is still needed in many cases. Several technical problems arise here:
1. Inaccurate telephone speech recognition means that the exact text of the customer's question cannot be obtained, so an accurate answer cannot be found. One important reason is that customer-service response systems usually serve specific professional users, and the response process typically involves a large number of technical terms, as well as place names, store names and device names with specific numbers that are particular to individual local departments or stores. Because the language contains many homophones, existing speech recognition technology often recognizes such rarely used special vocabulary as other, more common words, so the error rate is high and the requirements of professional response are hard to meet. The main cause of this problem is that current speech recognition is based on frequency-priority matching: when speech is converted into pinyin, it preferentially matches the general and popular words that occur more frequently in everyday use;
2. Searching for answers directly from the customer's question sentence is very difficult, because current semantic-understanding technology is still at the research stage and cannot meet commercial requirements. Moreover, the textual expression of human language is variable: the same meaning can be expressed in many ways, so matching against fixed sentence patterns is hard to achieve. As a result, the machine frequently cannot answer the question and manual intervention is needed all the time;
3. Customers' questions are varied and hard to predict, and a fixed answer library can hardly cope with them all;
Furthermore, expert telephone consultation systems, decision-command intelligent response systems and power-dispatching intelligent response systems are response devices or systems that work on the same principle as the customer response system and face the same problems. Smart speakers do not use a telephone, but they also respond by voice, and their responses are not satisfactory. In addition, customer-service systems that respond in text, such as e-commerce customer service, also face problems 2 and 3.
Disclosure of Invention
The first purpose of the invention is to overcome the defects of the prior art and provide an online learning voice recognition response device which can accurately recognize speech as text, respond automatically, and learn online from manual customer service so as to continuously supplement the existing answer library.
The second purpose of the invention is to provide a speech-to-text method for the online learning voice recognition response device.
The third purpose of the invention is to provide an automatic text response method with online learning for the online learning voice recognition response device.
The fourth purpose of the invention is to provide a method, for the online learning voice recognition response device, of extracting words from a sentence for searching and ranking the results.
For the purpose of standardization, the terms used in the present invention are defined as follows. Vocabulary refers to Chinese words; all abbreviations and alternative names of a word are stored together with the word and regarded as the same word. Local special vocabulary refers to words used only on a local machine, within a local area network, or within a specific region, group or department; local special vocabulary and professional terms are collectively called special vocabulary, and all other words are called general vocabulary. Word frequency refers to the frequency with which a word occurs. Difference frequency refers to the difference between two frequencies of a word. Matching means finding the similarity between part of a pinyin string A and the correct pinyin of a Chinese character or word; in the present invention this is also referred to as matching between pinyin and a character or word.
The first purpose of the invention is realized by the following technical scheme: an online learning voice recognition answering device, comprising: the system comprises a voice-to-character module, a response generation unit and a voice synthesis unit;
the voice-to-character module identifies voice digital signals of a questioner as corresponding characters and outputs the corresponding characters to the response generation unit, and the questioner indicates a person who asks a question; the voice-to-character module also identifies the voice digital signals of the respondents as corresponding characters and outputs the characters to the response generating unit, the respondents refer to the people who answer the questions, and the voice-to-character module can respectively realize the conversion of the two different voice sources in a time-sharing working mode;
the response generating unit inquires according to the characters converted by the questioner voice to generate corresponding answer characters; the answer generating unit can generate a new answer by using the characters generated by the answer person voice for later inquiry;
the voice synthesis unit synthesizes voice signals according to the characters output by the response generation unit and outputs the voice signals to the sound production device to realize machine voice response.
Preferably, the online learning voice recognition response device further includes: telephone listeners, sound cards; the voice-to-text module comprises two subunits capable of independently and simultaneously working: a first voice-to-text unit and a second voice-to-text unit;
the first voice-to-character unit identifies the voice digital signals of the respondents as corresponding characters and outputs the corresponding characters to the response generating unit;
the second voice-to-character unit identifies the voice digital signal of the questioner as corresponding characters and outputs the corresponding characters to the response generating unit;
the telephone listener and the answerer's telephone are connected in parallel on the same telephone line, so that the 2 channels of analog voice signals of the conversation between the answerer and the questioner are obtained and output respectively to the first Line in interface and the second Line in interface of the sound card;
the sound card comprises the first Line in interface and the second Line in interface; the 2 channels of analog voice signals of the answerer and the questioner are converted into 2 channels of digital signals by the analog/digital circuits of the sound card and then output respectively to the first voice-to-text unit and the second voice-to-text unit.
Preferably, the sound card, the first voice to text unit, the second voice to text unit, the response generation unit and the voice synthesis unit are all built in the same computer, and the first voice to text unit and the second voice to text unit are respectively realized by two cores of a CPU of the computer in parallel.
Preferably, the online learning voice recognition response device further includes:
the difference frequency special word library unit is used for storing graded special vocabulary and its pinyin for query by the voice-to-text units, thereby improving the matching accuracy of special vocabulary. The level of a word is determined by the difference between its two frequencies: the more frequently the word appears in special data, the higher its level, and the more frequently it appears in general data, the lower its level. A word refers to a Chinese word; all abbreviations and other names of a word are stored together with the word and counted as the same word. Special vocabulary comprises local special vocabulary and professional terms; local special vocabulary refers to words used only on a local machine, within a local area network, or within a specific region, group or department. Special vocabulary of the same level is stored in the same sub-library; the highest sub-library is the first-level sub-library, followed in order by the second-level down to the lowest-level sub-library. The words stored in the difference frequency special word library unit are called difference frequency special vocabulary, or difference frequency vocabulary;
the subject term sharing unit is used for extracting subject terms in existing dialogue texts of a questioner and an answerer and providing the subject terms for the first voice-to-text unit and the second voice-to-text unit for inquiry so as to improve the subsequent dialogue recognition rate, and comprises the following modules:
subject word determination module: it counts repeated words and the number of times they repeat in the text obtained by the first and second voice-to-text units from the existing dialogue speech; if a repeated word is a difference frequency special word, it is added to the subject word queue, otherwise it is discarded;
the subject word queue ordering module: if n dialogue speech sentences have been recognized as n text sentences from the start of speech recognition up to the current speech sentence to be recognized, and the current speech sentence is numbered n + 1, then the topic value of a repeated word is:
[Equation image in the original: the topic value is computed from the sentence indices i, j, … at which the word repeats, the current sentence number n + 1, and the sub-library level G of the word.]
where i and j are the indices of the sentences in which the word repeats, the ellipsis stands for the other sentences in which it repeats, i, j < n, and G is the level (an integer) of the sub-library of the difference frequency special word library to which the word belongs. The topic values of all subject words in the previous n text sentences are calculated, and the subject words are then queued in descending order of topic value to obtain the subject word queue.
Preferably, the difference frequency-specific word library unit includes: the first-level, second-level, third-level and fourth-level sub-library modules are used for storing first-level, second-level, third-level and fourth-level difference frequency vocabularies and difference frequency values thereof, and vocabularies with higher difference frequency values in the same-level sub-library are queued to be ahead in the sub-libraries;
the vocabulary and difference frequency values in the first, second, third and fourth level sub-library modules are obtained and updated by a construction unit, and the construction unit comprises:
the text data acquisition module is used for acquiring text data comprising local professional files, conversation texts, chat texts and keyboard input historical records and searching professional articles on the network, wherein the conversation texts are obtained by the first voice-to-text unit and the second voice-to-text unit and are continuously provided for the text data acquisition module;
the special word frequency dictionary module is used for cleaning and segmenting the collected text data to obtain a word list, then computing and storing the special word frequency of each word, where the special word frequency of a word is the number of times it repeats multiplied by the word length, divided by the total number of words in all the data;
the general word frequency dictionary module is used for segmenting the People's Daily corpus and news data from three websites (Sina, Sohu and NetEase) to obtain a word list, then computing and storing the general word frequency of each word, where the general word frequency of a word is the number of times it repeats multiplied by the word length, divided by the total number of words in all the data;
a difference frequency operation module, configured to perform the difference frequency operation on each word of the special word frequency dictionary, where the difference frequency operation is:
difference frequency value of a word = its special word frequency - k × its general word frequency, where k is a fixed coefficient;
and a difference frequency distribution module, used for storing the top 25% of words by difference frequency value into the first-level sub-library module, words from 26% to 50% into the second-level sub-library module, words from 51% to 75% into the third-level sub-library module, and the remaining words with values greater than 0 into the fourth level; words whose difference frequency value is less than or equal to 0 are discarded.
Preferably, the first voice-to-text unit is the same as the second voice-to-text unit, and both include the following modules:
a level priority matching module: the speech is converted into pinyin to obtain a pinyin string consisting of letters and tones, named A; in the process of turning A into text, A is matched preferentially against the pinyin of the words stored in the first-level sub-library module of the difference frequency special word library unit. If a match succeeds, part of the pinyin of A becomes text; if not, the next level is considered, until the last sub-library module is reached;
a frequency priority matching module: after the level priority matching module finishes matching, matching the rest pinyin of the A with the pinyin of the universal vocabulary, preferentially matching the non-special vocabulary with high frequency in the universal data, and finally matching the rest pinyin with the pinyin of a single Chinese character;
the subject word matching module is used for matching the subject words before the level priority matching module, matching A with a subject word queue, starting from the first subject word in the queue, changing partial pinyin of A into characters if the matching is successful, and considering the next subject word until the last subject word in the queue if the matching is unsuccessful;
wherein the matching is implemented by two modules including:
a phoneme edit distance calculation module: the phoneme edit distance is the minimum number of phoneme editing operations required to convert one pinyin string into another, where a phoneme is an initial or a final of the pinyin, and the permitted editing operations are: inserting an initial/final, deleting an initial/final, and replacing one initial/final with another, where a replacement between mutual fuzzy sounds counts as 0.5; tones are not included in these operations;
a judgment output module: if the matched words are special words, outputting a phoneme editing distance and a matching success signal when the phoneme editing distance is smaller than a given threshold, otherwise, giving a matching failure signal; if the matched words are universal words, outputting the phoneme editing distance;
wherein the level priority matching module comprises:
a reverse word-taking module: the pinyin of the word with the highest difference frequency value is taken from the words not yet matched in the first-level sub-library module and named B; once all words in the first-level sub-library module have been tried, the module proceeds to the next-level sub-library module;
an arbitrary-position pinyin conversion module: a substring C similar to B is searched for in A, and if B matches C successfully, C is converted into the corresponding Chinese word; if several substrings similar to B exist in A, the above operation is repeated; the substring C can be located at any position of A.
Preferably, the response generation unit includes:
the query module: its input is the output of the second voice-to-text unit, i.e. the text sentence of the question asked by the questioner, denoted A2; all the words of A2 form a word set A2S, which is output to the question-answer summary library module for query; the words include subject words, difference frequency words and general words, and A2S is obtained from the voice-to-text module;
a coincidence degree calculation module: for each question text sentence B2 stored in the question-answer summary library, with word set B2S, if the coincidence degree of B2S and A2S is larger than a set threshold, the answer text sentence corresponding to B2 in the question-answer summary library is stored into an answer sequence; in this way several answers are obtained and stored in the answer sequence until the whole question-answer summary library has been searched;
a sorting module: the answers in the answer sequence are arranged from large to small by coincidence degree, and the first answer of the sequence is output to the voice synthesis unit for speech synthesis; if the answer sequence is empty, a signal requesting human intervention is output;
a question-answer summary library: used for storing the questions asked by questioners, the answers to those questions, and the word sets of both, for query by the query module;
an online learning module: its input is the text sentence of the answerer's reply; the reply and its word set are stored as an answer in the question-answer summary library, together with the corresponding question text and its word set; the online learning module is started only when a question is answered manually;
preferably, the degree of coincidence is calculated as follows:
suppose p subject words in the word sets of the two text sentences are the same, sorted from high to low by topic value: subject word 1, subject word 2, …, subject word p; suppose further that r difference frequency words in the two word sets are the same, sorted from high to low by difference frequency value: difference frequency word 1, difference frequency word 2, …, difference frequency word r; and suppose j general words in the two word sets are the same; then:
the coincidence degree of the two word sets = T1 + T2 + … + Tp + Q1 + Q2 + … + Qr + U1 + U2 + … + Uj;
here T1, T2, …, Tp, Q1, Q2, …, Qr, U1, U2, …, Uj are preset weight coefficients;
the preset weight coefficients satisfy:
T1 ≥ T2 ≥ … ≥ Tp ≥ Q1 ≥ Q2 ≥ … ≥ Qr ≥ U, where U denotes any one of U1, U2, …, Uj.
The second purpose of the invention is realized by the following technical scheme: a method for converting voice into text for online learning voice recognition answering device comprises the following steps:
s1, converting the voice into pinyin: analyzing and identifying the digitized signal of the voice, and obtaining a whole sentence pinyin A corresponding to the voice;
s2, carrying out subject word matching on the A;
s3, performing level-first matching on the rest Pinyin of A;
s4, performing frequency priority matching on the rest Pinyin of A;
s5, matching the rest pinyin of the A with a single Chinese character to obtain a complete sentence text;
s6, outputting the sentence characters; and outputting the vocabulary classification obtained by matching of S2, S3, S4 and S5 to the subject word sharing unit, the difference frequency special word bank and the general word frequency dictionary to refresh the subject word queue, the difference frequency value and the sequence and the vocabulary frequency, and outputting the vocabulary classification as a set to the response generating unit.
The third purpose of the invention is realized by the following technical scheme: an automatic text response method with online learning for the online learning voice recognition response device, used for automatically answering a questioner's text questions on a network and automatically learning the answerer's answers, comprising the following steps:
vocabulary extraction: the input is a text sentence A2; A2 is segmented using the subject word sharing unit, the difference frequency word library and the general word library to obtain a word set A2S, which contains subject words, difference frequency words and general words; the questioner is the person who asks the question, and includes a customer;
query: A2S is output to the question-answer summary library for query;
coincidence degree calculation: for each question text sentence B2 stored in the question-answer summary library, with word set B2S, if the coincidence degree of B2S and A2S is larger than a set threshold, the answer text corresponding to B2 in the question-answer summary library is stored into an answer sequence; the query step thus obtains several answers and stores them in the answer sequence until the whole question-answer summary library has been searched;
sorting: the answers in the answer sequence are arranged from large to small by coincidence degree, and the first answer of the sequence is output to the questioner; if the answer sequence is empty, a signal requesting the intervention of an answerer is output; the answerer is the person who answers the question, and includes manual customer service;
the question-answer summary library is used for storing the questions asked by questioners, the answers to those questions, and the word sets of both, for query by the query step;
online learning: the input is the answerer's answer text sentence; the answer and its word set are stored in the question-answer summary library, together with the corresponding question text and its word set; the online learning step is started only when a question is answered manually;
the length of the answer sequence is 1, i.e. only the answer with the largest coincidence degree is retained: when a new answer's coincidence degree is larger than that of the answer retained in the sequence, the old answer is replaced by the new one; otherwise nothing changes.
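To make the query / sort / online-learning loop above concrete, the following minimal sketch stores question word sets with their answers and learns a new pair when no stored question is close enough. The class name, the simple overlap score and the threshold value are illustrative assumptions; the patent's actual coincidence degree uses the per-word weights defined elsewhere in this document.

```python
# Minimal sketch of the question-answer summary library with online learning.
# The overlap score and threshold are simplified stand-ins for the patent's
# weighted coincidence degree; all names here are illustrative.

class QASummaryLibrary:
    def __init__(self, threshold=0.5):
        self.entries = []            # list of (question word set, answer text)
        self.threshold = threshold

    def query(self, question_words):
        """Return the answer whose stored question overlaps most with the
        input word set (answer sequence of length 1), or None."""
        best_answer, best_score = None, 0.0
        for stored_words, answer in self.entries:
            common = set(stored_words) & set(question_words)
            score = len(common) / max(len(stored_words), 1)
            if score > self.threshold and score > best_score:
                best_answer, best_score = answer, score
        return best_answer

    def learn(self, question_words, answer_text):
        """Online learning: store the answerer's reply together with the
        corresponding question word set as a new question-answer pair."""
        self.entries.append((list(question_words), answer_text))


lib = QASummaryLibrary()
q = ["green light flagship version", "laser rangefinder", "function"]
if lib.query(q) is None:             # empty answer sequence: a human answers
    lib.learn(q, "The green light flagship version laser rangefinder has three functions ...")
print(lib.query(q))                  # the learned answer is now found automatically
```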
The fourth purpose of the invention is realized by the following technical scheme: a method for on-line learning a speech recognition answering machine to extract words from sentences for searching and to order the results, comprising:
vocabulary extraction: the input is a text sentence A2; A2 is segmented using the subject word sharing unit, the difference frequency word library and the general word library to obtain a word set A2S, which contains difference frequency words and general words;
searching: A2S is used to search the network or a local database, obtaining several results C1, C2, …, Ci, …, Cm, where Ci denotes the i-th result;
coincidence degree calculation: suppose Ci contains r of the difference frequency words in A2S, sorted from high to low by their difference frequency values: difference frequency word 1, difference frequency word 2, …, difference frequency word r; if Ci also contains j of the general words in A2S, then:
the coincidence degree of Ci = Q1 + Q2 + … + Qr + U1 + U2 + … + Uj, where Q1, Q2, …, Qr, U1, U2, …, Uj are preset coefficients;
result ordering: after the coincidence degrees of C1, C2, …, Ci, …, Cm have been calculated, the results are re-ordered from high to low by coincidence degree and output;
the preset coefficients satisfy Q1 ≥ Q2 ≥ … ≥ Qr ≥ U, where U denotes any one of U1, U2, …, Uj;
the word segmentation comprises the following steps:
level-priority comparison: given a piece of text A2, it is compared preferentially with the words stored in the first-level sub-library module of the difference frequency special word library unit; if a comparison succeeds, that part of A2 is divided off as a word and stored in a word set named A2S; if not, the next level is considered, until the last-level sub-library module is reached. The words in A2S are ordered by the time at which they were divided off. A comparison obtains the character similarity between part of the characters of A2 and a Chinese word in the word library;
frequency-priority comparison: after the level-priority comparison is finished, the remaining characters of A2 are compared with the characters of general words, with the non-special words of high frequency in the general data compared first; the remaining characters are finally stored into A2S;
the comparison comprises the following steps:
reverse word extraction: the word with the highest difference frequency value is taken from the words not yet compared in the first-level sub-library module and named D; once all words in the first-level sub-library module have been compared, the process continues with the next-level sub-library module;
arbitrary-position division: a character string E similar to D is searched for in A2; if E compares successfully with D, E is divided off as the corresponding Chinese word; if several substrings similar to D exist in A2, the above operation is repeated; E can be located at any position of A2.
Compared with the prior art, the invention has the following advantages and beneficial effects:
1. The two speech recognition functions required by an automatic telephone customer-service answering system can be realized on an ordinary desktop computer: speech recognition of the customer's call and of the manual customer-service agent's call. No separate speech recognition devices are needed, which saves cost.
2. The invention can automatically distinguish general vocabulary from special vocabulary, especially local special vocabulary, so that each regional department does not need to build its word library manually; the special vocabulary is stored in the graded difference frequency special word library and is continuously refreshed, updated and replaced, saving customer-service staff a great deal of time and energy.
3. The invention gives matching priority to level and special vocabulary, reducing the errors caused by existing speech recognition methods that favour popular general vocabulary, and thereby improving the accuracy of speech recognition.
4. The invention can learn the answers of manual customer service online and, while learning, continuously supplement the existing answer library, so as to cope better with customers' varied questions.
5. The coincidence degree method of the invention can distinguish the importance of each word in a sentence and match answers more accurately.
Drawings
Fig. 1 is a block diagram showing the structure of an online learning speech recognition response device.
Fig. 2 is a flow chart of an automatic text response method for online learning.
Fig. 3 is a flow chart of a voice-to-text process.
Detailed Description
Besides an automatic customer-service answering system, the device of the present invention can also be applied to expert consultation systems, intelligent power dispatching systems, telephone decision-support systems and remote disease diagnosis systems, in which one party mainly asks questions or reports the specific situation on site to seek corresponding countermeasures, while the other party mainly answers questions or gives decisions, such as the conversation between a power dispatcher and an on-site maintenance operator, between a commander at a decision centre and an on-site operator, or the remote diagnostic conversation between a doctor and a patient. For uniformity, the two parties are hereinafter called the questioner and the answerer. The answerer's reply is not necessarily the final answer; it may also offer the questioner further choices or questions that guide the questioner to describe the problem more clearly. A text sentence refers to a segment of text of unlimited length.
The present invention will be described in further detail with reference to examples and drawings, but the present invention is not limited thereto.
Referring to fig. 1, the embodiment discloses an online learning voice recognition response device, which includes:
the system comprises a telephone monitor, a first impedance matching device K1, a second impedance matching device K2, a sound card, a first voice-to-text unit M1, a second voice-to-text unit M2, a difference frequency special word library unit, a construction unit, a subject word sharing unit, a response generation unit and a voice synthesis unit.
The telephone monitor and the telephone set are connected in parallel on the same telephone line, so that the 2 channels of analog voice signals of the conversation between the answerer (for example, a manual customer-service agent) and the questioner (for example, a customer) are obtained and output respectively to the first impedance matching device K1 and the second impedance matching device K2. The telephone monitor does not affect answering or making calls manually, which makes it convenient for manual customer service to intervene at any time.
The impedance of the first impedance matching device K1 and the impedance of the second impedance matching device K2 can be adjusted, so that the intensity of the input analog voice signal changes to adapt to the signal intensity requirement of the Line in interface of the sound card, and the first impedance matching device K1 and the second impedance matching device K2 respectively correspond to the first Line in interface 1 and the second Line in interface 2 which are output to the sound card. Of course, if the strength of the analog voice signal is just within the range of adaptation of the sound card, the impedance matching device may not be used.
In fig. 1, the sound card includes a first Line in interface 1 and a second Line in interface 2, the 2 input interfaces respectively receive 2 analog signals of the answering person conversation voice and the questioner conversation voice, and convert the analog signals into 2 digital voice signals through 2 analog/digital circuits of the sound card, wherein the digital signals of the questioner voice are output to a second voice to text unit M2, and the answering person is output to a first voice to text unit M1.
In fig. 1, a first speech-to-text unit M1 receives the digital speech signal of the answer-to-speech, recognizes it as corresponding words and outputs them, and these words are simultaneously used as the input of the construction unit for updating the difference frequency vocabulary and the difference frequency value; these words are also used as input of the shared unit of formation subject words, which is used to extract the subject words of the telephone call between the answering person and the questioner.
In fig. 1, the second speech-to-text unit M2 receives the digitized signal of the questioner speech, recognizes it as corresponding text and outputs it; the characters can be simultaneously used as the input of a construction unit and used for updating difference frequency words and difference frequency values; the characters can be used as the input of the subject word sharing unit at the same time for extracting the subject words in the telephone call text of the answering person and the questioner, and because M1 and M2 share the difference frequency special word bank unit and the subject word sharing unit, the two processes of M1 and M2 can work independently, can interact and complement each other, and are promoted together, thereby improving the recognition accuracy. Details of M1 and M2 are further detailed below.
The sound card, the first voice-to-text unit M1, the second voice-to-text unit M2, the response generation unit and the voice synthesis unit are all built in the same computer, and the first voice-to-text unit M1 and the second voice-to-text unit M2 are respectively realized by two cores of a CPU of the computer in parallel.
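As a rough illustration of how the first and second voice-to-text units can run in parallel on two CPU cores, the sketch below starts one worker process per conversation channel using Python's standard multiprocessing module; the recognizer is only a stub, and the actual sound-card capture of the two Line in channels is not shown.

```python
# Rough sketch: one recognition process per conversation channel (M1/M2),
# running in parallel on separate CPU cores. The recognizer is a stub;
# audio capture from the two Line-in channels is not shown.

import multiprocessing as mp

def recognize_worker(name, audio_queue, text_queue):
    """Stub voice-to-text worker for one channel."""
    while True:
        chunk = audio_queue.get()
        if chunk is None:                       # poison pill: stop the worker
            break
        text_queue.put((name, f"recognized({chunk})"))   # placeholder result

if __name__ == "__main__":
    answerer_audio, questioner_audio = mp.Queue(), mp.Queue()
    texts = mp.Queue()
    m1 = mp.Process(target=recognize_worker, args=("M1-answerer", answerer_audio, texts))
    m2 = mp.Process(target=recognize_worker, args=("M2-questioner", questioner_audio, texts))
    m1.start(); m2.start()

    questioner_audio.put("frame-001")           # questioner channel (Line-in 2)
    answerer_audio.put("frame-002")             # answerer channel (Line-in 1)
    print(texts.get()); print(texts.get())

    answerer_audio.put(None); questioner_audio.put(None)
    m1.join(); m2.join()
```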
In the telephone answering device, special vocabulary is more important than general vocabulary, so its recognition rate must be ensured first; this is why a special vocabulary library is established. Furthermore, matching similarity is affected by noisy environments, so when the matching similarities differ little, special vocabulary of higher level is matched preferentially.
The difference frequency special word bank unit is used for storing graded special words and pinyin thereof so as to be inquired by the two voice-to-character units, so that the matching accuracy of the special words is improved, the level of the words is determined by the difference frequency value of the words, the words refer to Chinese words, all short words and other names of the words are stored together with the words and are regarded as the same words, the special words comprise local special words and professional terms, the local special words refer to words only used in a local machine, a local area network, a specific region, a group or a department, the special words at the same level are stored in the same sub-bank, the highest sub-bank is a first-level sub-bank, the highest sub-bank is a second-level to fourth-level sub-bank in sequence below and is used for storing first-level, second-level, third-level and fourth-level difference frequency words and difference frequency values thereof, and the words with higher difference frequency values in the same level sub-bank are queued ahead in the sub-bank.
In addition, the device can build the difference frequency special word library automatically by program. To distinguish specialized words from ordinary words automatically, their differences must be exploited. Consider specialized vocabulary, particularly local specialized vocabulary, as in the customer's question "What functions does the green light flagship version laser rangefinder have?": here "green light flagship version" is a special word that generally does not appear in general news or articles, but may appear in local documents, local browser records, local chat records, local keyboard input records, local store equipment records, local call text records, and so on. In contrast, general words such as "function" appear frequently in ordinary articles or online documents, while the professional word "laser rangefinder" may appear in local documents, academic articles and news reports. This patent therefore proposes that the level of a word is determined by the difference between its two frequencies: the more frequently it occurs in the specific data, the higher its level, and the more frequently it occurs in the general data, the lower its level.
The construction unit is used for automatically constructing the difference frequency special word bank and updating the vocabulary and the difference frequency value in the difference frequency special word bank unit, and comprises the following steps:
1) the text data acquisition module is used for acquiring text data comprising local files, local browser records, local chat records, local keyboard input records, local store equipment records, local call text records and the like, searching professional academic articles on the network, wherein the call texts are obtained by the first voice-to-text unit M1 and the second voice-to-text unit M2 and are continuously provided for the text data acquisition module;
2) the special word frequency dictionary module is used for cleaning and segmenting the collected text data to obtain a word list, then computing and storing the special word frequency of each word, where the special word frequency of a word is the number of times it repeats multiplied by the word length, divided by the total number of words in all the data;
3) the general word frequency dictionary module is used for segmenting the People's Daily corpus and news from three websites (Sina, Sohu and NetEase) to obtain a word list, then computing and storing the general word frequency of each word, where the general word frequency of a word is the number of times it repeats multiplied by the word length, divided by the total number of words in all the data;
4) a difference frequency operation module, configured to perform the difference frequency operation on each word of the special word frequency dictionary, where the difference frequency operation is:
difference frequency value of a word = its special word frequency - k × its general word frequency, where k is a fixed coefficient;
5) a difference frequency distribution module, used for storing the top 25% of words by difference frequency value into the first-level sub-library module, words from 26% to 50% into the second-level sub-library module, words from 51% to 75% into the third-level sub-library module, and the remaining words with values greater than 0 into the fourth level; words whose difference frequency value is less than or equal to 0 are discarded.
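To illustrate steps 2) to 5) above, the sketch below computes toy word frequencies, difference frequency values and the four-level distribution. The corpora, counts and the coefficient k are invented example values, not data from the patent.

```python
# Sketch of the difference-frequency library construction: word frequency =
# repetitions x word length / total words, difference frequency = special
# frequency - k x general frequency, then a four-level split by rank.
# All counts and the value of k are toy examples.

def word_frequency(counts, total_words):
    return {w: n * len(w) / total_words for w, n in counts.items()}

def build_difference_library(special_counts, special_total,
                             general_counts, general_total, k=1.0):
    special_freq = word_frequency(special_counts, special_total)
    general_freq = word_frequency(general_counts, general_total)
    diff = {w: special_freq[w] - k * general_freq.get(w, 0.0) for w in special_freq}
    kept = sorted(((v, w) for w, v in diff.items() if v > 0), reverse=True)

    levels = {1: [], 2: [], 3: [], 4: []}
    n = len(kept)
    for rank, (value, word) in enumerate(kept):
        if rank < 0.25 * n:
            level = 1                           # top 25% of difference values
        elif rank < 0.50 * n:
            level = 2
        elif rank < 0.75 * n:
            level = 3
        else:
            level = 4
        levels[level].append((word, value))     # higher values queue first in a sub-library
    return levels

special = {"绿光旗舰版": 12, "激光测距仪": 9, "功能": 30}   # counts in local/special data
general = {"功能": 500, "激光测距仪": 5}                    # counts in general news data
print(build_difference_library(special, 2000, general, 100000, k=5.0))
# "功能" gets a difference value <= 0 and is discarded as a general word
```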
When a client and a customer service person carry out voice conversation and communication, large background noise exists often, so that the accuracy of voice recognition is seriously reduced. In a noisy environment, some words and phrases may not be clearly heard, and people can often guess some words and phrases which are not clearly heard from the context of a conversation, but the current speech recognition algorithm only considers the recognition of single-sentence speech and cannot utilize coherent subject semantics in the context of the conversation, which is also the weakness of the current speech recognition algorithm. A preferable scheme is that subject word matching is added before level priority matching, so that the subject of the conversation is clarified, and the recognition rate of the whole conversation can be improved.
The topic word sharing unit in fig. 1 is used for extracting topic words in existing dialog texts of a questioner (customer) and an answerer (customer service person), providing the topic words to a first speech-to-text unit M1 and a second speech-to-text unit M2 for query, so as to improve the recognition rate of subsequent dialogs, and includes the following modules:
1) a topic word determination module: it counts repeated words and the number of times they repeat in the text obtained by the first voice-to-text unit M1 and the second voice-to-text unit M2 from the existing dialogue speech; if a repeated word is a difference frequency special word, it is added to the topic word queue, otherwise it is discarded;
2) a topic word queue ordering module: if n dialogue sentences have been recognized from the beginning of speech recognition up to the current sentence to be recognized, and the current sentence is numbered n + 1, then the topic value of a repeated word is:
[Equation image in the original: the topic value is computed from the sentence indices i, j, … at which the word repeats, the current sentence number n + 1, and the sub-library level G of the word.]
where i and j are the indices of the sentences in which the word repeats, the ellipsis stands for the other sentences in which it repeats, i, j < n, and G is the level of the sub-library of the difference frequency special word library to which the word belongs, an integer from 1 to 4. The topic values of all topic words in the first n sentences are calculated, and the topic words are queued in descending order of topic value to obtain the topic word queue;
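The exact topic-value formula appears only as an equation image in the original document. The sketch below therefore uses an assumed form that is consistent with the description: each sentence i in which the word repeats contributes more the closer i is to the current sentence n + 1, and the sum is divided by the sub-library level G (1 = highest). The formula, like all names here, is an assumption for illustration, not the patent's own equation.

```python
# Topic-word queue sketch. The topic-value formula below is an ASSUMED form
# (sum of 1/(n+1-i) over the repeat sentences i, divided by level G); the
# patent gives its formula only as an image.

def topic_value(repeat_sentences, n, level_g):
    return sum(1.0 / (n + 1 - i) for i in repeat_sentences) / level_g

def topic_word_queue(repeats, n):
    """repeats: {word: (sentence indices i <= n where it repeats, level G)}."""
    scored = [(topic_value(idx, n, g), w) for w, (idx, g) in repeats.items()]
    scored.sort(reverse=True)                    # larger topic values queue first
    return [w for _, w in scored]

# After n = 6 recognized sentences, before recognizing sentence 7:
repeats = {"激光测距仪": ([2, 5], 2),            # repeats in sentences 2 and 5, level-2 word
           "绿光旗舰版": ([4, 6], 1)}            # repeats in sentences 4 and 6, level-1 word
print(topic_word_queue(repeats, 6))              # the level-1 word with recent repeats leads
```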
the first M1 and the second M2 phonetic-to-text units in FIG. 1 are the same and include the following modules:
1) subject word matching module: the method comprises the following steps that a phonetic string consisting of letters and tones is obtained after voice is converted into pinyin, the name of the phonetic string is set to be A, in the process that A becomes a character, subject word matching is carried out firstly, the A is matched with a subject word queue, from the first subject word in the queue, if matching is successful, partial pinyin of A becomes a character, and if matching is unsuccessful, the next subject word is considered until the last subject word in the queue; the module is started only when the questioner and the answerer have a telephone conversation, otherwise, the module directly enters the level priority matching module.
2) A level priority matching module: after the subject words are matched, the rest Pinyin of the A is preferentially matched with the Pinyin of the vocabulary stored in the first-level sub-library module of the difference frequency special word library unit, if the matching is successful, part of Pinyin of the A is changed into characters, and if the matching is unsuccessful, the next level is considered until the last four levels of sub-library modules are reached; the level priority matching module comprises two sub-modules: the reverse word-taking module is used for taking the pinyin of the vocabulary with the highest difference frequency value from the unmatched vocabularies in the first-stage sub-library module, setting the name of the pinyin as B, and sequentially extending the pinyin to the next-stage sub-library module if the vocabularies in the first-stage sub-library module are matched; and the random position conversion pinyin module searches substrings C similar to B in the A, and if the B is successfully matched with the C, the C is converted into corresponding Chinese words. If A has a plurality of substrings similar to B, repeating the above operations; the substring C can be located at any position of a.
3) A frequency priority matching module: after the level priority matching module finishes matching, the remaining pinyin of the A is matched with the pinyin of the universal vocabulary, the non-special vocabulary with high frequency in the universal data is matched with the priority, and finally the remaining pinyin is matched with the pinyin of a single Chinese character.
The matching used in the voice-to-text units is realized by a matching module. Matching of pinyin, vocabulary and characters can be realized by known methods; the invention provides a preferred matching scheme, which comprises:
1) a phoneme edit distance calculation module: the phoneme edit distance is the minimum number of phoneme editing operations required to convert one pinyin string into the other, where a phoneme is an initial or a final of the pinyin, and the permitted editing operations are: inserting an initial/final, deleting an initial/final, and replacing one initial/final with another, where a replacement between mutual fuzzy sounds counts only 0.5. Example: suppose the correct pinyin of the station name is "yue4 tan2 zhan4", but because the speaker's Mandarin is not standard it is pronounced "yue4 tan2 zhang4"; the correct pinyin is obtained by replacing the final ang with an, and since an and ang are mutual fuzzy sounds, the phoneme edit distance is 0.5.
2) a judgement output module: if the matched word is a special word, the phoneme edit distance and a match-success signal are output when the phoneme edit distance is smaller than a given threshold, otherwise a match-failure signal is given; if the matched word is a general word, the phoneme edit distance is output.
The tones of the pinyin are not considered, because China has many dialects whose pronunciations differ greatly from place to place, many people find it hard to distinguish tones, and tones are also affected by changes of intonation and mood.
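The phoneme edit distance and the 0.5-cost fuzzy-sound substitution can be implemented as a standard dynamic program over initials and finals, as in the sketch below. The fuzzy-sound table is only a partial illustration, and the syllable decomposition is written out by hand.

```python
# Phoneme edit distance: dynamic programming over initials and finals, with
# substitutions between mutual fuzzy sounds counted as 0.5 and tones ignored.
# The fuzzy-sound pairs listed here are a partial, illustrative set.

FUZZY = {frozenset(p) for p in [("an", "ang"), ("en", "eng"), ("in", "ing"),
                                ("z", "zh"), ("c", "ch"), ("s", "sh"),
                                ("l", "n"), ("f", "h")]}

def sub_cost(a, b):
    if a == b:
        return 0.0
    return 0.5 if frozenset((a, b)) in FUZZY else 1.0

def phoneme_edit_distance(p, q):
    """p, q: lists of phonemes (initials/finals) of two pinyin strings."""
    m, n = len(p), len(q)
    d = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = float(i)                      # i deletions
    for j in range(1, n + 1):
        d[0][j] = float(j)                      # j insertions
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(d[i - 1][j] + 1.0,    # delete a phoneme
                          d[i][j - 1] + 1.0,    # insert a phoneme
                          d[i - 1][j - 1] + sub_cost(p[i - 1], q[j - 1]))
    return d[m][n]

# correct "yue tan zhan" vs. mispronounced "yue tan zhang": one an/ang swap
correct = ["y", "ue", "t", "an", "zh", "an"]
spoken  = ["y", "ue", "t", "an", "zh", "ang"]
print(phoneme_edit_distance(correct, spoken))   # -> 0.5
```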
The answer generating unit in fig. 1 is a core of the apparatus, and is configured to generate an answer, and includes:
the query module: its input is the question text asked by the questioner. In the example above, the question "What functions does the green light flagship version laser rangefinder have?" is A2. Because the topic words, difference frequency words and general words of A2 are obtained during the voice-to-text matching process, these words are arranged in time order, so the word set of A2 is A2S = {green light flagship version (difference frequency word), laser rangefinder (professional word), function (general word), what (general word), have (general word)}, which is output to the question-answer summary library module for query. Suppose a question stored in the question-answer summary library has text sentence B2 and word set B2S = {green light flagship version (difference frequency word), laser rangefinder (professional word), function (general word)}, and the coincidence degree of B2S and A2S is larger than the set threshold; then the answer text sentence corresponding to B2 in the question-answer summary library, {the green light flagship version laser rangefinder has three functions: countdown measurement, universal level-bubble measurement and secondary Pythagorean measurement}, is stored into the answer sequence. The question-answer summary library continues to be searched with A2S, and the several answers obtained are stored in the answer sequence until the search of the question-answer summary library is finished. The answers in the answer sequence are arranged from large to small by coincidence degree, and the first answer of the sequence is output to the voice synthesis unit for speech synthesis; if the answer sequence is empty, a signal requesting human intervention is output, the question is answered by manual customer service, and the learning module is started;
a question-answer summary library: used for storing the questions asked by questioners, the answers to those questions, and the word sets of both, for query by the query module;
a learning module: its input is the answer text of the answerer; the answer and its word set are stored as an answer in the question-answer summary library, together with the corresponding question text and its word set; the learning module is started only when a question is answered manually. In the example above, suppose nothing in the answer library matches A2S; the answer sequence is then empty and the manual customer-service agent, seeing the intervention-request signal, picks up the phone and answers "The green light flagship version laser rangefinder has three functions: countdown measurement, universal level-bubble measurement and secondary Pythagorean measurement". This speech is converted into a text sentence by the second voice-to-text unit M2 and stored as an answer in the question-answer summary library, and A2S is stored correspondingly in the question-answer summary library, the two being bound into a question-answer pair.
The coincidence degree in the query module is calculated as follows:
suppose p topic words in the word sets of the two text sentences are the same, sorted from high to low by topic value: topic word 1, topic word 2, …, topic word p; suppose further that r difference frequency words in the two word sets are the same, sorted from high to low by difference frequency value: difference frequency word 1, difference frequency word 2, …, difference frequency word r; and suppose j general words in the two word sets are the same; then:
the coincidence degree of the two word sets = T1 + T2 + … + Tp + Q1 + Q2 + … + Qr + U1 + U2 + … + Uj, where T1, T2, …, Tp, Q1, Q2, …, Qr, U1, U2, …, Uj are preset weight coefficients. More important words have higher weights, so T1 ≥ T2 ≥ … ≥ Tp ≥ Q1 ≥ Q2 ≥ … ≥ Qr ≥ U, where U denotes any one of U1, U2, …, Uj.
Example: A2S = {green light flagship version (difference frequency word), laser rangefinder (professional word), function (general word), what (general word), have (general word)}, B2S = {green light flagship version (difference frequency word), laser rangefinder (professional word), function (general word)}; the coincident words are green light flagship version (difference frequency word), laser rangefinder (professional word) and function (general word), and the coincidence degree = 0.6 + 0.3 + 0.1 = 1. In this example "green light flagship version" is the most important word and "laser rangefinder" the second most important, while "function" is an unimportant common general word that contributes least to the query. Through the coincidence degree calculation, a sentence can thus be decomposed according to the importance of its words and searched with them as if they were keywords; by contrast, the various network search engines in use today require keywords to be entered manually, cannot search by whole sentences, demand a certain skill, and are not friendly to beginners.
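The worked example above can be reproduced with the small sketch below, where the weights 0.6, 0.3 and 0.1 are the example values used in the text; in practice the weights are design parameters subject only to T1 ≥ … ≥ Tp ≥ Q1 ≥ … ≥ Qr ≥ U. The professional word "laser rangefinder" is treated here as a difference frequency word.

```python
# Coincidence degree of two word sets, reproducing the worked example above.
# The words in each set are assumed already ordered by descending importance
# (topic value / difference frequency); the weights are example values.

def coincidence(a2s, b2s, topic_w, diff_w, general_w):
    """a2s, b2s: {word: category}, category in {'topic', 'diff', 'general'}."""
    common   = [(w, c) for w, c in a2s.items() if w in b2s]
    topics   = [w for w, c in common if c == "topic"]
    diffs    = [w for w, c in common if c == "diff"]
    generals = [w for w, c in common if c == "general"]
    score  = sum(topic_w[i]   for i in range(min(len(topics),   len(topic_w))))
    score += sum(diff_w[i]    for i in range(min(len(diffs),    len(diff_w))))
    score += sum(general_w[i] for i in range(min(len(generals), len(general_w))))
    return score

A2S = {"绿光旗舰版": "diff", "激光测距仪": "diff", "功能": "general",
       "什么": "general", "有": "general"}
B2S = {"绿光旗舰版": "diff", "激光测距仪": "diff", "功能": "general"}

# Q1 = 0.6, Q2 = 0.3 for the two coincident difference words, U = 0.1 per general word
score = coincidence(A2S, B2S, topic_w=[], diff_w=[0.6, 0.3], general_w=[0.1, 0.1, 0.1])
print(round(score, 6))                           # -> 1.0, matching the example
```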
Furthermore, the response generation unit of the invention can be used independently as an automatic text response function: when the input is text, no speech recognition device is needed, which opens up several new application scenarios for the invention.
The first new application is network chat question answering, for example with Taobao's AliWangwang chat tool, where both the customer and the customer-service agent type text manually. In this case the answer generation method differs slightly from the response generation unit of Fig. 1; its flow is shown in Fig. 2. Because there is no speech-to-pinyin step in this application, the topic words, difference frequency words and general words must be extracted from the text sentence typed by the customer. The method is similar to pinyin matching, except that character comparison is used instead of pinyin matching. For example, for the customer's typed sentence A2 "What functions does the green light flagship version laser rangefinder have?", the difference frequency words in the sentence, ordered by difference frequency value, are: green light flagship version > laser rangefinder; the others are general words. 1) Reverse word extraction: words are taken one by one from the first-level sub-library in order of difference frequency value, and for each word A2 is searched for a substring that compares successfully with it; 2) arbitrary-position division: existing methods start word segmentation from the first character, whereas here a substring may be divided off at any position of the string A2; if the comparison difference is larger than a given threshold, the substring is abandoned and the next word is taken, until "green light flagship version" compares successfully with the corresponding part of A2, so that A2 becomes [green light flagship version | what functions does the laser rangefinder have]. Reverse word extraction and arbitrary-position division are designed specifically for difference frequency special vocabulary and differ from currently known methods. The remaining special vocabulary of string A2 is then divided off in the same way, and finally the general words, giving [green light flagship version | laser rangefinder | have | what | functions]. The subsequent steps are the same as in the response generation unit of Fig. 1.
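A simplified sketch of the reverse word extraction and arbitrary-position division described in this paragraph is given below, applied to the example sentence. Exact substring matching stands in for the character-similarity comparison with a threshold, and the sub-libraries and general word list are toy examples.

```python
# Sketch of text-input word segmentation: level-priority comparison with
# reverse word extraction and arbitrary-position division, then general
# words. Exact substring matching replaces the similarity comparison.

def segment(sentence, sub_libraries, general_words):
    """sub_libraries: list of levels, each a list of (word, diff_value)."""
    a2s = []                                         # words, in order of division
    remaining = sentence
    for level in sub_libraries:                      # level 1 first, then level 2, ...
        for word, _value in sorted(level, key=lambda x: -x[1]):  # reverse word extraction
            while word in remaining:                 # arbitrary position in the string
                a2s.append(word)
                remaining = remaining.replace(word, "|" * len(word), 1)  # cut the span out
    for word in general_words:                       # then the general vocabulary
        while word in remaining:
            a2s.append(word)
            remaining = remaining.replace(word, "|" * len(word), 1)
    leftovers = [ch for ch in remaining if ch != "|"]
    return a2s + leftovers                           # remaining single characters last

level1  = [("绿光旗舰版", 0.9)]                      # toy difference frequency values
level2  = [("激光测距仪", 0.6)]
general = ["功能", "什么", "有"]
print(segment("绿光旗舰版激光测距仪有什么功能", [level1, level2], general))
# -> ['绿光旗舰版', '激光测距仪', '功能', '什么', '有']
```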
As shown in fig. 2, the automatic text answering method for online learning provided by the present invention is used for automatically answering text questions of a questioner on a network and automatically learning answers of the questioner, and includes:
Vocabulary extraction step: the input sentence is a text sentence A2; A2 is segmented using the subject word sharing unit, the difference frequency vocabulary library and the general vocabulary library to obtain a vocabulary set A2S. The vocabulary set comprises subject words, difference frequency words and general words. The questioner is the person asking the question, including a client.
Query step: A2S is output to the question-answer summary library for querying.
Coincidence-degree calculation step: for a question B2 stored in the question-answer summary library, with vocabulary set B2S, if the coincidence degree of B2S and A2S exceeds a set threshold, the answer text corresponding to B2 is stored in an answer sequence. The query step may yield several such answers, which are stored in the answer sequence until the whole question-answer summary library has been searched.
Sorting step: the answer sequence is ordered by coincidence degree from large to small, and the first answer of the sequence is output to the questioner; if the answer sequence is empty, a signal requesting the intervention of an answerer is output. The answerer is the person answering the question, including a human customer-service agent.
Question-answer summary library: stores the questions asked by questioners, the answers to those questions, and the vocabulary sets of both, for use in the query step.
Online learning step: the input is the answerer's answer text; the answer and its vocabulary set are stored in the question-answer summary library, together with the corresponding question text and its vocabulary set. The online learning step is activated only when a question is answered manually.
The length of the answer sequence may be set to 1, i.e. only the answer with the largest coincidence degree is kept: when a new answer has a larger coincidence degree than the answer currently kept in the sequence, the old answer is replaced by the new one; otherwise the sequence is unchanged.
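The query, coincidence-degree, sorting and online-learning steps above can be summarized in a short Python sketch; the class name, threshold value and data structures are illustrative assumptions, and coincidence_degree refers to the helper shown in the earlier sketch.

```python
# Sketch of the text auto-answering flow: query the Q&A summary library,
# keep answers whose coincidence degree exceeds a threshold, sort them,
# and learn new question/answer pairs when a human answers.
# All names and the threshold value are illustrative.

class QASummaryLibrary:
    def __init__(self):
        self.entries = []            # list of (question_text, question_vocab, answer_text)

    def learn(self, question_text, question_vocab, answer_text):
        """Online learning step: store a manually supplied answer."""
        self.entries.append((question_text, question_vocab, answer_text))

    def answer(self, a2s, threshold=0.5):
        """Query + coincidence-degree + sorting steps."""
        answers = []
        for _, b2s, answer_text in self.entries:
            degree = coincidence_degree(a2s, b2s)     # helper from the earlier sketch
            if degree > threshold:
                answers.append((degree, answer_text))
        if not answers:
            return None                               # request human intervention
        answers.sort(key=lambda x: x[0], reverse=True)
        return answers[0][1]                          # first answer of the sorted sequence
```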
In addition, the automatic text-response method can also be used in devices such as smart speakers that search by spoken sentences.
The second new application is a method of extracting words from a sentence, searching with them and ranking the results, which comprises the following steps:
Vocabulary extraction step: the input sentence is a text sentence A2; A2 is segmented using the subject word sharing unit, the difference frequency vocabulary library and the general vocabulary library to obtain a vocabulary set A2S containing difference frequency words and general words. The segmentation method is the same as in the first new application.
Search step: a network or local database is searched with A2S, yielding several results C1, C2, … Ci, … Cm.
Coincidence-degree calculation step: suppose Ci contains r of the difference frequency words in A2S, sorted by difference frequency value from high to low as difference frequency word 1, difference frequency word 2, …; and suppose Ci also contains j of the general words in A2S. Then:
the coincidence degree of Ci = Q1 + Q2 + … + Qr + U1 + U2 + … + Uj, where Q1, Q2 … Qr and U1, U2 … Uj are preset coefficients.
Result sorting step: after the coincidence degree has been calculated for C1, C2, … Ci, … Cm, the results are reordered from high to low coincidence degree and output.
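A minimal sketch of this coincidence-based result ranking is shown below; the coefficient values Q and U are illustrative (chosen non-increasing, in line with the constraint stated later in the claims), and substring membership stands in for whatever matching a real search backend would use.

```python
# Sketch: rank search results Ci by coincidence degree with the words of A2S.
# Q and U are preset, non-increasing coefficient lists (values are illustrative).

Q = [0.5, 0.3, 0.2]    # coefficients for difference frequency words, highest first
U = [0.1, 0.1, 0.1]    # coefficients for general words

def result_coincidence(result_text, diff_freq_words, general_words):
    """diff_freq_words must already be sorted by difference frequency value, high to low."""
    hits_q = [w for w in diff_freq_words if w in result_text]
    hits_u = [w for w in general_words if w in result_text]
    return sum(Q[:len(hits_q)]) + sum(U[:len(hits_u)])

def rank_results(results, diff_freq_words, general_words):
    scored = [(result_coincidence(c, diff_freq_words, general_words), c) for c in results]
    return [c for _, c in sorted(scored, key=lambda x: x[0], reverse=True)]
```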
Fig. 3 shows the specific method and flow for converting speech into text, which comprises the following steps:
S1, converting the speech into pinyin. The digitized speech signal is analyzed and recognized with a known deep-learning speech recognition algorithm to obtain the whole-sentence pinyin corresponding to the speech. For example, in a power telephone dispatching response system, the dispatcher receives the following speech from a field operator: "put into operation the Yue Pond station Yue Steel Xiang Stone line 35 earthing knife switch and 36 earthing knife switch". The conversion in this step yields [tou2 ru4 yue4 tan2 zhan4 yue4 gang1 xiang1 shi2 xian4 san1 wu3 jie1 di4 dao1 zha2 he2 san1 liu4 jie1 di4 dao1 zha2], which is called pinyin string A;
S2, the subject word matching module: the subject word sharing unit is queried and A is matched against its subject word queue, starting from the first subject word in the queue; if a match succeeds, the corresponding part of the pinyin of A is converted into characters, and if not, the next subject word is considered, up to the last subject word in the queue;
S3, the level priority matching module matches the remaining pinyin in A against Chinese text, querying the difference frequency special lexicon. For example, "Yue Pond station", "Yue Steel Xiang Stone line" and "earthing knife switch" are special words whose difference frequency values sort them as: Yue Pond station (level 1), Yue Steel Xiang Stone line (level 2), earthing knife switch (level 3). 1) Reverse word extraction: words are taken one by one from the first-level sub-library in order of difference frequency value, and for each word taken, pinyin string A is searched for a matching substring. The current matching approach takes pinyin from string A and looks it up in the vocabulary library; the method of this patent works in the opposite direction, hence the name reverse word extraction. 2) Arbitrary-position conversion: existing methods start converting characters from the first letter; this method is different in that a substring may be converted at any position of string A. If the matching difference exceeds a given threshold the substring is abandoned and the next word is taken, until the pinyin of Yue Pond station, "yue4 tan2 zhan4", matches the corresponding part of pinyin string A, so that A becomes [tou2 ru4 Yue Pond station yue4 gang1 xiang1 shi2 xian4 san1 wu3 jie1 di4 dao1 zha2 he2 san1 liu4 jie1 di4 dao1 zha2]. Reverse word extraction and arbitrary-position conversion are designed specifically for the difference frequency special vocabulary and differ from currently known methods. Similarly, the remaining special vocabulary of string A is converted next: [tou2 ru4 Yue Pond station Yue Steel Xiang Stone line san1 wu3 earthing knife switch he2 san1 liu4 earthing knife switch];
S4, the frequency priority matching module matches the remaining pinyin in A against the general vocabulary. Once all the special words in string A have been converted, the general words are matched according to the known frequency-priority method: going from front to back, tou2 ru4 is taken and looked up in the general dictionary, yielding "put into", so that string A becomes: [put into Yue Pond station Yue Steel Xiang Stone line san1 wu3 earthing knife switch he2 san1 liu4 earthing knife switch];
S5, the remaining pinyin is matched against single Chinese characters, yielding the complete sentence text (put into operation the Yue Pond station Yue Steel Xiang Stone line 35 earthing knife switch and 36 earthing knife switch);
S6, the whole-sentence text is output, and the vocabulary classifications obtained by the matching in S2, S3, S4 and S5 are output to the subject word sharing module, the difference frequency special lexicon and the general word frequency dictionary to refresh the subject word queue, the difference frequency values and their ordering, and the word frequencies. Example: the difference frequency values of the difference frequency words "Yue Pond station" and "Yue Steel Xiang Stone line" are refreshed and their ordering in the difference frequency lexicon is updated; difference frequency words that did not appear need not be refreshed every time. If a word also appeared in a previous sentence, the subject word queue is refreshed; if it did not appear before, the word is newly added to the end of the queue.
S7, if speech input continues, go to S1; otherwise go to the next step;
S8, end.
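As an illustration of the level priority and frequency priority matching in steps S3 and S4, the following Python sketch converts a pinyin string with a miniature lexicon; the sub-library contents, the use of exact substring matching in place of the thresholded phoneme-edit-distance comparison, and all names are assumptions made for the example only (the single-character matching of S5 is omitted).

```python
# Sketch of level priority matching on a pinyin string (steps S3/S4 above).
# Lexicon entries map a word to its pinyin; exact matching stands in for the
# thresholded phoneme-edit-distance comparison. All data are illustrative.

LEVEL_SUBLIBRARIES = [
    # level 1 ... level n sub-libraries, each sorted by difference frequency value
    [("Yue Pond station", "yue4 tan2 zhan4")],
    [("Yue Steel Xiang Stone line", "yue4 gang1 xiang1 shi2 xian4")],
    [("earthing knife switch", "jie1 di4 dao1 zha2")],
]
GENERAL_DICTIONARY = [("put into", "tou2 ru4")]

def convert(tokens, word, pinyin):
    """Arbitrary-position conversion: replace the word's pinyin wherever it occurs."""
    syllables = pinyin.split()
    i, out = 0, []
    while i < len(tokens):
        if tokens[i:i + len(syllables)] == syllables:
            out.append(word)
            i += len(syllables)
        else:
            out.append(tokens[i])
            i += 1
    return out

def level_priority_match(pinyin_string):
    tokens = pinyin_string.split()               # mixed list of pinyin syllables / converted words
    for sublibrary in LEVEL_SUBLIBRARIES:        # level priority: level 1 first
        for word, pinyin in sublibrary:          # reverse word extraction: lexicon -> string
            tokens = convert(tokens, word, pinyin)
    for word, pinyin in GENERAL_DICTIONARY:      # then frequency-priority general words
        tokens = convert(tokens, word, pinyin)
    return tokens

A = ("tou2 ru4 yue4 tan2 zhan4 yue4 gang1 xiang1 shi2 xian4 "
     "san1 wu3 jie1 di4 dao1 zha2 he2 san1 liu4 jie1 di4 dao1 zha2")
print(level_priority_match(A))
```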
The above embodiments are preferred embodiments of the present invention, but the present invention is not limited to the above embodiments, and any other changes, modifications, substitutions, combinations, and simplifications which do not depart from the spirit and principle of the present invention should be construed as equivalents thereof, and all such changes, modifications, substitutions, combinations, and simplifications are intended to be included in the scope of the present invention.

Claims (10)

1. An online learning voice recognition answering device, comprising: a voice-to-character module, a response generation unit and a voice synthesis unit;
the voice-to-character module identifies voice digital signals of a questioner as corresponding characters and outputs the corresponding characters to the response generation unit, and the questioner indicates a person who asks a question; the voice-to-character module also identifies the voice digital signals of the respondents as corresponding characters and outputs the characters to the response generating unit, the respondents refer to the people who answer the questions, and the voice-to-character module can respectively realize the conversion of the two different voice sources in a time-sharing working mode;
the response generation unit performs a query according to the characters converted from the questioner's voice and generates corresponding answer characters; the response generation unit can also generate a new answer from the characters converted from the answerer's voice, for use in later queries;
the voice synthesis unit synthesizes voice signals according to the characters output by the response generation unit and outputs the voice signals to the sound production device to realize machine voice response.
2. The online learning voice recognition answering device according to claim 1, further comprising a telephone listener and a sound card; the voice-to-character module comprises two sub-units capable of working independently and simultaneously: a first voice-to-character unit and a second voice-to-character unit;
the first voice-to-character unit identifies the voice digital signals of the respondents as corresponding characters and outputs the corresponding characters to the response generating unit;
the second voice-to-character unit identifies the voice digital signal of the questioner as corresponding characters and outputs the corresponding characters to the response generating unit;
the telephone listener and the answerer's telephone are connected in parallel on the same telephone line, so that 2 channels of analog voice signals of the conversation between the answerer and the questioner are obtained and output respectively to a first Line-in interface and a second Line-in interface of the sound card;
the sound card comprises the first Line-in interface and the second Line-in interface; the 2 channels of analog voice signals of the answerer and the questioner are converted into 2 channels of digital signals by the analog/digital circuit of the sound card and then output respectively to the first voice-to-character unit and the second voice-to-character unit.
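As a rough prototype of this two-channel arrangement on an ordinary computer, the two Line-in signals could be captured as the left and right channels of one stereo recording; the Python sketch below uses the third-party sounddevice package, and the sample rate, block length and channel assignment are assumptions rather than part of the claimed device.

```python
# Sketch: capture the answerer/questioner analog signals as the two channels
# of one stereo line-in and hand each channel to its own speech-to-text unit.
# Uses the third-party "sounddevice" package; an illustration only.
import sounddevice as sd

FS = 16000                       # sample rate commonly used for speech recognition
DURATION = 5                     # seconds per capture block (illustrative)

audio = sd.rec(int(DURATION * FS), samplerate=FS, channels=2, dtype="int16")
sd.wait()                        # block until the recording is finished

answerer_signal = audio[:, 0]    # first Line-in channel  -> first voice-to-character unit
questioner_signal = audio[:, 1]  # second Line-in channel -> second voice-to-character unit
```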
3. The online learning voice recognition answering device according to claim 1, further comprising:
a difference frequency special lexicon unit for storing graded special words and their pinyin, for query by the voice-to-character module, thereby improving the matching accuracy of special words; the level of a word is determined by the difference between two frequencies of the word: the more frequently the word appears in special data, the higher its level, and the more frequently it appears in general data, the lower its level; a word here means a Chinese word, and all abbreviations and alternative names of a word are stored together with it and counted as the same word; the special words comprise local special words and professional terms, where a local special word is a word used only on a local machine, in a local area network, or within a specific region, group or department; special words of the same level are stored in the same sub-library, the highest being the first-level sub-library, followed in order by the second-level sub-library down to the lowest sub-library; the words stored in the difference frequency special lexicon unit are called difference frequency special words or difference frequency words;
a subject word sharing unit for extracting subject words from the existing dialogue text of the questioner and the answerer and providing them to the first voice-to-character unit and the second voice-to-character unit for query, so as to improve the recognition rate of the subsequent dialogue, the unit comprising the following modules:
a subject word determination module: counts repeated words and the number of times they are repeated in the existing dialogue text, i.e. the text obtained by the first and second voice-to-character units from the dialogue speech recognized so far; if a repeated word is a difference frequency special word it is added to the subject word queue, otherwise it is discarded;
the subject word queue ordering module: if n dialogue speech sentences are recognized as n character sentences from the start of the speech recognition to the current speech sentence to be recognized, and the current speech sentence to be recognized is numbered as the (n + 1) th sentence, the topic value of a repeated vocabulary is:
(topic value formula, presented as an image in the original publication: Figure FDA0003702398460000021)
wherein i and j indicate that the word is repeated in the i-th and j-th sentences, the ellipsis represents the other sentences in which the word is repeated, i, j < n, G is the level of the sub-library of the difference frequency special lexicon to which the word belongs, and G is an integer; the topic values of all subject words in the previous n text sentences are calculated, and the subject words are then queued from large to small topic value to obtain the subject word queue.
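A sketch of the subject word determination and queueing of claim 3 could look like the following; because the topic-value formula itself appears only as an image in the original publication, it is left as a function supplied by the caller, and all other names are illustrative.

```python
# Sketch: build the subject word queue from the n recognized sentences.
# The topic-value formula is supplied by the caller because the original
# formula is published only as an image; all names are illustrative.

def build_subject_word_queue(sentences_vocab, diff_freq_level, topic_value):
    """
    sentences_vocab : list of vocabulary sets, one per recognized sentence (1..n)
    diff_freq_level : dict word -> sub-library level G (difference frequency words only)
    topic_value     : callable(positions, level, n) implementing the patented formula
    """
    n = len(sentences_vocab)
    positions = {}                                   # word -> sentence indices where it appears
    for idx, vocab in enumerate(sentences_vocab, start=1):
        for word in vocab:
            positions.setdefault(word, []).append(idx)

    queue = []
    for word, idxs in positions.items():
        if len(idxs) < 2 or word not in diff_freq_level:
            continue                                 # keep only repeated difference frequency words
        queue.append((topic_value(idxs, diff_freq_level[word], n), word))
    queue.sort(reverse=True)                         # largest topic value first
    return [word for _, word in queue]
```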
4. The online learning voice recognition answering device according to claim 3, wherein the difference frequency special lexicon unit comprises first-level, second-level, third-level and fourth-level sub-library modules for storing first-level, second-level, third-level and fourth-level difference frequency words and their difference frequency values; within a sub-library of the same level, words with higher difference frequency values are queued nearer the front;
the vocabulary and difference frequency values in the first, second, third and fourth level sub-library modules are obtained and updated by a construction unit, and the construction unit comprises:
the text data acquisition module is used for acquiring text data comprising local professional files, conversation texts, chat texts and keyboard input historical records and searching professional articles on the network, wherein the conversation texts are obtained by the first voice-to-text unit and the second voice-to-text unit and are continuously provided for the text data acquisition module;
a special word frequency dictionary module for cleaning and segmenting the collected text data to obtain a word list, then performing special word frequency statistics on the word list and storing it; the special word frequency of a word = (number of times the word is repeated × word length) / (total number of words in all the data);
a general word frequency dictionary module for performing word segmentation on the People's Daily corpus and on news data from the three websites Sina, Sohu and NetEase to obtain a word list, then performing general word frequency statistics on the word list and storing it; the general word frequency of a word = (number of times the word is repeated × word length) / (total number of words in all the data);
a difference frequency operation module, configured to perform difference frequency operation on each vocabulary of the special word frequency dictionary, where the difference frequency operation is:
the difference frequency value of a word = its special word frequency − k × its general word frequency, where k is a fixed coefficient;
a difference frequency distribution module for storing the top 25% of words by difference frequency value in the first-level sub-library module, the words in the 26% to 50% range in the second-level sub-library module, the words in the 51% to 75% range in the third-level sub-library module, and the remaining words with difference frequency values greater than 0 in the fourth level; words whose difference frequency value is less than or equal to 0 are discarded.
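A minimal sketch of the construction unit of claim 4, assuming an illustrative value for the fixed coefficient k and a simple quartile split for the four sub-libraries:

```python
# Sketch: word frequency = repetitions * word length / total word count,
# difference frequency = special frequency - k * general frequency,
# then distribute words with positive values into four quartile sub-libraries.
from collections import Counter

K = 1.0                                   # fixed coefficient k (illustrative value)

def word_frequencies(word_list):
    total = len(word_list)
    counts = Counter(word_list)
    return {w: counts[w] * len(w) / total for w in counts}

def build_sublibraries(special_words, general_words):
    special_freq = word_frequencies(special_words)
    general_freq = word_frequencies(general_words)
    diff = {w: special_freq[w] - K * general_freq.get(w, 0.0) for w in special_freq}
    ranked = sorted((w for w in diff if diff[w] > 0), key=diff.get, reverse=True)
    q = max(1, len(ranked) // 4)
    sublibraries = [ranked[:q], ranked[q:2 * q], ranked[2 * q:3 * q], ranked[3 * q:]]
    return sublibraries, diff
```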
5. The online learning voice recognition answering device according to claim 2, wherein the first voice-to-character unit is identical to the second voice-to-character unit, and each comprises the following modules:
a level priority matching module: the speech is converted into pinyin to obtain a pinyin string A consisting of letters and tones; in the process of converting A into characters, A is matched preferentially against the pinyin of the words stored in the first-level sub-library module of the difference frequency special lexicon unit; if a match succeeds, the corresponding part of the pinyin of A is converted into characters, and if not, the next level is considered, down to the last sub-library module;
a frequency priority matching module: after the level priority matching module has finished, the remaining pinyin of A is matched against the pinyin of the general vocabulary, non-special words with high frequency in general data being matched preferentially; finally the remaining pinyin is matched against the pinyin of single Chinese characters;
the subject word matching module is used for matching the subject words before the level priority matching module, matching A with a subject word queue, starting from the first subject word in the queue, changing partial pinyin of A into characters if the matching is successful, and considering the next subject word until the last subject word in the queue if the matching is unsuccessful; wherein the matching is implemented by two modules including:
a phoneme edit distance calculation module: the phoneme edit distance is the minimum number of phoneme editing operations required to convert one pinyin string into another, a phoneme being an initial or a final of the pinyin; the permitted editing operations are: inserting an initial/final, deleting an initial/final, and replacing one initial/final with another, where a replacement between fuzzy sounds is counted as 0.5 of an operation; tones are not involved in these operations;
a judgment output module: if the matched words are special words, outputting a phoneme editing distance and a matching success signal when the phoneme editing distance is smaller than a given threshold value, otherwise, giving a matching failure signal; if the matched words are universal words, outputting the phoneme editing distance;
wherein the level priority matching module comprises:
a reverse word extraction module: the pinyin of the word with the highest difference frequency value among the not-yet-matched words in the first-level sub-library module is taken and named B; when all the words of the first-level sub-library module have been matched, the module continues with the next-level sub-library module;
an arbitrary-position pinyin conversion module: a substring C similar to B is searched for in A, and if B matches C successfully, C is converted into the corresponding Chinese word; if several substrings similar to B exist in A, the above operation is repeated; the substring C may be located at any position of A.
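The phoneme edit distance of claim 5 (insertions, deletions and substitutions of initials/finals, with substitutions between fuzzy-sound pairs counted as 0.5 and tones ignored) might be sketched as follows; the fuzzy pairs listed are common examples and are assumptions rather than a list taken from the disclosure.

```python
# Sketch: phoneme edit distance over initials/finals, tones ignored.
# Substitution between fuzzy-sound pairs costs 0.5; other edits cost 1.
# The fuzzy pairs below are common examples, not taken from the disclosure.

FUZZY_PAIRS = {frozenset(p) for p in [("z", "zh"), ("c", "ch"), ("s", "sh"), ("n", "l")]}

def substitution_cost(a, b):
    if a == b:
        return 0.0
    return 0.5 if frozenset((a, b)) in FUZZY_PAIRS else 1.0

def phoneme_edit_distance(phonemes_a, phonemes_b):
    """phonemes_a/b: lists of initials and finals with tones already stripped."""
    m, n = len(phonemes_a), len(phonemes_b)
    d = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = float(i)
    for j in range(1, n + 1):
        d[0][j] = float(j)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(
                d[i - 1][j] + 1,                                            # delete
                d[i][j - 1] + 1,                                            # insert
                d[i - 1][j - 1] + substitution_cost(phonemes_a[i - 1], phonemes_b[j - 1]),
            )
    return d[m][n]

# "zhan" vs "zan": zh/z is a fuzzy pair, so the distance is 0.5
print(phoneme_edit_distance(["zh", "an"], ["z", "an"]))
```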
6. The online learning voice recognition answering device according to claim 1, wherein the response generation unit comprises:
a query module: its input is the output of the second voice-to-character unit, namely the text sentence A2 of the question asked by the questioner; all the words of A2 form a vocabulary set A2S, which is output to the question-answer summary library module for querying; the words comprise subject words, difference frequency words and general words, and A2S is obtained from the voice-to-character module;
a coincidence-degree calculation module: for a question B2 stored in the question-answer summary library, with vocabulary set B2S, if the coincidence degree of B2S and A2S exceeds a set threshold, the answer text corresponding to B2 is stored in an answer sequence; in this way several answers are obtained and stored in the answer sequence until the whole question-answer summary library has been searched;
a sorting module: the answer sequence is ordered by coincidence degree from large to small, and the first answer of the sequence is output to the voice synthesis unit for synthesis and playback; if the answer sequence is empty, a signal requesting the intervention of an answerer is output;
a question-answer summary library: for storing the questions asked by questioners, the answers to those questions, and the vocabulary sets of both, for query by the query module;
an online learning module: its input is the answerer's answer text; the answer and its vocabulary set are stored in the question-answer summary library as an answer, together with the corresponding question text and its vocabulary set; the online learning module is activated only when a question is answered manually.
7. The online learning voice recognition answering device according to claim 6, wherein the coincidence degree is calculated as follows:
suppose p subject words are the same in the vocabulary sets of the two text sentences, sorted by topic value from high to low as subject word 1, subject word 2, …, subject word p; suppose r difference frequency words are the same in the two vocabulary sets, sorted by difference frequency value from high to low as difference frequency word 1, difference frequency word 2, …; and suppose j general words are the same in the two vocabulary sets; then:
the coincidence degree of the two vocabulary sets = T1 + T2 + … + Tp + Q1 + Q2 + … + Qr + U1 + U2 + … + Uj,
where T1, T2 … Tp, Q1, Q2 … Qr, U1, U2 … Uj are preset weight coefficients;
the preset weight coefficients T1, T2 … Tp, Q1, Q2 … Qr, U1, U2 … Uj satisfy:
T1 ≥ T2 ≥ … ≥ Tp ≥ Q1 ≥ Q2 ≥ … ≥ Qr ≥ U, where U denotes any one of U1, U2 … Uj.
8. The method for converting speech into text for the on-line learning speech recognition answering device of any one of claims 1-7, comprising the steps of:
S1, converting the speech into pinyin: analyzing and recognizing the digitized speech signal to obtain the whole-sentence pinyin A corresponding to the speech;
S2, performing subject word matching on A;
S3, performing level priority matching on the remaining pinyin of A;
S4, performing frequency priority matching on the remaining pinyin of A;
S5, matching the remaining pinyin of A against single Chinese characters to obtain the complete sentence text;
S6, outputting the sentence text, outputting the vocabulary classifications obtained by the matching in S2, S3, S4 and S5 to the subject word sharing unit, the difference frequency special lexicon and the general word frequency dictionary to refresh the subject word queue, the difference frequency values and their ordering, and the word frequencies, and outputting the vocabulary classifications as a set to the response generation unit.
9. The automatic character answering method for on-line learning of the on-line learning voice recognition answering device of any one of claims 1 to 7, which is used for automatically answering a character question of a questioner on a network and automatically learning an answer of the answerer, comprising:
a vocabulary extraction step: the input sentence is a text sentence A2; A2 is segmented using the subject word sharing unit, the difference frequency vocabulary library and the general vocabulary library to obtain a vocabulary set A2S; the vocabulary set comprises subject words, difference frequency words and general words, and the questioner is the person asking the question, including a client;
a query step: A2S is output to the question-answer summary library for querying;
a coincidence-degree calculation step: for a question B2 stored in the question-answer summary library, with vocabulary set B2S, if the coincidence degree of B2S and A2S exceeds a set threshold, the answer text corresponding to B2 is stored in an answer sequence; the query step may yield several answers, which are stored in the answer sequence until the whole question-answer summary library has been searched;
a sorting step: the answer sequence is ordered by coincidence degree from large to small, and the first answer of the sequence is output to the questioner; if the answer sequence is empty, a signal requesting the intervention of an answerer is output, the answerer being the person answering the question, including a human customer-service agent;
a question-answer summary library: stores the questions asked by questioners, the answers to those questions, and the vocabulary sets of both, for use in the query step;
an online learning step: the input is the answerer's answer text; the answer and its vocabulary set are stored in the question-answer summary library as an answer, together with the corresponding question text and its vocabulary set; the online learning step is activated only when a question is answered manually.
10. A method of extracting words from sentences, searching and ranking the results, for the online learning voice recognition answering device of any one of claims 1-7, comprising:
a vocabulary extraction step: the input sentence is a text sentence A2; A2 is segmented using the subject word sharing unit, the difference frequency vocabulary library and the general vocabulary library to obtain a vocabulary set A2S containing difference frequency words and general words;
a search step: a network or local database is searched with A2S, yielding several results C1, C2, … Ci, … Cm, where Ci denotes the i-th result;
a coincidence-degree calculation step: suppose Ci contains r of the difference frequency words in A2S, sorted by difference frequency value from high to low as difference frequency word 1, difference frequency word 2, …; and suppose Ci also contains j of the general words in A2S; then:
the coincidence degree of Ci = Q1 + Q2 + … + Qr + U1 + U2 + … + Uj, where Q1, Q2 … Qr and U1, U2 … Uj are preset coefficients;
a result sorting step: after the coincidence degree has been calculated for C1, C2, … Ci, … Cm, the results are reordered from high to low coincidence degree and output;
the preset coefficients Q1, Q2 … Qr, U1, U2 … Uj satisfy: Q1 ≥ Q2 ≥ … ≥ Qr ≥ U, where U denotes any one of U1, U2 … Uj;
the word segmentation comprises the following steps:
level priority comparison: given a text passage A2, it is compared preferentially with the characters of the words stored in the first-level sub-library module of the difference frequency special lexicon unit; if a comparison succeeds, the corresponding part of A2 is divided out as a word and stored in a vocabulary set named A2S, and if not, the next level is considered, down to the last sub-library module; the words in A2S are ordered by the time at which they were divided out; the comparison obtains the character similarity between a part of the text of A2 and a Chinese word in the lexicon;
frequency priority comparison: after the level priority comparison is completed, the remaining characters of A2 are compared with the characters of the general words, non-special words with high frequency in general data being compared preferentially, and finally the remaining characters are stored in A2S;
the comparison comprises the following steps:
reverse word extraction: the word with the highest difference frequency value among the not-yet-compared words in the first-level sub-library module is taken and named D; when all the words of the first-level sub-library module have been compared, the process continues with the next-level sub-library module;
arbitrary-position division: a character string E similar to D is searched for in A2; if E compares successfully with D, E is divided out as the corresponding Chinese word; if several substrings similar to D exist in A2, the above operation is repeated; E may be located at any position of A2.
CN202210695667.2A 2022-06-20 2022-06-20 Online learning voice recognition response device and method Active CN115019777B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210695667.2A CN115019777B (en) 2022-06-20 2022-06-20 Online learning voice recognition response device and method


Publications (2)

Publication Number Publication Date
CN115019777A 2022-09-06
CN115019777B CN115019777B (en) 2024-03-08

Family

ID=83075313


Country Status (1)

Country Link
CN (1) CN115019777B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2006135542A (en) * 2004-11-04 2006-05-25 Tokyo Electric Power Co Inc:The Telephone answer diagnostic system
KR20070058953A (en) * 2005-12-05 2007-06-11 한국전자통신연구원 Method and apparatus for generating a response sentence in dialogue system
CN111462743A (en) * 2020-03-30 2020-07-28 北京声智科技有限公司 Voice signal processing method and device
CN112786024A (en) * 2020-12-28 2021-05-11 华南理工大学 Voice command recognition method under condition of no professional voice data in water treatment field


Also Published As

Publication number Publication date
CN115019777B (en) 2024-03-08


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant