CN115019777B

CN115019777B - Online learning voice recognition response device and method

Info

Publication number: CN115019777B
Application number: CN202210695667.2A
Authority: CN
Inventors: 胡劲松; 冯思铭; 贺映玲
Original assignee: South China University of Technology SCUT
Current assignee: South China University of Technology SCUT
Priority date: 2022-06-20
Filing date: 2022-06-20
Publication date: 2024-03-08
Anticipated expiration: 2042-06-20
Also published as: CN115019777A

Abstract

The invention discloses an online learning voice recognition response device and method that recognize the speech of a telephone call as text and give a relevant machine voice response based on that text. The automatic telephone answering device has an online learning function and can replace manual telephone customer service, telephone consultation systems, telephone command and decision systems, and the like. The invention performs two-channel analog voice recognition using the sound card of an ordinary computer and, by applying a difference frequency principle, recognizes and extracts special vocabulary in the dialogue speech, improving both the speech recognition rate and the response accuracy.

Description

Online learning voice recognition response device and method

Technical Field

The invention relates to the technical field of voice recognition, in particular to an online learning voice recognition response device and method.

Background

To handle customer questions, many companies run manual telephone customer service systems. These require a large number of customer service staff, consume time and effort, and can rarely provide consultation 24 hours a day. With the development of artificial intelligence, automatic response systems and devices now exist, but most of them can only answer simple questions mechanically and often need a human agent to intervene. Several technical problems remain:

1. Telephone speech recognition is inaccurate, so the exact text of the customer's question cannot be obtained and an accurate answer cannot be found. One important reason is that a customer service response system usually serves a specific, specialized group of users, and the dialogue usually involves many technical terms as well as place names, store names, and numbered equipment names specific to particular departments or stores. Because the language contains many homophones, current speech recognition technology often recognizes these frequently used special words as other, common words, so the error rate is high and the requirements of professional response are hard to meet. The main cause is that current speech recognition is based on frequency priority matching: once speech has been converted to pinyin, the general and popular words with the highest everyday frequency are matched first;

2. Finding an answer directly from the customer's question sentence is difficult. Semantic understanding technology is still under research and does not yet meet commercial requirements, and human language is highly variable: the same meaning can be expressed in many ways and is hard to match with fixed sentence patterns. As a result, the machine's answers are often off the point and manual intervention is needed;

3. Customer questions are endlessly varied and hard to anticipate, so a fixed answer library can hardly cope;

Furthermore, expert telephone consultation systems, intelligent response systems for decision-making and command, and intelligent response systems for power dispatching are also response devices or systems. They work on the same principle as customer response systems and face the same problems. Smart speakers do not use a telephone, but they too respond by voice, and their responses are often unsatisfactory. In addition, customer service systems that respond in text, such as e-commerce customer service, also face problems 2 and 3.

Disclosure of Invention

The first object of the present invention is to overcome the drawbacks and deficiencies of the prior art and to provide an online learning voice recognition response device that accurately recognizes speech as text, responds automatically, and learns online from the answers of human customer service agents so as to continuously supplement the existing answer library.

A second object of the present invention is to provide a voice-to-text method for the online learning voice recognition response device.

A third object of the present invention is to provide an automatic text answering method with online learning for the online learning voice recognition response device.

A fourth object of the present invention is to provide a method, for the online learning voice recognition response device, of extracting words from sentences in order to search and to rank the results.

For consistency, the terms used in the invention are defined as follows. A word refers to a Chinese word; all abbreviations and alternative names of a word are stored together with it and treated as the same word. Local special vocabulary refers to words used only on the local machine, in a local area network, or in a specific region, group, or department. Local special vocabulary and professional terms are collectively called special vocabulary, and all other words are called general vocabulary. Word frequency is the frequency with which a word occurs. Difference frequency is the difference between a word's two frequencies, its special word frequency and its general word frequency. Matching means calculating the similarity between a part of the pinyin string A and the correct pinyin of a Chinese word or character; in the invention this is also referred to, for short, as matching pinyin with a word or character.

The first object of the invention is achieved by the following technical scheme: an online learning speech recognition response device comprising: the system comprises a voice-to-text module, a response generating unit and a voice synthesizing unit;

The voice-to-text module recognizes the digital voice signal of the questioner as the corresponding text and outputs it to the response generating unit; the questioner is the person who asks the question. The voice-to-text module also recognizes the digital voice signal of the respondent as the corresponding text and outputs it to the response generating unit; the respondent is the person who answers the question. The voice-to-text module can convert these two voice sources in a time-sharing mode;

The response generating unit performs a query based on the text converted from the questioner's speech and generates the corresponding response text; it can also use the text generated from the respondent's speech to create a new answer for later queries;

the voice synthesis unit synthesizes voice signals according to the words output by the response generation unit and outputs the voice signals to the sounding device to realize machine voice response.

Preferably, the on-line learning voice recognition response device further includes: telephone monitor, sound card; the voice-to-text module comprises two subunits which can work independently and simultaneously: the first voice-to-text unit and the second voice-to-text unit;

The first voice-to-text unit recognizes the digital voice signal of the respondent as the corresponding text and outputs the text to the response generating unit;

The second voice-to-text unit recognizes the digital voice signal of the questioner as the corresponding text and outputs the text to the response generating unit;

The telephone monitor and the telephone of the respondent (for example, a dispatcher) are connected in parallel to the same telephone line; the monitor obtains two analog voice signals, one for the respondent and one for the questioner, and outputs them to the first Line in interface and the second Line in interface of the sound card, respectively;

The sound card comprises a first Line in interface and a second Line in interface; the two analog voice signals of the respondent and the questioner are converted into two digital signals by the analog/digital circuits of the sound card and output to the first voice-to-text unit and the second voice-to-text unit, respectively.

Preferably, the sound card, the first voice-to-text unit, the second voice-to-text unit, the response generating unit, and the voice synthesis unit are all built into the same computer, and the first and second voice-to-text units run in parallel on two cores of the computer's CPU.

Preferably, the on-line learning voice recognition response device further includes:

The difference frequency special word library unit stores graded special vocabulary and its pinyin for the voice-to-text units to query, improving the accuracy with which special vocabulary is matched. The level of a word is determined by the difference between its two frequencies: the higher its frequency in special data and the lower its frequency in general data, the higher its level. A word refers to a Chinese word; all abbreviations of a word are stored together with it and counted as the same word. Special vocabulary comprises local special vocabulary and professional terms; local special vocabulary refers to words used only on the local machine, in a local area network, or in a specific region, group, or department. Special vocabulary of the same level is stored in the same sub-library; the highest sub-library is the first-level sub-library, followed in order by the second level and so on down to the lowest sub-library. The words stored in the difference frequency special word library unit are called difference frequency special vocabulary, or difference frequency vocabulary for short;

The subject word sharing unit extracts subject words from the existing dialogue text of the questioner and the respondent and provides them to the first and second voice-to-text units for querying, so as to improve the recognition rate of the subsequent dialogue. It comprises the following modules:

Subject word determination module: counts the words repeated in the preceding text and the number of repetitions; if a repeated word is a difference frequency special word, it is added to the subject word queue, otherwise it is discarded. The preceding text is the text obtained by the first and second voice-to-text units from the dialogue speech so far;

Subject word queue ordering module: assume that, from the start of the current speech recognition session up to the sentence currently to be recognized, n dialogue sentences have been recognized as n text sentences, so that the sentence currently to be recognized is the (n+1)-th. The topic value of a repeated word is:

where i and j indicate that the word appears in the i-th and j-th sentences, the ellipsis stands for the other text sentences in which the word is repeated, i and j are less than n, and G is the level of the sub-library of the difference frequency special word library to which the word belongs, taken as an integer. The topic values of all subject words in the first n text sentences are calculated, and the subject word queue is obtained by sorting them by topic value from largest to smallest.

Preferably, the difference frequency special word library unit comprises first-, second-, third-, and fourth-level sub-library modules, which store the first-, second-, third-, and fourth-level difference frequency vocabulary and its difference frequency values; within a sub-library, words with higher difference frequency values are placed nearer the front of the queue;

the vocabulary and difference frequency values in the first, second, third and fourth-level sub-library modules are obtained and updated by a construction unit, and the construction unit comprises:

The text data acquisition module acquires text data, including local professional documents, call texts, chat texts, and keyboard input history, and searches the network for professional articles. The call texts are produced by the first and second voice-to-text units and supplied continuously to the text data acquisition module;

The special word frequency dictionary module cleans and segments the collected text data to obtain a vocabulary list, then computes and stores the special word frequency of each word in the list, where special word frequency = number of occurrences of the word × word length / total word count of all the data;

The general word frequency dictionary module segments news data, including the People's Daily corpus and the Sina, Sohu, and NetEase websites, to obtain a vocabulary list, then computes and stores the general word frequency of each word in the list, where general word frequency = number of occurrences of the word × word length / total word count of all the data;

The difference frequency operation module performs the difference frequency operation on each word in the special word frequency dictionary:

difference frequency value = special word frequency of the word − k × general word frequency of the word, where k is a fixed coefficient;
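As an illustration, the two frequency formulas and the difference frequency operation can be sketched in Python as follows. This is a minimal sketch, not the implementation of the invention: the function names, the sample tokens, and the value k = 1.0 are hypothetical, the "total word count" is read here as the total number of characters in the segmented data, and a word missing from the general dictionary is given a general frequency of 0. All three readings are assumptions.

```python
from collections import Counter


def word_frequency(tokens):
    """Frequency of each word: occurrences x word length / total size.

    Assumption: the total size is the character count of all tokens.
    """
    total_chars = sum(len(t) for t in tokens)
    counts = Counter(tokens)
    return {w: c * len(w) / total_chars for w, c in counts.items()}


def difference_frequency(special_freq, general_freq, k=1.0):
    """Difference frequency = special frequency - k * general frequency.

    k is the fixed coefficient; 1.0 is only a placeholder value.
    Assumption: words absent from the general dictionary count as 0.
    """
    return {w: f - k * general_freq.get(w, 0.0) for w, f in special_freq.items()}


# Hypothetical, already-segmented sample data.
special_tokens = ["主变", "跳闸", "主变", "检修"]
general_tokens = ["检修", "今天", "天气", "今天"]

special = word_frequency(special_tokens)
general = word_frequency(general_tokens)
diff = difference_frequency(special, general)
```

With this sample data, a word that is frequent only in the special data ("主变") keeps a difference frequency value of 0.5, while a word that is equally frequent in both data sets ("检修") drops to 0.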

The difference frequency distribution module stores the top 25% of words ranked by difference frequency value in the first-level sub-library module, the words ranked from 26% to 50% in the second-level sub-library module, the words ranked from 51% to 75% in the third-level sub-library module, and the remaining words whose difference frequency value is greater than 0 in the fourth-level sub-library module; words whose difference frequency value is less than or equal to 0 are not stored.

Preferably, the first voice-to-text unit and the second voice-to-text unit are identical, and each comprises the following modules:

Level priority matching module: after speech is converted to pinyin, a pinyin string composed of letters and tones is obtained and named A. When A is converted to text, it is matched first against the pinyin stored in the first-level sub-library module of the difference frequency special word library unit; where matching succeeds, that part of the pinyin of A is converted to text. The next level is considered only once no further match succeeds, and so on down to the last-level sub-library module;

Frequency priority matching module: after the level priority matching module has finished, the remaining pinyin of A is matched against the pinyin of the general vocabulary, with non-special words that have a high frequency in general data matched first; finally, any pinyin still remaining is matched against the pinyin of single Chinese characters;

Subject word matching module: before the level priority matching module runs, A is matched against the subject word queue, starting from the first subject word in the queue; where matching succeeds, that part of the pinyin of A is converted to text. The next subject word is considered only once no further match succeeds, and so on down to the last subject word in the queue;

wherein, the matching is realized by the following two modules, including:

Phoneme edit distance calculation module: the phoneme edit distance between two pinyin strings is the minimum number of phoneme edit operations needed to convert one into the other, where a phoneme is an initial or a final of pinyin. The permitted edit operations are inserting an initial or final, deleting an initial or final, and replacing one initial or final with another; a replacement between fuzzy sounds counts as only 0.5 operations. Tones are not considered in these operations;

Judgment output module: if the word being matched is a special word, the module outputs the phoneme edit distance and a match success signal when the distance is smaller than a given threshold, and a match failure signal otherwise; if the word being matched is a general word, it outputs the phoneme edit distance;
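The phoneme edit distance and the judgment rule lend themselves to a short sketch. The version below is a standard weighted edit distance over lists of initials and finals with tones already removed. The fuzzy-sound pairs, the threshold value, and all function names are assumptions chosen for illustration, since the text does not list them.

```python
# Assumed fuzzy-sound pairs; the source does not enumerate them.
FUZZY_PAIRS = {
    frozenset(pair)
    for pair in [("z", "zh"), ("c", "ch"), ("s", "sh"), ("n", "l"),
                 ("an", "ang"), ("en", "eng"), ("in", "ing")]
}


def substitution_cost(a, b):
    """0 for identical phonemes, 0.5 for a fuzzy pair, 1 otherwise."""
    if a == b:
        return 0.0
    if frozenset((a, b)) in FUZZY_PAIRS:
        return 0.5
    return 1.0


def phoneme_edit_distance(x, y):
    """Weighted edit distance between two toneless phoneme lists."""
    prev = [float(j) for j in range(len(y) + 1)]
    for i, a in enumerate(x, 1):
        cur = [float(i)]
        for j, b in enumerate(y, 1):
            cur.append(min(prev[j] + 1.0,                         # delete a
                           cur[j - 1] + 1.0,                      # insert b
                           prev[j - 1] + substitution_cost(a, b)))
        prev = cur
    return prev[-1]


def judge(distance, is_special, threshold=1.0):
    """Return (distance, success); success is None for general words.

    The threshold of 1.0 is a placeholder, not a value from the source.
    """
    if is_special:
        return distance, distance < threshold
    return distance, None
```

For example, `phoneme_edit_distance(["zh", "u", "b", "ian"], ["z", "u", "b", "ian"])` differs only by the assumed fuzzy pair zh/z and therefore evaluates to 0.5, which `judge` accepts for a special word under the placeholder threshold.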

wherein, the level priority matching module comprises:

Reverse word taking module: takes the pinyin of the word with the highest difference frequency value among the words in the first-level sub-library module that have not yet been matched, and names it B; once all the words in the first-level sub-library module have been matched, it moves on to the next-level sub-library module;

Arbitrary position pinyin conversion module: searches A for a substring C similar to B; if B and C match successfully, C is converted into the corresponding Chinese word. If A contains several substrings similar to B, the operation is repeated for each of them. The substring C may be located at any position in A.
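A minimal sketch of level priority matching with reverse word taking follows. It assumes that A is already split into toneless syllables, that each sub-library is a list of (word, syllables) pairs sorted by descending difference frequency value, and that similarity is a simple count of differing syllables in a window of equal length. The source uses the phoneme edit distance instead, so this comparison is only a stand-in, and all names and sample entries are hypothetical.

```python
def mismatches(window, pinyin):
    """Stand-in similarity: number of positions where syllables differ."""
    return sum(1 for a, b in zip(window, pinyin) if a != b)


def level_priority_match(syllables, tiers, threshold=1):
    """Replace matched syllable spans with words, tier by tier.

    tiers: sub-libraries from first level to last; each is a list of
    (word, syllables) pairs, highest difference frequency value first.
    Unmatched syllables are returned unchanged for the later stages.
    """
    out = list(syllables)
    taken = [False] * len(syllables)
    for tier in tiers:                       # first-level sub-library first
        for word, pinyin in tier:            # B: next unmatched word
            n = len(pinyin)
            start = 0
            while n > 0 and start + n <= len(syllables):
                window = syllables[start:start + n]   # candidate substring C
                free = not any(taken[start:start + n])
                if free and mismatches(window, pinyin) < threshold:
                    out[start] = word
                    for i in range(start + 1, start + n):
                        out[i] = None
                    for i in range(start, start + n):
                        taken[i] = True
                    start += n
                else:
                    start += 1
    return [item for item in out if item is not None]


# Hypothetical sentence and two-level library.
sentence = ["qing", "jian", "cha", "zhu", "bian", "wen", "du"]
library = [[("主变", ["zhu", "bian"])], [("温度", ["wen", "du"])]]
result = level_priority_match(sentence, library)
```

Here `result` keeps the three leading syllables as pinyin and replaces the two matched spans with "主变" and "温度". The leftover pinyin would then go to frequency priority matching.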

Preferably, the response generating unit includes:

Query module: its input is the output of the second voice-to-text unit, that is, the text sentence of the questioner's question, denoted A2. All the words in A2 are output as a vocabulary set A2S to the question-answer library module for querying. The words comprise subject words, difference frequency words, and general words, and A2S is obtained by the voice-to-text module;

Coincidence degree calculation module: let B2 be the text sentence of a question stored in the question-answer library and B2S its vocabulary set. If the coincidence degree of B2S and A2S is greater than a set threshold, the answer text sentence corresponding to B2 in the question-answer library is stored in the answer sequence. In this way several answers may be stored in the answer sequence by the time the search of the question-answer library is complete;

Sorting module: the answers in the answer sequence are arranged by coincidence degree from largest to smallest, and the first answer in the sequence is output to the voice synthesis unit to be synthesized into speech; if the answer sequence is empty, a signal requesting the respondent's intervention is output;

Question-answer library: stores the questions asked by questioners, the answers to those questions, and the vocabulary sets of the questions and answers, for use by the query module;

Online learning module: its input is the text sentence of the respondent's answer. It stores the answer and its vocabulary set in the question-answer library as an answer, together with the text of the corresponding question and its vocabulary set. The online learning module is started only when a question is answered manually;

Preferably, the coincidence degree is calculated as follows:

Suppose the vocabulary sets of the two text sentences have p subject words in common, ordered from high to low by topic value: subject word 1, subject word 2 … subject word p; they have r difference frequency words in common, ordered from high to low by difference frequency value: difference frequency word 1, difference frequency word 2 … difference frequency word r; and they have j general words in common. Then:

coincidence degree of the two vocabulary sets = T1 + T2 + … + Tp + Q1 + Q2 + … + Qr + U1 + U2 + … + Uj;

where T1, T2 … Tp, Q1, Q2 … Qr, U1, U2 … Uj are preset weight coefficients that satisfy the following condition:

T1 ≥ T2 ≥ … ≥ Tp ≥ Q1 ≥ Q2 ≥ … ≥ Qr ≥ U, where U denotes any one of U1, U2 … Uj.
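The coincidence degree can be illustrated with a short Python sketch. As the formula is written, the sum depends only on the counts p, r, and j, because the weights are assigned by rank and every shared word is counted. The sketch assumes that a shared word already counted as a subject word is not counted again as a difference frequency word; the function name, the weight values, and the sample words are hypothetical.

```python
def coincidence_degree(a_words, b_words, topic_words, diff_words, T, Q, U):
    """Sum the preset weights for the words shared by two vocabulary sets.

    T, Q, U are weight lists for subject words, difference frequency
    words, and general words. Each list must be at least as long as the
    number of shared words of its kind, otherwise slicing truncates it.
    """
    shared = set(a_words) & set(b_words)
    p = sum(1 for w in shared if w in topic_words)
    # Assumption: a subject word is counted once, as a subject word.
    r = sum(1 for w in shared if w not in topic_words and w in diff_words)
    j = len(shared) - p - r
    return sum(T[:p]) + sum(Q[:r]) + sum(U[:j])


# Hypothetical weights with T1 >= ... >= Tp >= Q1 >= ... >= Qr >= U.
T = [5, 4, 3]
Q = [3, 2, 1]
U = [1, 1, 1, 1]

a2s = {"主变", "温度", "检查", "请"}
b2s = {"主变", "温度", "检查", "今天"}
score = coincidence_degree(a2s, b2s, {"主变"}, {"主变", "温度"}, T, Q, U)
```

The two sample sets share one subject word, one further difference frequency word, and one general word, so `score` is T1 + Q1 + U1 = 9 with these placeholder weights.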

The second object of the invention is achieved by the following technical scheme: a voice-to-text method for the online learning voice recognition response device, comprising the following steps:

S1, converting speech to pinyin: analyzing and recognizing the digitized voice signal to obtain the pinyin A of the whole sentence corresponding to the speech;

S2, performing subject word matching on A;

S3, performing level priority matching on the remaining pinyin of A;

S4, performing frequency priority matching on the remaining pinyin of A;

S5, matching the remaining pinyin against single Chinese characters to obtain the text of the whole sentence;

S6, outputting the text of the whole sentence; the words obtained by the matching in S2, S3, S4, and S5 are output by category to the subject word sharing unit, the difference frequency special word library, and the general word frequency dictionary to refresh the subject word queue, the difference frequency values and their ordering, and the word frequencies, and are at the same time output as a set to the response generating unit.

The third object of the invention is achieved by the following technical scheme: an automatic text answering method with online learning for the online learning voice recognition response device, used to automatically answer a questioner's text questions on a network and to automatically learn the respondent's answers, comprising the following steps:

Extracting words: let the input be a text sentence A2; A2 is segmented into words using the subject word sharing unit, the difference frequency vocabulary library, and the general vocabulary library to obtain a vocabulary set A2S. The vocabulary set comprises subject words, difference frequency words, and general words. The questioner is the person who asks the question, for example a customer;

Querying: A2S is output to the question-answer library for querying;

Calculating the coincidence degree: let B2 be the text sentence of a question stored in the question-answer library and B2S its vocabulary set. If the coincidence degree of B2S and A2S is greater than a set threshold, the answer text corresponding to B2 in the question-answer library is stored in the answer sequence. The querying step may thus yield several answers, which are stored in the answer sequence until the search of the question-answer library is complete;

Sorting: the answers in the answer sequence are arranged by coincidence degree from largest to smallest, and the first answer in the sequence is output to the questioner; if the answer sequence is empty, a signal requesting the respondent's intervention is output. The respondent is the person who answers the question, for example a human customer service agent;

Question-answer library: stores the questions asked by questioners, the answers to those questions, and the vocabulary sets of the questions and answers, for use in the querying step;

Online learning: the input is the text sentence of the respondent's answer. The answer and its vocabulary set are stored in the question-answer library as an answer, together with the text of the corresponding question and its vocabulary set. The online learning step is started only when a question is answered manually;

The length of the answer sequence is 1, that is, only the answer with the largest coincidence degree is retained: when the coincidence degree of a new answer is greater than that of the answer retained in the sequence, the new answer replaces the old one; otherwise the sequence is unchanged.
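The querying, sorting, and online learning steps, with an answer sequence of length 1, can be sketched as a small class. This is an illustration under stated assumptions: the class and method names are hypothetical, the scoring function is passed in and is here a plain count of shared words rather than the weighted coincidence degree, and only the question's vocabulary set and the answer text are stored, whereas the method also stores the question text and the answer's vocabulary set.

```python
class QuestionAnswerLibrary:
    """Toy question-answer library with threshold query and online learning."""

    def __init__(self, score, threshold):
        self.entries = []          # list of (question word set, answer text)
        self.score = score         # coincidence degree function (assumed)
        self.threshold = threshold

    def answer(self, question_words):
        """Return the best stored answer, or None to request intervention.

        Only one candidate is retained, mirroring an answer sequence of
        length 1: a candidate replaces the retained one only when its
        score is strictly greater.
        """
        best_answer, best_score = None, self.threshold
        for stored_words, stored_answer in self.entries:
            s = self.score(question_words, stored_words)
            if s > best_score:
                best_answer, best_score = stored_answer, s
        return best_answer

    def learn(self, question_words, answer_text):
        """Store a manually given answer with its question's words."""
        self.entries.append((frozenset(question_words), answer_text))


def shared_word_count(a, b):
    """Placeholder score: number of words the two sets have in common."""
    return len(set(a) & set(b))


library = QuestionAnswerLibrary(shared_word_count, threshold=1)
first_try = library.answer({"主变", "温度"})        # nothing learned yet
library.learn({"主变", "温度", "多少"}, "answer-1")  # respondent steps in
second_try = library.answer({"主变", "温度"})
```

In this run `first_try` is `None`, which stands for the signal requesting the respondent's intervention, and after one learned answer `second_try` is `"answer-1"`.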

The fourth object of the invention is achieved by the following technical scheme: a method, for the online learning voice recognition response device, of extracting words from sentences in order to search and to rank the results, comprising:

Extracting words: let the input be a text sentence A2; A2 is segmented into words using the subject word sharing unit, the difference frequency vocabulary library, and the general vocabulary library to obtain a vocabulary set A2S, which comprises difference frequency words and general words;

Searching: the network or a local database is searched with A2S to obtain several results C1, C2, … Ci, … Cm, where Ci denotes the i-th result;

Calculating the coincidence degree: suppose Ci contains r of the difference frequency words in A2S, ordered from high to low by difference frequency value: difference frequency word 1, difference frequency word 2 … difference frequency word r; and suppose Ci also contains j of the general words in A2S. Then:

coincidence degree of Ci = Q1 + Q2 + … + Qr + U1 + U2 + … + Uj, where Q1, Q2 … Qr, U1, U2 … Uj are preset coefficients;

Sorting the results: after the coincidence degrees of C1, C2, … Ci, … Cm have been calculated, the results are reordered by coincidence degree from high to low and output;

The preset coefficients Q1, Q2 … Qr, U1, U2 … Uj satisfy the following condition: Q1 ≥ Q2 ≥ … ≥ Qr ≥ U, where U denotes any one of U1, U2 … Uj;
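The result ranking step can be sketched in a few lines of Python. The sketch treats each result Ci as a plain string and tests containment with a substring check, which is an assumption; the function name, the coefficient values, and the sample results are hypothetical.

```python
def rerank(results, diff_words, general_words, Q, U):
    """Sort search results by coincidence degree, highest first.

    Q and U are the preset coefficient lists; each must be at least as
    long as the number of words of its kind, or slicing truncates it.
    """
    def degree(text):
        r = sum(1 for w in diff_words if w in text)       # difference frequency words found
        j = sum(1 for w in general_words if w in text)    # general words found
        return sum(Q[:r]) + sum(U[:j])

    return sorted(results, key=degree, reverse=True)


# Hypothetical search results and coefficients.
results = ["天气 预报", "主变 温度 检查", "主变 检修"]
ranked = rerank(results, ["主变", "温度"], ["检查"], Q=[3, 2], U=[1])
```

With these placeholder coefficients the three results score 0, 6, and 3 respectively, so the second result moves to the front.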

the word segmentation includes:

Level priority comparison: given a text sentence A2, A2 is compared first against the words stored in the first-level sub-library module of the difference frequency special word library unit. Where the comparison succeeds, that part of the text of A2 is split off as a word and stored in a vocabulary set named A2S. The next level is considered only once no further comparison succeeds, and so on down to the last-level sub-library module. The words in A2S are ordered by the time at which they were split off. Comparison means calculating the character similarity between a part of the text of A2 and a Chinese word in the vocabulary library;

Frequency priority comparison: after the level priority comparison is complete, the remaining text of A2 is compared against the general vocabulary, with non-special words that have a high frequency in general data compared first; the characters still remaining are stored in A2S;

the comparison includes:

Reverse word taking: the word with the highest difference frequency value among the words in the first-level sub-library module that have not yet been compared is taken and named D; once all the words in the first-level sub-library module have been compared, the process moves on to the next-level sub-library module;

Arbitrary position segmentation: A2 is searched for a character string E similar to D; if E and D compare successfully, E is split off as the corresponding Chinese word. If A2 contains several substrings similar to D, the operation is repeated for each of them. E may be located at any position in A2.
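The word segmentation described above can be sketched as follows. The sketch replaces the character similarity comparison with exact substring matching, which is a simplification, and returns the words in the order in which they were split off. The function name and the sample libraries are hypothetical.

```python
def segment(text, tiers, general_words):
    """Split text into a word list by level priority, then frequency priority.

    tiers: sub-libraries from first level to last, each a list of words
    sorted by descending difference frequency value.
    general_words: general vocabulary sorted by descending frequency.
    """
    words = []                      # A2S, in order of division
    taken = [False] * len(text)

    def claim(word):
        # Arbitrary position segmentation: try every occurrence of the word.
        if not word:
            return
        start = text.find(word)
        while start != -1:
            end = start + len(word)
            if not any(taken[start:end]):
                words.append(word)
                taken[start:end] = [True] * len(word)
            start = text.find(word, start + 1)

    for tier in tiers:              # level priority comparison
        for word in tier:           # reverse word taking: highest value first
            claim(word)
    for word in general_words:      # frequency priority comparison
        claim(word)
    # Characters that matched nothing are stored individually.
    words.extend(ch for i, ch in enumerate(text) if not taken[i])
    return words


a2s = segment("请检查主变温度", [["主变"], ["温度"]], ["检查"])
```

For the sample sentence, the two special words are split off first, then the general word, and the single leftover character is stored last.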

Compared with the prior art, the invention has the following advantages and beneficial effects:

1. The two speech recognition functions required by an automatic telephone customer service response system, namely recognition of the customer's speech and recognition of the human agent's speech, can both be implemented on an ordinary desktop computer. Multiple speech recognition devices are not needed, which saves cost.

2. The invention automatically distinguishes general vocabulary from special vocabulary, in particular local special vocabulary, so each regional department does not need to build a database by hand. Special vocabulary is stored in a graded difference frequency special word library that is continuously refreshed, updated, and replaced, saving customer service staff a great deal of time and effort.

3. Level priority matching gives priority to special vocabulary, which reduces the errors that existing speech recognition methods make by prioritizing popular general vocabulary and improves recognition accuracy.

4. The invention learns the answers of human customer service agents online and continuously supplements the existing answer library as it learns, so it copes better with the variety of customer questions.

5. The coincidence degree method distinguishes the importance of each word in a sentence and matches answers more accurately.

Drawings

Fig. 1 is a block diagram of a structure of an online learning voice recognition response device.

Fig. 2 is a flow chart of an automatic text answering method for online learning.

Fig. 3 is a flow chart of a voice-to-text process.

Detailed Description

Besides automatic customer service response systems, the device of the invention can also be applied to expert consultation systems, intelligent power dispatching systems, telephone decision support systems, and remote disease diagnosis systems. In all of these, one party mainly asks questions or reports specific on-site conditions in order to obtain countermeasures, and the other party mainly answers questions or gives decisions; examples are a call between a power dispatcher and an on-site maintenance operator, a call between a commander at a decision center and an on-site operator, and a remote diagnostic conversation between a doctor and a patient. For generality, the two parties are hereinafter called the questioner and the respondent. The respondent's reply is not necessarily a final answer; it may offer further choices or questions that guide the questioner to state the problem more clearly. A text sentence refers to a piece of text of unlimited length.

The present invention will be described in further detail with reference to examples and drawings, but embodiments of the present invention are not limited thereto.

Referring to fig. 1, the present embodiment discloses an online learning voice recognition response device, which includes:

a telephone monitor, a first impedance matching device K1, a second impedance matching device K2, a sound card, a first voice-to-text unit M1, a second voice-to-text unit M2, a difference frequency special word library unit, a construction unit, a subject word sharing unit, a response generating unit, and a voice synthesis unit.

The telephone monitor and the telephone are connected in parallel with the same telephone line, 2 paths of analog voice signals for answering a call of a person (such as a person in a customer service) and a person in question (such as a customer) are obtained, and the analog voice signals are respectively and correspondingly output to the first impedance matching device K1 and the second impedance matching device K2. The telephone monitor does not influence manual call receiving and making, and is convenient for manual customer service to intervene at any time.

The first impedance matching device K1 and the second impedance matching device K2 can adjust the impedance, so that the intensity of an input analog voice signal changes to adapt to the signal intensity requirement of a Line in interface of the sound card, and the first impedance matching device K1 and the second impedance matching device K2 respectively output to the first Line in interface 1 and the second Line in interface 2 of the sound card correspondingly. Of course, if the strength of the analog voice signal is just within the range of the sound card, no impedance matching means may be used.

In fig. 1, the sound card includes a first Line in interface 1 and a second Line in interface 2, and the 2 input interfaces respectively receive 2 analog signals of the voice of the answering person and the voice of the questioner, and convert the analog signals into 2 digital voice signals through 2 analog/digital circuits of the sound card, wherein the digital signals of the voice of the questioner are output to a second voice-to-text unit M2, and the output of the answering person is output to a first voice-to-text unit M1.

In fig. 1, the first voice-to-text unit M1 receives the digital voice signal of the responder, recognizes it as the corresponding text, and outputs the text. The text is used as input to the construction unit for updating the difference frequency vocabulary and difference frequency values; it is also used as input to the subject word sharing unit for extracting the subject words of the telephone call between the responder and the questioner.

In fig. 1, the second voice-to-text unit M2 receives the digital voice signal of the questioner, recognizes it as the corresponding text, and outputs the text. The text is used as input to the construction unit for updating the difference frequency vocabulary and difference frequency values; it is also used as input to the subject word sharing unit for extracting the subject words of the telephone call between the responder and the questioner. Because M1 and M2 share the difference frequency special word stock unit and the subject word sharing unit, the M1 and M2 processes can work independently while also interacting with and complementing each other, which improves recognition accuracy. Details of M1 and M2 are described further below.

The sound card, the first voice-to-text unit M1, the second voice-to-text unit M2, the response generating unit and the voice synthesizing unit are all built into the same computer, and the first voice-to-text unit M1 and the second voice-to-text unit M2 run in parallel on two separate cores of the computer's CPU.

Special vocabulary is more important than common vocabulary in a telephone answering device, so its recognition rate is guaranteed first: a special vocabulary library is established, and when the matching similarities of the candidates do not differ much, higher-level special vocabulary is matched preferentially.

The difference frequency special word stock unit stores classified special vocabulary and its pinyin for query by the two voice-to-text units, thereby improving the matching accuracy of special vocabulary. The level of a vocabulary item is determined by its difference frequency value. Vocabulary here refers to Chinese vocabulary; all abbreviations and alternative names of an item are stored together with it and regarded as the same item. Special vocabulary comprises local special vocabulary and professional terms, where local special vocabulary refers to vocabulary used only on the local machine, in a local area network, or in a specific region, group or department. Special vocabulary of the same level is stored in the same sub-library: the highest is the first-level sub-library, followed in turn by the second-, third- and fourth-level sub-libraries, which store the first-, second-, third- and fourth-level difference frequency vocabulary and its difference frequency values respectively. Within a sub-library, vocabulary with a higher difference frequency value is queued nearer the front.

In addition, the device can automatically build the difference frequency special word stock through a program. To distinguish special vocabulary from ordinary vocabulary automatically, the difference between them must be exploited. Consider the customer question "What functions does the green flagship version laser rangefinder have?" Here "green flagship version" is local special vocabulary: it generally does not appear in ordinary news or articles, but may appear in local files, local browser records, local chat records, local keyboard input records, local storage device records, local call text records and the like. General vocabulary such as "function" appears frequently in ordinary articles or web texts, and the professional term "laser rangefinder" may appear in local texts, academic articles and news reports. This patent therefore proposes that the level of a vocabulary item be determined by the difference between the two frequencies: the higher its frequency of occurrence in the special data, the higher its level, and the higher its frequency of occurrence in the general data, the lower its level.

The construction unit is used for automatically constructing a word bank special for the difference frequency and updating the vocabulary and the difference frequency value in the word bank special for the difference frequency, and comprises the following steps:

1) The text data acquisition module is used for acquiring text data, including local files, local browser records, local chat records, local keyboard input records, local storage device records, local call text records and the like, and for searching professional academic articles on the network; the first voice-to-text unit M1 and the second voice-to-text unit M2 also continuously provide call text to the text data acquisition module;

2) The special word frequency dictionary module is used for cleaning and word segmentation operation on the collected text data to obtain a vocabulary list, and then carrying out special word frequency statistics on the vocabulary list and storing the vocabulary list; wherein, the special word frequency=the number of times the word is repeated×the word length/total number of words of the whole data;

3) The universal word frequency dictionary module is used for performing word segmentation on the People's Daily news corpus and on news from three websites, Sina, Sohu and NetEase, to obtain a vocabulary list, and then computing and storing universal word frequency statistics for the vocabulary list, wherein universal word frequency = the number of times the word is repeated × the word length / the total number of words of all data;

4) The difference frequency operation module is used for performing difference frequency operation on each vocabulary of the special word frequency dictionary, wherein the difference frequency operation is as follows:

5) The difference frequency distribution module is used for storing the top-ranked 25% of vocabulary by difference frequency value into the first-level sub-library module, the 26% to 50% of vocabulary into the second-level sub-library module, the 51% to 75% of vocabulary into the third-level sub-library module, and the other vocabulary with a difference frequency value greater than 0 into the fourth-level sub-library module; vocabulary with a difference frequency value less than or equal to 0 is removed.
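The construction steps above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the patented implementation: the difference frequency formula is not reproduced in this text, so the code assumes it is the special word frequency minus the universal word frequency; it also assumes that the total is counted in segmented tokens and that the 25% bands are taken over every word of the special dictionary. The function names and the toy corpora are hypothetical.

```python
from collections import Counter


def word_frequencies(words):
    """Frequency as stated in the text: repeat count x word length / total word count."""
    total = len(words)  # ASSUMPTION: the total is counted in segmented tokens
    return {w: c * len(w) / total for w, c in Counter(words).items()}


def build_lexicon(special_words, general_words):
    """Return {word: (level, difference value)} for words with a positive value."""
    special = word_frequencies(special_words)
    general = word_frequencies(general_words)
    # ASSUMPTION: difference value = special frequency - universal frequency.
    diff = {w: f - general.get(w, 0.0) for w, f in special.items()}
    ranked = sorted(diff, key=diff.get, reverse=True)
    n = len(ranked)
    lexicon = {}
    for rank, word in enumerate(ranked):
        if diff[word] <= 0:
            continue  # values less than or equal to 0 are removed
        # Percentile position (rank + 1) / n mapped to levels 1..4 in 25% bands.
        level = (4 * (rank + 1) + n - 1) // n
        lexicon[word] = (level, diff[word])
    return lexicon


# Toy data: English tokens stand in for segmented Chinese words.
special_corpus = ["flagship"] * 3 + ["rangefinder"] * 2 + ["function"] * 2 + ["what"]
general_corpus = ["function"] * 5 + ["what"] * 4 + ["rangefinder"]
lexicon = build_lexicon(special_corpus, general_corpus)
```

With this toy data, "flagship" appears only in the special corpus and lands in the first level, while "function" and "what" are more frequent in the general corpus and are dropped.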

When the customer and customer service personnel carry out voice dialogue communication, larger background noise is often generated, so that the accuracy of voice recognition is seriously reduced. In noisy environments, some words may not be well heard, and people often can guess some words which are not well heard from the context of the conversation, but current speech recognition algorithms only consider recognizing single-sentence speech and cannot utilize consistent topic semantics in the conversation context, which is also a weakness of current speech recognition algorithms. A preferred scheme is to add subject word matching before level priority matching, so that the topic of the dialogue is defined, and the recognition rate of the whole dialogue can be improved.

The subject word sharing unit in fig. 1 is configured to extract subject words in the existing dialogue text of a questioner (client) and a responder (customer service person), and provide the extracted subject words to the first voice-to-text unit M1 and the second voice-to-text unit M2 for query, so as to improve the recognition rate of subsequent dialogs, and includes the following modules:

1) The subject word determination module: counting the repeated vocabulary in the preceding text and the number of repetitions; if a repeated vocabulary item is difference frequency special vocabulary, it is added to the subject word queue, otherwise it is discarded, wherein the preceding text is the text obtained by the first voice-to-text unit M1 and the second voice-to-text unit M2 from the dialogue speech so far;

2) The subject word queue ordering module: if n dialogue sentences have been recognized from the start of the current speech recognition up to the sentence currently to be recognized, so that the sentence currently to be recognized is the (n+1)-th sentence, the topic value of a repeated vocabulary item is as follows:

wherein i and j indicate that the vocabulary item is repeated in the i-th and j-th sentences, the ellipsis represents other sentences in which it is repeated, i and j are less than n, and G is the level of the sub-library of the difference frequency special word stock to which the vocabulary item belongs, an integer from 1 to 4. The topic values of all subject words in the first n sentences are calculated, and the subject words are queued from large to small by topic value to obtain the subject word queue;
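As a rough illustration of the two modules above, the following Python sketch collects the repeated difference frequency vocabulary and orders it into a queue. The topic value formula is not reproduced in this text, so the sketch takes it as a caller-supplied function of the sentence indices, the level G and n; the scoring function used in the example is purely a placeholder assumption, and all names are hypothetical.

```python
def build_subject_queue(sentences, lexicon_level, topic_value):
    """Order repeated difference frequency words by topic value, largest first.

    sentences: word lists of the n sentences recognized so far.
    lexicon_level: {word: G}, with G in 1..4, for difference frequency words.
    topic_value: callable(indices, G, n) -> float; the actual formula is not
    available here, so the caller must supply it.
    """
    n = len(sentences)
    occurrences = {}
    for index, words in enumerate(sentences, start=1):
        for word in set(words):
            occurrences.setdefault(word, []).append(index)
    scored = []
    for word, indices in occurrences.items():
        # Keep only repeated words that are difference frequency special words.
        if len(indices) >= 2 and word in lexicon_level:
            scored.append((topic_value(indices, lexicon_level[word], n), word))
    scored.sort(key=lambda pair: -pair[0])
    return [word for _, word in scored]


# PLACEHOLDER scoring, not the patent's formula: more repeats and a higher
# level (smaller G) give a larger value.
def placeholder_topic_value(indices, level, n):
    return len(indices) / level


dialogue = [
    ["flagship", "rangefinder", "function"],
    ["flagship", "price"],
    ["rangefinder", "flagship", "function"],
]
queue = build_subject_queue(dialogue, {"flagship": 1, "rangefinder": 2},
                            placeholder_topic_value)
```

Here "function" is repeated but is not difference frequency vocabulary, so it never enters the queue.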

the first speech-to-text unit M1 and the second speech-to-text unit M2 in fig. 1 are the same, and each comprises the following modules:

1) The subject word matching module: after voice-to-pinyin conversion, a pinyin string composed of letters and tones is obtained and named A. In converting A into characters, subject word matching is performed first: A is matched against the subject word queue, starting from the first subject word of the queue. If the matching succeeds, the matched part of A's pinyin is converted into characters; if it fails, the next subject word is considered, and so on until the last subject word of the queue. This module is started only when the questioner and the responder are in a telephone conversation; otherwise processing goes directly to the level priority matching module.

2) The level priority matching module: after subject word matching, the remaining pinyin of A is matched first against the pinyin of the vocabulary stored in the first-level sub-library module of the difference frequency special word stock unit. If the matching succeeds, the matched part of A's pinyin is converted into characters; if it fails, the next level is considered, and so on until the fourth-level sub-library module. The level priority matching module comprises two sub-modules. The reverse word taking module takes, from the words in the first-level sub-library module that have not yet been matched, the pinyin of the word with the highest difference frequency value and names it B; once all words in the first-level sub-library module have been matched, it moves on to the next-level sub-library module. The arbitrary position pinyin conversion module searches A for a substring C similar to B; if B and C match successfully, C is converted into the corresponding Chinese vocabulary. If A contains multiple substrings similar to B, the above operation is repeated; the substring C may be located at any position of A.

3) Frequency priority matching module: after the level priority matching module finishes matching, matching the remaining pinyin of the A with the pinyin of the universal vocabulary, wherein the non-special vocabulary with high frequency in the universal data is matched preferentially, and finally the remaining pinyin is matched with the pinyin of the single Chinese character.

The matching used in the voice-to-text unit is realized by a matching module, and the matching of pinyin, vocabulary and characters can be realized according to a known method, and the invention provides a preferable matching scheme comprising the following steps:

1) The phoneme edit distance calculating module: the phoneme edit distance is the minimum number of phoneme editing operations required to convert one of two pinyin strings into the other. A phoneme here is an initial or a final of pinyin, and the allowed editing operations are: inserting an initial or final, deleting an initial or final, and replacing one initial or final with another; a replacement between fuzzy sounds counts as only 0.5 operations. Example: suppose Yue Tang station, "yue4 tang2 zhan4", is pronounced "yue4 tan2 zhan4" because the speaker's Mandarin is non-standard. The correct pinyin is obtained by replacing the final an with ang, and since an and ang are fuzzy sounds of each other, the phoneme edit distance is 0.5.

2) The judgment output module: if the matched vocabulary is special vocabulary, it outputs the phoneme edit distance and a matching success signal when the phoneme edit distance is smaller than a given threshold, and otherwise outputs a matching failure signal; if universal vocabulary is matched, it outputs the phoneme edit distance.

The tones of pinyin are not considered here: Chinese has many dialects whose pronunciation differs greatly from place to place, many speakers find tones hard to distinguish, and tones are also affected by changes of intonation and mood.
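The preferred matching scheme can be illustrated with a small Python sketch of a weighted edit distance over initials and finals. It is a minimal sketch under assumptions rather than the patented implementation: only the an/ang fuzzy pair comes from the text, the other fuzzy pairs are illustrative, y and w are treated as initials for simplicity, and the threshold in the judgment function is a hypothetical value.

```python
import re

# Two-letter initials come first so that "zh" is tried before "z".
INITIALS = ("zh", "ch", "sh", "b", "p", "m", "f", "d", "t", "n", "l", "g",
            "k", "h", "j", "q", "x", "r", "z", "c", "s", "y", "w")

# Only an/ang is taken from the text; the other pairs are ASSUMED examples.
FUZZY_PAIRS = {frozenset(pair) for pair in [
    ("an", "ang"), ("en", "eng"), ("in", "ing"),
    ("z", "zh"), ("c", "ch"), ("s", "sh"),
]}


def phonemes(pinyin):
    """Split a space-separated pinyin string into initials and finals, dropping tones."""
    result = []
    for syllable in pinyin.split():
        syllable = re.sub(r"\d", "", syllable)  # tones are not considered
        initial = next((i for i in INITIALS if syllable.startswith(i)), "")
        final = syllable[len(initial):]
        result.extend(p for p in (initial, final) if p)
    return result


def substitution_cost(a, b):
    if a == b:
        return 0.0
    return 0.5 if frozenset((a, b)) in FUZZY_PAIRS else 1.0


def phoneme_edit_distance(pinyin_a, pinyin_b):
    """Edit distance over phonemes; a fuzzy sound replacement costs 0.5."""
    a, b = phonemes(pinyin_a), phonemes(pinyin_b)
    previous = [float(j) for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        current = [float(i)]
        for j in range(1, len(b) + 1):
            current.append(min(
                previous[j] + 1.0,       # delete a phoneme
                current[j - 1] + 1.0,    # insert a phoneme
                previous[j - 1] + substitution_cost(a[i - 1], b[j - 1]),
            ))
        previous = current
    return previous[-1]


def match_special(word_pinyin, heard_pinyin, threshold=1.0):
    """Judgment for special vocabulary: (success signal, phoneme edit distance)."""
    distance = phoneme_edit_distance(word_pinyin, heard_pinyin)
    return distance < threshold, distance
```

For the example in the text, `phoneme_edit_distance("yue4 tang2 zhan4", "yue4 tan2 zhan4")` evaluates to 0.5, because the only difference is the fuzzy pair an/ang.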

The answer generation unit in fig. 1 is a core of the apparatus, and is configured to generate an answer, including:

The query module: its input is the question text of the questioner. In the example above, the customer's question "What functions does the green flagship version laser rangefinder have?" is denoted A2. The subject words, difference frequency vocabulary and universal vocabulary in A2 are obtained during the matching process of voice-to-text conversion and are arranged in time order, giving the vocabulary set A2S = {green flagship version (difference frequency vocabulary), laser rangefinder (professional vocabulary), function (universal vocabulary), what (universal vocabulary), have (universal vocabulary)}, which is output to the question and answer summary library module for query. Suppose the library stores a question text sentence B2 with the vocabulary set B2S = {green flagship version (difference frequency vocabulary), laser rangefinder (professional vocabulary), function (universal vocabulary)}, and the overlap ratio of B2S and A2S is greater than a set threshold. Then the answer text sentence corresponding to B2 in the library, "The green flagship version laser rangefinder has three functions: countdown measurement, universal level bubble, and secondary Pythagorean measurement", is stored in an answer sequence. The search of the library with A2S continues, so that multiple answers may be obtained and stored in the answer sequence, until the whole library has been searched. The answer sequence is sorted by overlap ratio from high to low, and the first answer of the sequence is output to the voice synthesizing unit to synthesize speech. If the answer sequence is empty, a signal requesting intervention by the responder is output so that manual customer service answers the question, and the learning module is started at the same time;

The learning module: its input is the answer text of the responder. It stores the answer and its vocabulary set in the question and answer summary library as an answer, and at the same time stores the corresponding question text and its vocabulary set. The learning module is started only when a question is answered manually. In the example above, assuming no entry in the library matches A2S, the answer sequence is empty; the manual customer service agent sees the intervention request signal, picks up the phone and answers "The green flagship version laser rangefinder has three functions: countdown measurement, universal level bubble, and secondary Pythagorean measurement". This speech is converted into a text sentence by the second voice-to-text unit M2 and stored in the library as an answer, and A2S is stored correspondingly and bound to it as a question-and-answer pair.
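The interplay of the query module and the learning module can be sketched as follows. This is a simplified illustration, not the patented implementation: the class and function names are hypothetical, only the question vocabulary set and the answer text are stored, and a plain set-overlap measure stands in for the weighted overlap ratio so that the sketch stays self-contained.

```python
class QALibrary:
    """Toy question and answer summary library (all names are hypothetical)."""

    def __init__(self, overlap, threshold):
        self.overlap = overlap      # callable(set_a, set_b) -> float
        self.threshold = threshold  # the "set threshold" for a match
        self.entries = []           # (question vocabulary set, answer text)

    def query(self, a2s):
        """Return the answer sequence, ordered by overlap ratio, highest first."""
        hits = [(self.overlap(a2s, question), answer)
                for question, answer in self.entries]
        hits = [hit for hit in hits if hit[0] > self.threshold]
        hits.sort(key=lambda hit: -hit[0])
        return [answer for _, answer in hits]

    def learn(self, a2s, answer_text):
        """Bind a question vocabulary set and a manual answer into one pair."""
        self.entries.append((frozenset(a2s), answer_text))


def respond(library, a2s, ask_human):
    """Answer from the library, or request intervention and learn the reply."""
    answers = library.query(a2s)
    if answers:
        return answers[0]        # first answer goes to speech synthesis
    answer = ask_human()         # empty answer sequence: manual intervention
    library.learn(a2s, answer)   # the learning module starts only in this case
    return answer


def shared_fraction(a, b):
    """STAND-IN overlap measure (shared words / all words), not the weighted one."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


library = QALibrary(shared_fraction, threshold=0.5)
first = respond(library, {"flagship", "rangefinder", "function", "what", "have"},
                lambda: "three functions")   # learned from the human reply
second = respond(library, {"flagship", "rangefinder", "function"},
                 lambda: "not needed")       # now answered from the library
```

The second call is answered from the library because the pair learned during the first call overlaps the new question above the threshold.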

The overlap ratio in the query module is calculated as follows:

Overlap ratio of the two vocabulary sets = T1 + T2 + … + Tp + Q1 + Q2 + … + Qr + U1 + U2 + … + Uj. Here T1, T2 … Tp, Q1, Q2 … Qr, U1, U2 … Uj are the preset weight coefficients of the coinciding vocabulary. The more important the vocabulary, the higher its weight; therefore T1, T2 … Tp > Q1, Q2 … Qr > U, where U represents any one of U1, U2 … Uj.

Example: A2S = {green flagship version (difference frequency vocabulary), laser rangefinder (professional vocabulary), function (universal vocabulary), what (universal vocabulary), have (universal vocabulary)}, B2S = {green flagship version (difference frequency vocabulary), laser rangefinder (professional vocabulary), function (universal vocabulary)}; the coinciding vocabulary is green flagship version (difference frequency vocabulary), laser rangefinder (professional vocabulary) and function (universal vocabulary), so overlap ratio = 0.6 + 0.3 + 0.1 = 1. In this example "green flagship version" is the most important vocabulary and "laser rangefinder" the second, while the universal vocabulary "function" is unimportant and contributes least to the query. Calculating the overlap ratio thus decomposes a sentence according to the importance of its vocabulary, so that a search can be carried out as if with keywords. The network search engines in use at present require keywords to be entered manually and cannot search by sentence, which requires some skill and is not friendly to beginners.
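The worked example can be checked with a short Python sketch. The weights 0.6, 0.3 and 0.1 are the ones used in the example; the function name and the per-word weight dictionary are assumptions made for illustration, and words without a listed weight are assumed to contribute nothing.

```python
import math


def overlap_ratio(a2s, b2s, weights):
    """Sum the preset weights of the vocabulary that both sets share."""
    shared = set(a2s) & set(b2s)
    # fsum limits floating-point rounding when adding decimal weights.
    return math.fsum(weights.get(word, 0.0) for word in shared)


# Weights from the example in the text.
weights = {"green flagship version": 0.6, "laser rangefinder": 0.3, "function": 0.1}
a2s = {"green flagship version", "laser rangefinder", "function", "what", "have"}
b2s = {"green flagship version", "laser rangefinder", "function"}
ratio = overlap_ratio(a2s, b2s, weights)
```

Here `ratio` comes out at approximately 1, matching 0.6 + 0.3 + 0.1 in the example; "what" and "have" occur only in A2S and therefore add nothing.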

Furthermore, the response generating unit of the invention can be independently used as a text automatic response function, and when the input is text, a relevant device for voice recognition is not needed, so that the response generating unit has a plurality of new application scenes.

The first new application is web chat question answering, such as Taobao's Aliwangwang chat tool, where both the customer and the customer service agent type text manually. The response generation method here differs slightly from the response generating unit in fig. 1; its flow is shown in fig. 2. Because there is no voice-to-text step, the subject words, difference frequency vocabulary and universal vocabulary must be separated from the text sentence input by the customer. Word extraction is similar to pinyin matching, except that pinyin matching is replaced by text comparison. For example, the customer inputs the text sentence A2 "What functions does the green flagship version laser rangefinder have?", in which the difference frequency vocabulary is ordered by difference frequency value: green flagship version > laser rangefinder; the rest is universal vocabulary. 1) Reverse word taking: words are taken from the first-level sub-library one by one in descending order of difference frequency value, and for each word taken, A2 is searched for a substring that compares successfully. 2) Arbitrary position division: existing methods segment words starting from the first character, whereas this method can divide out a substring at any position of the string A2; if the matching gap is larger than a given threshold, the word is abandoned and the next word is taken, until "green flagship version" compares successfully with the corresponding part of A2, so that the string A2 becomes [green flagship version / laser rangefinder has what function]. Reverse word taking and arbitrary position division are specially designed for difference frequency special vocabulary and differ from currently known methods.
Similarly, the remaining special vocabulary of the string A2 is divided out, and finally the universal vocabulary: [green flagship version / laser rangefinder / has / what / function]. The subsequent steps are the same as those of the response generating unit in fig. 1.
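Reverse word taking with arbitrary position division can be sketched for the text case as follows. The sketch is illustrative only: it uses exact text comparison in place of a comparison with a gap threshold, the function name is hypothetical, and an English string without spaces stands in for an unsegmented Chinese sentence.

```python
def segment(sentence, special_words, general_words):
    """Divide a sentence by taking lexicon words first and locating them anywhere.

    special_words: ordered by level, then by descending difference value.
    general_words: ordered by descending frequency.
    Returns (words in sentence order, leftover unsegmented pieces).
    """
    pieces = [(sentence, False)]  # (text, already divided out as a word?)
    for word in list(special_words) + list(general_words):
        next_pieces = []
        for text, done in pieces:
            if done or word not in text:
                next_pieces.append((text, done))
                continue
            parts = text.split(word)  # the word may sit at any position
            for k, part in enumerate(parts):
                if part:
                    next_pieces.append((part, False))
                if k < len(parts) - 1:
                    next_pieces.append((word, True))
        pieces = next_pieces
    words = [text for text, done in pieces if done]
    leftovers = [text for text, done in pieces if not done]
    return words, leftovers


# A string without spaces stands in for an unsegmented Chinese sentence.
words, leftovers = segment(
    "greenflagshiplaserrangefinderhaswhatfunction",
    special_words=["greenflagship", "laserrangefinder"],
    general_words=["function", "what", "has"],
)
```

Taking words from the lexicon and searching for them in the sentence, rather than scanning the sentence from its first character, is what lets the high-value special words claim their substrings before any universal vocabulary is considered.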

As shown in fig. 2, the online learning automatic text response method provided by the invention is used for automatically answering the text questions of a questioner on a network and automatically learning the answers of the responder, and comprises the following steps:

querying: A2S is output to a question and answer summary library for inquiry;

In addition, the automatic text response method can also be used in devices that search by spoken sentences, such as smart speakers.

A second new application is a method of extracting words from sentences for searching and ordering the results, comprising:

Extracting words: the input is a text sentence A2; A2 is segmented using the subject word sharing unit, the difference frequency special word stock and the universal word stock to obtain the vocabulary set A2S, which comprises difference frequency vocabulary and universal vocabulary; the word segmentation method is the same as in the first new application.

Searching: searching the network or local database with A2S to obtain a plurality of results C1, C2, … Ci, … Cm;

Overlap ratio of Ci = Q1 + Q2 + … + Qr + U1 + U2 + … + Uj; here Q1, Q2 … Qr, U1, U2 … Uj are preset coefficients.

Sorting results: after the overlap ratio has been calculated for each of C1, C2, … Ci, … Cm, the results are reordered by overlap ratio from high to low and output.
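The sorting step can be sketched in Python as below. This is a hedged illustration: the function name is hypothetical, the weights are borrowed from the earlier worked example, and a word is assumed to coincide with a result when it occurs in the result's text.

```python
def rank_results(results, a2s_weights):
    """Reorder search results by overlap ratio with A2S, highest first."""
    def overlap(text):
        # ASSUMPTION: a word coincides with a result if it occurs in its text.
        return sum(weight for word, weight in a2s_weights.items() if word in text)
    return sorted(results, key=overlap, reverse=True)


a2s_weights = {"green flagship": 0.6, "rangefinder": 0.3, "function": 0.1}
ranked = rank_results(
    ["rangefinder manual", "unrelated page", "green flagship rangefinder functions"],
    a2s_weights,
)
```

The result that mentions all three words moves to the front, and the page that shares no vocabulary with A2S ends up last.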

FIG. 3 is a specific method and flow for converting speech into text, comprising the steps of:

S1, converting voice into pinyin. The digitized voice signal is analyzed and recognized with a well-known deep learning voice recognition algorithm to obtain the pinyin of the whole sentence. For example, in a power telephone dispatching response system, the dispatcher's spoken reply to the on-site operator is: "Input Yue Tang station Yue Gang Xiangshi line 35 grounding knife switch and 36 grounding knife switch". After this conversion, the pinyin string A is obtained: [tou2 ru4 yue4 tang2 zhan4 yue4 gang1 xiang1 shi2 xian4 san1 wu3 jie1 di4 dao1 zha2 he2 san1 liu4 jie1 di4 dao1 zha2];

S2, the subject word matching module queries the subject word sharing unit and matches A against the subject word queue in it, starting from the first subject word of the queue: if the matching succeeds, the matched part of A's pinyin is converted into characters; if it fails, the next subject word is considered, and so on until the last subject word of the queue;

S3, the level priority matching module matches the remaining pinyin in A to Chinese text by querying the difference frequency special word stock. For example, Yue Tang station, Yue Gang Xiangshi line and grounding knife switch are special vocabulary, ordered by difference frequency value: Yue Tang station (level 1) > Yue Gang Xiangshi line (level 2) > grounding knife switch (level 3). 1) Reverse word taking: words are taken from the first-level sub-library one by one in descending order of difference frequency value, and for each word taken, the pinyin string A is searched for a matching substring. Current matching methods take pinyin from the string A and look it up in a vocabulary library; the method of this patent does the opposite, hence the name reverse word taking. 2) Arbitrary position conversion: current methods convert characters starting from the first letter, whereas this method can convert a substring at any position of the string A; if the matching gap is larger than a given threshold, the word is abandoned and the next word is taken, until the pinyin of Yue Tang station, "yue4 tang2 zhan4", matches the corresponding part of the pinyin string A, so that A becomes [tou2 ru4 Yue Tang station yue4 gang1 xiang1 shi2 xian4 san1 wu3 jie1 di4 dao1 zha2 he2 san1 liu4 jie1 di4 dao1 zha2]. Reverse word taking and arbitrary position conversion are specially designed for difference frequency special vocabulary and differ from currently known methods. Similarly, the remaining special vocabulary of the string A is then converted: [tou2 ru4 Yue Tang station Yue Gang Xiangshi line san1 wu3 grounding knife switch he2 san1 liu4 grounding knife switch];

S5, the frequency priority matching module matches the remaining pinyin of A against the universal vocabulary. Once all the special vocabulary in the string A has been converted, the universal vocabulary is matched according to the known frequency priority method: proceeding from front to back, tou2 ru4 is taken and looked up in the universal dictionary to obtain "input", and the string A becomes: [input Yue Tang station Yue Gang Xiangshi line san1 wu3 grounding knife switch he2 san1 liu4 grounding knife switch];

S6, matching the remaining pinyin with single Chinese characters to obtain the whole sentence of text [input Yue Tang station Yue Gang Xiangshi line 35 grounding knife switch and 36 grounding knife switch];

S7, outputting the whole sentence of characters, and outputting the vocabulary classification obtained by the matching in S2, S3, S4 and S5 to the subject word sharing unit, the difference frequency special word stock and the universal word frequency dictionary, so as to refresh the subject word queue, the difference frequency values and their ordering, and the vocabulary frequencies. Example: the difference frequency values of the difference frequency vocabulary Yue Tang station and Yue Gang Xiangshi line are refreshed and their positions in the difference frequency special word stock are updated, whereas difference frequency vocabulary that did not appear need not be refreshed; if these words also appeared in previous sentences, the subject word queue is refreshed, and if they did not appear in previous sentences, they are added to the queue as new entries and placed at the end.

S8, if the voice is continuously input, turning to S1, otherwise, the next step is performed;

S9, ending.
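The conversion flow of fig. 3 can be approximated by the Python sketch below, which applies special vocabulary, universal vocabulary and single characters in turn to a pinyin string. It is a simplified sketch under assumptions: a per-syllable mismatch count stands in for the phoneme edit distance, the tolerance of one mismatch for special vocabulary is a hypothetical threshold, the English renderings and digits stand in for the Chinese output, and the input contains the mispronounced "tan2" to show the tolerance at work.

```python
def strip_tone(syllable):
    """Drop the trailing tone digit; tones are not considered in matching."""
    return syllable.rstrip("0123456789")


def apply_words(tokens, words, max_mismatch):
    """Reverse word taking: take each word and search the tokens at any position."""
    for word, word_pinyin in words:
        target = [strip_tone(s) for s in word_pinyin.split()]
        k = len(target)
        i = 0
        while i + k <= len(tokens):
            window = tokens[i:i + k]
            if not any(done for _, done in window):
                # STAND-IN for the phoneme edit distance: count differing syllables.
                mismatches = sum(a != b for (a, _), b in zip(window, target))
                if mismatches <= max_mismatch:
                    tokens[i:i + k] = [(word, True)]
            i += 1
    return tokens


def convert(pinyin, special, general, characters):
    """Special vocabulary first, then universal vocabulary, then single characters."""
    tokens = [(strip_tone(s), False) for s in pinyin.split()]
    apply_words(tokens, special, max_mismatch=1)     # hypothetical tolerance
    apply_words(tokens, general, max_mismatch=0)
    apply_words(tokens, characters, max_mismatch=0)
    return " ".join(text for text, _ in tokens)


heard = ("tou2 ru4 yue4 tan2 zhan4 yue4 gang1 xiang1 shi2 xian4 san1 wu3 "
         "jie1 di4 dao1 zha2 he2 san1 liu4 jie1 di4 dao1 zha2")
sentence = convert(
    heard,
    special=[("Yue Tang station", "yue4 tang2 zhan4"),
             ("Yue Gang Xiangshi line", "yue4 gang1 xiang1 shi2 xian4"),
             ("grounding knife switch", "jie1 di4 dao1 zha2")],
    general=[("input", "tou2 ru4")],
    characters=[("3", "san1"), ("5", "wu3"), ("6", "liu4"), ("and", "he2")],
)
```

The special vocabulary is listed in the order of the example (level 1 before level 2 before level 3), so Yue Tang station is located first even though it does not begin the sentence.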

The above examples are preferred embodiments of the present invention, but the embodiments of the present invention are not limited to them; any other changes, modifications, substitutions, combinations and simplifications that do not depart from the spirit and principle of the present invention shall be regarded as equivalent replacements and are included in the protection scope of the present invention.

Claims

1. An online learning speech recognition response device, comprising: the system comprises a voice-to-text module, a response generating unit and a voice synthesizing unit;

the voice-to-text module recognizes voice digital signals of the questioner as corresponding characters and outputs the corresponding characters to the response generation unit, wherein the questioner refers to a person who asks a question; the voice-to-text module also recognizes voice digital signals of the answer person as corresponding characters and outputs the corresponding characters to the answer generation unit, the answer person refers to the person answering the question, and the voice-to-text module can respectively realize conversion of two different voice sources in a time-sharing working mode;

The voice synthesis unit synthesizes voice signals according to the words output by the response generation unit and outputs the voice signals to the sounding device to realize machine voice response;

wherein i and j indicate that the vocabulary is repeated in the i-th and j-th text sentences, the ellipsis represents other text sentences in which it is repeated, i and j are less than n, and G is the level of the sub-library of the difference frequency special word stock to which the vocabulary belongs and takes an integer value; the topic values of all subject words in the first n text sentences are calculated, and the subject words are then queued from large to small by topic value to obtain the subject word queue;

the difference frequency special word stock unit comprises: first-, second-, third- and fourth-level sub-library modules, which are used for storing the first-, second-, third- and fourth-level difference frequency vocabulary and its difference frequency values respectively, wherein vocabulary with a higher difference frequency value is queued nearer the front within the same-level sub-library;

the difference frequency distribution module is used for storing the top-ranked 25% of vocabulary by difference frequency value into the first-level sub-library module, the 26% to 50% of vocabulary into the second-level sub-library module, the 51% to 75% of vocabulary into the third-level sub-library module, and the other vocabulary with a difference frequency value greater than 0 into the fourth-level sub-library module; vocabulary with a difference frequency value less than or equal to 0 is removed.

2. The online learning speech recognition response device of claim 1, further comprising: telephone monitor, sound card; the voice-to-text module comprises two subunits which can work independently and simultaneously: the first voice-to-text unit and the second voice-to-text unit;

3. The device of claim 2, wherein the first speech to text unit and the second speech to text unit are identical and comprise the following modules:

a subject word matching module, used for performing subject word matching between A and the subject word queue, starting from the first subject word of the queue: if the matching succeeds, the matched part of A's pinyin is converted into characters; if it fails, the next subject word is considered, and so on until the last subject word of the queue; wherein the matching is realized by the following two modules:

a phoneme edit distance calculating module: the phoneme edit distance refers to the minimum number of phoneme editing operations required to convert one of two pinyin strings into the other, wherein a phoneme refers to an initial or a final of pinyin, and the allowed editing operations comprise: inserting an initial or a final, deleting an initial or a final, and replacing one initial or final with another, a replacement between fuzzy sounds counting as only 0.5 operations, wherein the above operations do not involve tones;

wherein, the level priority matching module comprises:

an arbitrary-position pinyin conversion module: searching A for a substring C similar to B; if B and C are successfully matched, converting C into the corresponding Chinese vocabulary; if A contains a plurality of substrings similar to B, repeating the above operation; the substring C may be located at any position in A.
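The phoneme edit distance and the arbitrary-position conversion of claim 3 can be sketched together as follows. This is an illustration under stated assumptions, not the claimed implementation: A is taken to be a list of toneless pinyin syllables, B the pinyin of a candidate word, "similar" is taken to mean a phoneme edit distance no greater than a threshold, and candidate substrings have the same number of syllables as B. The tables `INITIALS` and `FUZZY_PAIRS`, the function names and the `max_distance` parameter are all hypothetical; the claim does not list the fuzzy sounds or give a threshold.

```python
# Hypothetical tables: the claim names initials, finals and fuzzy sounds
# but does not enumerate them, so these are illustrative assumptions.
INITIALS = ("zh", "ch", "sh", "b", "p", "m", "f", "d", "t", "n", "l",
            "g", "k", "h", "j", "q", "x", "r", "z", "c", "s", "y", "w")
FUZZY_PAIRS = {frozenset(pair) for pair in (
    ("z", "zh"), ("c", "ch"), ("s", "sh"), ("n", "l"),
    ("an", "ang"), ("en", "eng"), ("in", "ing"))}


def to_phonemes(syllables):
    """Split toneless pinyin syllables into initials and finals."""
    phonemes = []
    for syllable in syllables:
        # Two-letter initials come first in INITIALS, so "zh" wins over "z".
        initial = next((i for i in INITIALS if syllable.startswith(i)), "")
        phonemes.extend(p for p in (initial, syllable[len(initial):]) if p)
    return phonemes


def substitution_cost(x, y):
    """0 for identical phonemes, 0.5 for a fuzzy-sound pair, 1 otherwise."""
    if x == y:
        return 0.0
    return 0.5 if frozenset((x, y)) in FUZZY_PAIRS else 1.0


def phoneme_edit_distance(s1, s2):
    """Minimum edit cost between two syllable lists, counted in phonemes."""
    p, q = to_phonemes(s1), to_phonemes(s2)
    prev = [float(j) for j in range(len(q) + 1)]
    for i, x in enumerate(p, start=1):
        curr = [float(i)]
        for j, y in enumerate(q, start=1):
            curr.append(min(prev[j] + 1.0,        # delete x
                            curr[j - 1] + 1.0,    # insert y
                            prev[j - 1] + substitution_cost(x, y)))
        prev = curr
    return prev[-1]


def convert_anywhere(a, b, word, max_distance=0.5):
    """Replace every window of `a` that is close to `b` with `word`."""
    if not b:
        return list(a)
    out, i, n = [], 0, len(b)
    while i < len(a):
        window = a[i:i + n]
        if len(window) == n and phoneme_edit_distance(window, b) <= max_distance:
            out.append(word)   # substring C matched B: emit the Chinese word
            i += n
        else:
            out.append(a[i])   # keep the syllable as remaining pinyin
            i += 1
    return out


exact = convert_anywhere(["wo", "yao", "cha", "hua", "fei"], ["hua", "fei"], "话费")
fuzzy = convert_anywhere(["cha", "zang", "dan"], ["zhang", "dan"], "账单")
```

In the second call the recognized syllable "zang" differs from "zhang" only by the fuzzy pair z/zh, so the distance is 0.5 and the substring is still converted.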

4. The online learning speech recognition response device of claim 1, wherein the response generation unit includes:

a query module: its input is the output of the second voice-to-text unit, namely the text sentence of the question asked by the questioner, denoted A2; all the words in A2 are output as a vocabulary set A2S to the question-and-answer total library module for querying; the words comprise subject words, difference frequency words and general words, and A2S is obtained by the voice-to-text module;

and an online learning module: its input is the text sentence of the answer given by the answerer; the answer and its vocabulary set are stored as an answer in the question-and-answer total library, and the question text corresponding to the answer and its vocabulary set are stored at the same time; the online learning module is started only when a question is answered manually.
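The interplay of the query module and the online learning module can be sketched as a small in-memory library. This is a hedged illustration only: the class names `QAEntry` and `QALibrary`, the `answered_by_human` flag and the rule that a stored question is a candidate when it shares at least one word with A2S are assumptions, not details taken from the claim.

```python
from dataclasses import dataclass, field


@dataclass
class QAEntry:
    question: str
    question_words: frozenset
    answer: str
    answer_words: frozenset


@dataclass
class QALibrary:
    entries: list = field(default_factory=list)

    def learn(self, question, question_words, answer, answer_words,
              answered_by_human):
        """Store a question/answer pair, but only for manual answers."""
        if not answered_by_human:
            return False
        self.entries.append(QAEntry(question, frozenset(question_words),
                                    answer, frozenset(answer_words)))
        return True

    def query(self, a2s):
        """Return stored entries whose question shares a word with A2S."""
        a2s = frozenset(a2s)
        # Assumption: any shared word makes an entry a candidate.
        return [entry for entry in self.entries if entry.question_words & a2s]


library = QALibrary()
skipped = library.learn("话费怎么查", {"话费", "查"}, "请拨打客服", {"拨打", "客服"},
                        answered_by_human=False)
stored = library.learn("话费怎么查", {"话费", "查"}, "请拨打客服", {"拨打", "客服"},
                       answered_by_human=True)
hits = library.query({"我", "话费"})
```

Gating `learn` on a manual answer mirrors the condition that the online learning module is started only when a question is answered manually.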

5. The online learning speech recognition response device of claim 4, wherein the overlap ratio is calculated as follows:

The overlap ratio of the two vocabulary sets = T1 + T2 + … + Tp + Q1 + Q2 + … + Qr + U1 + U2 + … + Uj; here, T1, T2, …, Tp, Q1, Q2, …, Qr, U1, U2, …, Uj are preset weight coefficients;
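The formula above sums preset weight coefficients but, as reproduced here, does not say which words the coefficients attach to. A minimal sketch, assuming each word shared by the two vocabulary sets contributes its own coefficient (T for shared subject words, Q for shared difference frequency words, U for shared general words). The function name `overlap_ratio` and the coefficient values are hypothetical.

```python
def overlap_ratio(words_a, words_b, weights, default_weight=0.0):
    """Sum the preset coefficients of the words the two sets share."""
    shared = set(words_a) & set(words_b)
    return sum(weights.get(word, default_weight) for word in shared)


# Hypothetical coefficients: a subject word weighs more than a
# difference frequency word, which weighs more than a general word.
WEIGHTS = {"话费": 3.0, "余额": 2.0, "查询": 1.0, "我": 1.0}

question_words = {"我", "查询", "话费", "余额"}
stored_words = {"查询", "话费", "套餐"}
score = overlap_ratio(question_words, stored_words, WEIGHTS)
```

Here the two sets share "查询" and "话费", so the score is the sum of those two coefficients; words present in only one set contribute nothing.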

6. A voice-to-text method of the online learning speech recognition response device according to any one of claims 1-5, comprising the following steps:

S2, performing subject word matching on A;

S3, performing level priority matching on the remaining pinyin of A;

S4, performing frequency priority matching on the remaining pinyin of A;

7. An automatic text answering method with online learning for the online learning voice recognition answering device according to any one of claims 1 to 5, characterized in that it is used for automatically answering text questions of a questioner on a network and automatically learning the answers of an answerer, comprising:

extracting words: the input is a text sentence A2; A2 is segmented into words by using the subject word sharing unit, the difference frequency vocabulary library and the general vocabulary library, so as to obtain a vocabulary set A2S; the vocabulary set comprises subject words, difference frequency words and general words; the questioner refers to a person who asks questions, including clients;

querying: A2S is output to the question-and-answer total library for querying;

online learning: the input is the text sentence of the answer given by the answerer; the answer and its vocabulary set are stored as an answer in the question-and-answer total library, and the question text corresponding to the answer and its vocabulary set are stored at the same time; the online learning step is started only when a question is answered manually.
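The "extracting words" step of the method above can be illustrated with a short sketch. The claim does not name a segmentation algorithm, so this example substitutes forward maximum matching, and it assumes that when a word appears in several libraries the subject word library takes priority, then the difference frequency library, then the general library. The function name `extract_vocabulary`, the `max_len` parameter and the sample libraries are hypothetical.

```python
def extract_vocabulary(sentence, subject_words, diff_freq_words,
                       general_words, max_len=6):
    """Segment `sentence` and label each found word with its library."""
    # Assumed priority order when a word occurs in more than one library.
    libraries = (("subject", subject_words),
                 ("difference_frequency", diff_freq_words),
                 ("general", general_words))
    found = {}
    i = 0
    while i < len(sentence):
        # Forward maximum matching: try the longest candidate first.
        for length in range(min(max_len, len(sentence) - i), 0, -1):
            piece = sentence[i:i + length]
            kind = next((name for name, lib in libraries if piece in lib), None)
            if kind:
                found[piece] = kind
                i += length
                break
        else:
            i += 1  # no library covers this character; skip it
    return found


a2s = extract_vocabulary(
    "我要查询话费余额",
    subject_words={"话费"},
    diff_freq_words={"余额"},
    general_words={"我", "要", "查询"},
)
```

The keys of the returned mapping form the vocabulary set A2S, and the labels record which of the three word types each entry belongs to.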

8. A method for extracting vocabulary from sentences, searching for results and ranking the results in the online learning speech recognition response device according to any one of claims 1-5, comprising:

the word segmentation includes:

the comparison includes: