CN202736475U - Chat robot

Info

Publication number: CN202736475U
Application number: CN 201120508956
Authority: CN
Inventors: 肖南峰
Original assignee: South China University of Technology SCUT
Current assignee: South China University of Technology SCUT
Priority date: 2011-12-08
Filing date: 2011-12-08
Publication date: 2013-02-13
Anticipated expiration: 2021-12-08

Abstract

The utility model discloses a chat robot comprising a camera and its driving module, a voice pickup module, a speech recognition module, a knowledge query module and a speech generation module. The camera captures a face image, and the semantics of a voice signal are recognized through the voice pickup module and the speech recognition module. The chat robot understands a person's needs from his or her speech, forms conversational statements through the knowledge query module, and generates speech through the speech generation module to communicate with the person. The chat robot has speech recognition and understanding abilities and can understand users' commands. It can be applied in schools, homes, hotels, companies, airports, bus stations, docks, meeting rooms and similar places for education, chat, conversation, consultation and the like. It can also help users with publicity and introductions, guest reception, business inquiries, secretarial service, foreign-language interpretation and the like.

Description

A chat robot

Technical field

The utility model relates to the field of intelligent robotics, and in particular to a chat robot.

Background art

Many public places are equipped with terminals for information inquiry, generally consisting of a touch screen and a computer. Users make queries by touch, mouse or keyboard and cannot query information directly through conversation. Equipment with a voice function usually contains only a circuit for playing back speech. For example, Chinese patent application No. 200910248546.8 discloses a self-service charging machine composed of a main control unit (an industrial computer) and peripheral functional modules, including an operating-status display, a maintenance display, a service keyboard, an operation display and an infrared touch screen, connected through serial, USB, LVDS, VGA and Ethernet interfaces. That equipment cannot exchange information through dialogue. A device that exchanges information directly through dialogue is therefore needed to meet the interaction needs of different users.

Summary of the utility model

The purpose of the utility model is to overcome the above shortcomings of the prior art by providing a chat robot that enables direct dialogue between a person and the robot and can be used for consultation in public places. The technical scheme is as follows.

A chat robot comprises cameras, a camera driver module, a voice pickup module, and a computer for implementing speech recognition, knowledge query and speech generation. The voice pickup module is a microphone for picking up voice signals. The cameras are used for capturing face images. The number of cameras is 2.

The cameras have 5 degrees of freedom.

Compared with the prior art, the utility model has the following beneficial effects. The chat robot can see, listen, speak and remember. After talking with a user once it remembers the user's voice, and after seeing a face once it recognizes the user. It can carry out simple conversations and serve around the clock. The chat robot has speech recognition and understanding abilities, can understand users' instructions, and has a strong chat function. It can master the languages of several countries, acting as both guide and interpreter, and can handle various kinds of business, such as reception, data inquiry and incident reporting.

Description of drawings

Fig. 1 is a block diagram of the chat robot in the embodiment.

Fig. 2 is a schematic diagram of the triple representation of semantic knowledge in the embodiment.

Fig. 3 is a functional block diagram of speech recognition based on pattern matching in the embodiment.

Fig. 4 is a block diagram of the speech synthesis module in the embodiment.

Embodiment

The implementation of the utility model is further described below with reference to the accompanying drawings, but is not limited thereto.

As shown in Fig. 1, a chat robot comprises a camera and its driving module, a voice pickup module, a speech recognition module, a knowledge query module and a speech generation module. The camera captures face images. The semantics of a voice signal are recognized after it passes through the voice pickup module and the speech recognition module; the chat robot understands the person's needs from the speech, forms conversational statements through the knowledge query module, and generates speech through the speech generation module to communicate with the person.

In this embodiment, the chat robot comprises 1 high-performance PC, 2 CCD cameras, 5 DC servo motors, 1 single-channel image input board, 1 data acquisition card, 1 microphone and 2 loudspeakers. The 2 CCD cameras have 5 degrees of freedom (controlled by the 5 DC servo motors); they can move up, down, left and right in imitation of a person's two eyes, and can also rotate like a person's neck to track a face. When a user enters the field of view of the 2 cameras, the camera driver module keeps the user at the capture center of each camera at all times, just like a person's eyes. The microphone (voice pickup module) picks up the voice signal, which is converted into a digital signal before speech recognition. The voice pickup, speech recognition, knowledge query and speech generation modules can be implemented by the computer.

Speech recognition module: the speech recognition module converts a voice signal into the corresponding text through recognition. At present, most speech recognition systems adopt the principle of pattern matching. Under this principle, the pattern of the unknown speech is compared one by one with the reference patterns of known speech, and the best-matching reference pattern is taken as the recognition result.

As shown in Fig. 3, the speech to be recognized is converted by the microphone into a voice signal and applied to the input of the recognition system, where it is first preprocessed. Preprocessing includes sampling the voice signal, anti-aliasing band-pass filtering, and removing the effects of individual pronunciation differences and of noise introduced by the equipment and environment. It involves the choice of recognition unit and endpoint detection, and sometimes also includes analog-to-digital conversion. The feature extraction part extracts acoustic parameters that reflect the essential characteristics of the speech. Commonly used features include short-time average energy or amplitude, short-time average zero-crossing rate, short-time autocorrelation function, linear prediction coefficients, unvoiced/voiced flag, fundamental frequency, short-time Fourier transform, cepstrum and formants. Training is carried out before recognition: the speaker repeats the speech several times, redundant information is removed from the raw speech samples, the key data are retained, and the data are then clustered according to certain rules to form a pattern library. Pattern matching is the core of the whole speech recognition system: according to certain criteria and expert knowledge, it computes the similarity between the input features and the stored patterns and determines the semantic information of the input speech.
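Two of the features named above, short-time average energy and short-time average zero-crossing rate, can be sketched in a few lines, together with a very simple energy-based endpoint detector. This is only an illustration: the frame length, hop size, energy threshold and all function names are assumptions and are not taken from the utility model.

```python
import math

def frames(signal, frame_len, hop):
    """Split a list of samples into overlapping frames (a trailing partial frame is dropped)."""
    return [signal[i:i + frame_len]
            for i in range(0, len(signal) - frame_len + 1, hop)]

def short_time_energy(frame):
    """Average squared amplitude of one frame."""
    return sum(x * x for x in frame) / len(frame)

def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)

def detect_endpoints(signal, frame_len=80, hop=40, energy_threshold=0.01):
    """Return (first, last) indices of frames above the energy threshold, or None.

    The threshold is an assumed value; a real system would tune it to the noise level.
    """
    active = [i for i, f in enumerate(frames(signal, frame_len, hop))
              if short_time_energy(f) > energy_threshold]
    return (active[0], active[-1]) if active else None

# Synthetic test signal: silence, a 200 Hz tone sampled at 8 kHz, then silence again.
sample_rate = 8000
tone = [0.5 * math.sin(2 * math.pi * 200 * n / sample_rate) for n in range(800)]
demo_signal = [0.0] * 400 + tone + [0.0] * 400
print(detect_endpoints(demo_signal))
```

Voiced speech tends to show high energy and a low zero-crossing rate, while unvoiced sounds show the opposite, which is why the two features are often used together for endpoint detection.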

Model training means extracting, according to a certain criterion, the model parameters that represent the features of a pattern from a large number of known patterns. Pattern matching means finding, according to a certain criterion, the model in the model library that best matches the unknown pattern. The mainstream model training and pattern matching techniques in speech applications are the following:

(1) Dynamic time warping (DTW) algorithm: time warping aligns the time axes, making the time-varying features within a word consistent. During warping, the time axis of the unknown word is stretched or compressed non-uniformly so that its features can be compared with the model features. DTW is a compact speech recognition algorithm with low system overhead and fast recognition. It is efficient in small-vocabulary voice command and control systems, but it falls short when the system becomes somewhat more complex.

(2) Hidden Markov model (HMM): the HMM is a parametric representation of the time-varying features of the speech signal, in which two interrelated stochastic processes jointly describe the statistical properties of the signal. With the HMM technique, a system with a finite number of states serves as the speech production model. Each state can produce a finite set of outputs until the whole word has been output. Transitions between states are random, and the output in each state is also random; because random transitions and random outputs are allowed, the HMM can accommodate subtle variations in pronunciation. The HMM method handles classification and training well, and the Viterbi search algorithm solves the time-axis normalization problem. The unknown utterance is stretched or shortened until its length agrees with that of the reference model, which is very effective for improving the recognition accuracy of the system.

(3) Artificial neural network (ANN): the concept of neural networks has also been applied to speech recognition. One of the most effective approaches uses a multilayer neural network, which has not only input and output nodes but also one or more layers of hidden nodes. Exploiting the memory and fast-response characteristics of neural networks, the feature values extracted from the voice signal are fed into the network and trained over a long period to obtain the connection weights between nodes. A self-organizing neural network can classify and cluster the input samples, but its output layer cannot be read directly and must first be labeled with pattern classes. A neuron that responds to samples of only one class is labeled directly with the pattern class of those input samples; a neuron on a class boundary is labeled by the processing method for boundary neurons; a neuron that responds to no input class is masked. When a new sample is input, the pattern class it belongs to can then be read directly from the output layer.
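As an illustration of technique (1), the sketch below computes a DTW distance between two feature sequences and picks the closest reference template. It assumes one scalar feature per frame and uses the absolute difference as the local cost; both are simplifications chosen for the example, and the templates are invented.

```python
def dtw_distance(a, b):
    """Dynamic time warping distance between two sequences of scalar features."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] = minimal accumulated cost of aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            cost[i][j] = local + min(cost[i - 1][j],      # stretch b
                                     cost[i][j - 1],      # stretch a
                                     cost[i - 1][j - 1])  # advance both
    return cost[n][m]

def recognize(unknown, templates):
    """Return the name of the reference template with the smallest DTW distance."""
    return min(templates, key=lambda word: dtw_distance(unknown, templates[word]))

# Hypothetical reference patterns for a two-word command vocabulary.
templates = {"yes": [1, 3, 4, 3, 1], "no": [5, 5, 2, 1, 1]}
# The unknown utterance is a slower rendition of "yes".
print(recognize([1, 1, 3, 4, 4, 3, 1], templates))
```

The table has one cell per pair of frames, so the cost grows with the product of the two sequence lengths and with the number of templates, which is consistent with the remark that DTW suits small vocabularies.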

Natural language is the language people use in daily life, a system of sound symbols developed in social life for communicating with one another, such as Chinese, English and Japanese. It is a very complex symbol system; the forms of its symbols and the meanings they express are established by social convention and keep changing as society develops. Natural language understanding, a high-level and important direction of language information processing, has always been one of the core topics of the artificial intelligence community. At the micro level, natural language understanding is a mapping from the natural language system to the computer's internal representation; at the macro level, it means that the computer can perform certain language functions expected by humans according to certain rules.

In written Chinese, words follow one another without explicit delimiters in the sentence. The first task in understanding Chinese is therefore to divide a continuous string of Chinese characters into a sequence of words, i.e. Chinese word segmentation. Chinese word segmentation methods fall into the following three kinds:

(1) Mechanical segmentation. Mechanical segmentation is based on string matching and needs a segmentation dictionary; the structure of the dictionary and the number of words in it directly affect the accuracy and efficiency of segmentation. By scanning direction it is divided into forward, reverse and bidirectional scanning; by matching principle, into maximum matching and minimum matching. The algorithm is simple, and indexing the dictionary effectively speeds up segmentation, but the method cannot resolve ambiguity well and must be combined with other methods to improve segmentation accuracy.

(2) Statistical segmentation. Statistical segmentation is grounded in probability theory and treats the occurrence of character strings in Chinese text as a stochastic process whose parameters can be trained on a large-scale Chinese corpus. Let the character string to be segmented be $C = c_1 c_2 \ldots c_n$ and the output word string be $W = w_1 w_2 \ldots w_m$, where $m \le n$. A given $C$ may correspond to several $W$; the task of statistical segmentation is to find the most probable one, i.e. the $W$ that maximizes $P(W \mid C)$. By Bayes' formula, $P(W \mid C) = P(C \mid W)\,P(W) / P(C)$, where $P(C)$ is a fixed value and the probability of recovering the character string from the word string is $P(C \mid W) = 1$. The problem therefore becomes: among all results of full segmentation, find the $W$ that maximizes $P(W)$. The N-gram model is the most basic statistical language model; $P(W)$ is commonly represented with the bigram model, i.e. $P(W) = P(w_1)\,P(w_2 \mid w_1) \cdots P(w_m \mid w_{m-1})$.

(3) Knowledge-based segmentation. Knowledge-based segmentation, also called rule-based segmentation, goes beyond dictionary matching and uses grammatical, syntactic and semantic knowledge for further processing. It requires a grammatical and semantic knowledge base, and segmentation follows the rules defined in that base. Because the lexical and syntactic rules of Chinese are complex, building a usable knowledge base is difficult and time-consuming, so knowledge-based segmentation is so far hard to apply to large-scale real text and needs further research.
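The difference between methods (1) and (2) can be seen on a classic ambiguous string. The sketch below implements forward maximum matching and a bigram-scored segmentation found by dynamic programming. The dictionary, the bigram probabilities, the start symbol "<s>" and the floor probability for unseen word pairs are all invented for illustration; a real system would estimate them from a corpus.

```python
import math

def forward_max_match(text, dictionary, max_len=4):
    """Mechanical segmentation: at each position take the longest dictionary word."""
    words, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in dictionary:   # single characters always match
                words.append(piece)
                i += size
                break
    return words

def bigram_segment(text, dictionary, bigram_prob, floor=1e-6, max_len=4):
    """Statistical segmentation: maximize the product of P(w_i | w_{i-1})."""
    # best[(end, last_word)] = (log probability, words so far)
    best = {(0, "<s>"): (0.0, [])}
    for end in range(1, len(text) + 1):
        for start in range(max(0, end - max_len), end):
            word = text[start:end]
            if len(word) > 1 and word not in dictionary:
                continue
            for (pos, prev), (score, path) in list(best.items()):
                if pos != start:
                    continue
                p = bigram_prob.get((prev, word), floor)
                candidate = (score + math.log(p), path + [word])
                key = (end, word)
                if key not in best or candidate[0] > best[key][0]:
                    best[key] = candidate
    finals = [v for (pos, _), v in best.items() if pos == len(text)]
    return max(finals, key=lambda v: v[0])[1]

# Toy dictionary and bigram table (assumed values).
dictionary = {"研究", "研究生", "生命", "起源"}
bigram_prob = {
    ("<s>", "研究"): 0.5, ("研究", "生命"): 0.4, ("生命", "起源"): 0.5,
    ("<s>", "研究生"): 0.1, ("研究生", "命"): 0.001, ("命", "起源"): 0.01,
}
sentence = "研究生命起源"   # "study the origin of life"
print(forward_max_match(sentence, dictionary))
print(bigram_segment(sentence, dictionary, bigram_prob))
```

Forward maximum matching greedily takes 研究生 ("graduate student") and leaves 命 stranded, while the bigram model prefers 研究 / 生命 / 起源, which is the kind of ambiguity that mechanical segmentation cannot resolve on its own.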

Knowledge is the experience people accumulate in transforming the objective world, together with the product of summarizing and refining that experience. Knowledge is the basis of all intelligent behavior and an important research topic in artificial intelligence. For a computer to have intelligence, it must have knowledge. Choosing a suitable knowledge representation and using it correctly can greatly improve the efficiency of solving artificial intelligence problems. From the computer's point of view, the words and sentences of natural language are merely isolated string constants held in memory and carry no special meaning. If these strings are organized according to certain rules or structures and converted into a form convenient for program processing, and the computer program, after searching, association, judgment, reasoning, substitution and so on, outputs the result in natural language, the computer can be considered to possess a certain degree of intelligence. At present, semantic knowledge can be represented by the following methods.

(1) Logical representation. To represent knowledge by logic, knowledge described in natural language is formalized by introducing predicates and functions to obtain the corresponding logical formulas, which are then represented in the machine's internal code. Terms are constants that describe objects in the world, including abstract things; predicates are constants that describe relations and attributes; the logical connectives are conjunction (∧), disjunction (∨), negation (～), conditional (→) and biconditional (↔); the quantifiers are the universal quantifier (∀) and the existential quantifier (∃). Reasoning is carried out by resolution or other methods.

(2) Production representation. Production representation readily describes facts, rules and their uncertainty measures. A production system consists of a knowledge base and an inference engine, and the knowledge base in turn consists of a rule base and a database.

The rule base is the set of production rules, and the database is the set of facts. The rule base stores knowledge of a particular domain; its rules are expressed as productions and include the transformation rules from the initial state to the final solution state. The database stores the input facts, facts imported from external databases, and intermediate results. The inference engine is the control program and comprises the inference mode and the control strategy. There are three inference modes: forward reasoning, backward reasoning and bidirectional reasoning.

Production representation has a fixed and simple format; rules are largely independent of one another, and the knowledge base is separated from the inference engine, so the knowledge base can be modified independently. Production representation is therefore often used in building expert systems.
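A minimal sketch of the forward-reasoning mode described above: rules are (conditions, conclusion) pairs, the database is a set of facts, and the inference engine fires rules until nothing new can be derived. The rules and facts are invented examples, and uncertainty measures are left out.

```python
def forward_chain(facts, rules):
    """Forward reasoning: repeatedly fire every rule whose conditions all hold."""
    known = set(facts)          # the database: input facts plus intermediate results
    changed = True
    while changed:
        changed = False
        for conditions, conclusion in rules:
            if conclusion not in known and all(c in known for c in conditions):
                known.add(conclusion)
                changed = True
    return known

# Rule base: IF all conditions THEN conclusion (hypothetical domain knowledge).
rules = [
    (("has_feathers",), "is_bird"),
    (("is_bird", "cannot_fly", "swims"), "is_penguin"),
]
facts = {"has_feathers", "cannot_fly", "swims"}
print(sorted(forward_chain(facts, rules)))
```

Because the rule list is plain data passed to the engine, rules can be added or removed without touching `forward_chain`, which mirrors the separation of knowledge base and inference engine noted above.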

(3) Semantic network representation. A semantic network is a directed graph linked by triples (node A, arc with label R, node B), as shown in Fig. 2. Nodes represent concepts, things, events, situations and so on. An arc is directed and labeled; its direction indicates the primary and secondary roles, node A being primary and node B secondary, and the label R denotes an attribute of node A or the relation between node A and node B.

Semantic networks can represent relations between things such as inheritance, supplementation, change and refinement. They are intuitive and easy to understand, convenient for reasoning, and widely used.
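The triple form (node A, label R, node B) maps directly onto a small data structure. The sketch below stores triples in a set and follows "is_a" arcs to show the inheritance relation mentioned above; the class name, the relation names and the example facts are assumptions made for illustration.

```python
class SemanticNetwork:
    """A directed graph stored as (node_a, relation, node_b) triples."""

    def __init__(self):
        self.triples = set()

    def add(self, node_a, relation, node_b):
        self.triples.add((node_a, relation, node_b))

    def objects(self, node_a, relation):
        """All node B values reached from node_a over arcs labeled `relation`."""
        return {b for (a, r, b) in self.triples if a == node_a and r == relation}

    def inherited(self, node, relation):
        """Values of `relation` on the node itself and on all of its is_a ancestors."""
        result, seen, stack = set(), set(), [node]
        while stack:
            current = stack.pop()
            if current in seen:
                continue
            seen.add(current)
            result |= self.objects(current, relation)
            stack.extend(self.objects(current, "is_a"))
        return result

net = SemanticNetwork()
net.add("canary", "is_a", "bird")
net.add("bird", "is_a", "animal")
net.add("bird", "can", "fly")
net.add("canary", "can", "sing")
net.add("animal", "can", "breathe")
print(sorted(net.inherited("canary", "can")))   # ['breathe', 'fly', 'sing']
```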

(4) Frame representation. The basic idea of frame theory is that the human brain stores a large number of typical scenes. When a person faces a new situation, he or she selects from memory a basic knowledge structure called a frame. The frame is an empty shell of previously memorized knowledge whose concrete content changes with the new situation; revising and supplementing the details of this empty frame forms an understanding of the new situation, which is in turn stored in the brain. A frame is a network of nodes and relations (called slots), a data structure that gives a structured representation of a class of situations. A frame consists of a frame name and several slots; each slot has several values, which may be logical or numeric, or may be a program, a condition, a default value or a sub-frame. Frame representation is adaptable, general, well structured and flexible in inference, and it can combine declarative with procedural knowledge. However, frames do not express procedural knowledge easily, so in concrete systems they are often used together with other methods.
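A frame with named slots, default values inherited from a parent frame, and a slot whose value is a program can be sketched as follows. The class, the slot names and the "room" example are hypothetical and only illustrate the slot types listed above.

```python
class Frame:
    """A frame: a name, an optional parent frame, and a set of slots."""

    def __init__(self, name, parent=None, **slots):
        self.name = name
        self.parent = parent
        self.slots = dict(slots)

    def get(self, slot):
        """Look up a slot, falling back to defaults stored in ancestor frames.

        If the stored value is callable it is treated as an attached procedure
        and is run against the frame on which the lookup started.
        """
        frame = self
        while frame is not None:
            if slot in frame.slots:
                value = frame.slots[slot]
                return value(self) if callable(value) else value
            frame = frame.parent
        raise KeyError(slot)

# Generic frame with default values and one procedural slot.
room = Frame("room", walls=4, has_door=True,
             area=lambda f: f.get("width") * f.get("length"))
# A concrete situation fills in the details and inherits the rest.
meeting_room = Frame("meeting_room", parent=room, width=5, length=8, seats=12)

print(meeting_room.get("walls"))   # default inherited from the "room" frame: 4
print(meeting_room.get("area"))    # computed by the attached procedure: 40
```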

Knowledge query module: solving artificial intelligence problems is knowledge-based, and the scale of the knowledge base in this module reflects, to some extent, the intelligence level of the computer. Human knowledge, however, is vast and takes many forms; under the limits of current computer technology, not all of it can be expressed as rules.

The knowledge base of the text chat module can be divided into a dictionary, a rule base, a semantic knowledge base and a common-sense base.

The dictionary is mainly used for word segmentation and contains words together with their senses, parts of speech, frequencies and similar information. Some basic semantic knowledge can also be generated dynamically from the word senses in the dictionary.

The rule base stores the grammar rules of Chinese sentences. The rules are used to judge whether a sentence is grammatical and can also be used to construct simple sentences; rules can be added dynamically.

The semantic knowledge recorded in the semantic knowledge base is mainly knowledge of semantic relations, in essence a huge network of relations between words. Through this network, words can be substituted and deeper semantics derived.

The common-sense base may hold knowledge people use in daily life or professional knowledge of a particular field. Its content is the broadest, and it may take the form of text, pictures, sound, video and so on. Acquiring the knowledge, building the common-sense base and ensuring the correctness of every entry costs a great deal of manpower and material resources, so building it is a long-term process. The common-sense base should be built independently of the program design: once a common-sense base for a given field has been built, the chat, education and consulting robot can be applied to that field. Because the amount of data in the common-sense base is huge, how to store it quickly, build indexes and speed up data retrieval needs further research.

A knowledge base can be built manually, automatically by a computer program, or by a combination of the two. Some basic bases, such as the dictionary and the rule base, are built manually, or existing knowledge-base resources can be obtained from the Internet and improved. Common-sense bases can first be collected from the Internet by computer, then manually checked and revised, and saved to the database in a given format.

Knowledge query based on natural language means that the user describes the query target to the retrieval system in natural language; the system automatically extracts key features such as the query conditions and query target from the query text, searches the database for matching records according to certain rules and algorithms, and returns them to the user as the query result. Knowledge query requires one or more specific knowledge bases to be prepared in advance, for example a particular professional course, product operating instructions, or the rules and regulations of an enterprise. Unlike the chat function module, knowledge query is a knowledge question-answering task: answers should be as accurate as possible, and a question that cannot be answered is answered with "I don't know" rather than by deliberately changing the subject.

Knowledge query preprocesses the input sentence in the same way as the chat function module: word segmentation and grammatical and semantic analysis are performed first. To answer the user's question correctly, the system must first know what the user is asking, i.e. the type of the question, and must also be clear about which requirements the final answer has to satisfy.

Question types in the query process: the interrogative word is the main basis for determining the question type and the answer requirements. When determining the question type, the system therefore first finds the interrogative word in the question and analyzes the possible answer types from it. Interrogative words differ in discriminating power. The interrogative "where" shows at once that a place is being asked about; it is a "special interrogative". If "what" appears in the sentence, however, the type cannot be judged from the interrogative alone, because many types of question contain this "general interrogative"; a correct judgment requires another word in the question, called the "question focus" or "interrogative qualifier". The question focus is the noun or noun phrase that states the main content of the question, and the main content of the question is exactly the condition that the answer sought in this embodiment must satisfy. How is the question focus determined? In general, the first noun or noun phrase in the question is very likely to be the question focus. A question put to a question answering system generally consists of a single sentence. All nouns in the sentence are extracted first, and the judgment is then made from the positions of the interrogative and the nouns in the sentence. From observation and statistics over a large number of questions, the rules for judging the question type when a general interrogative is present are summarized as follows:

(1) If the interrogative is immediately followed by a noun or noun phrase, that noun or noun phrase is taken as the question focus.

(2) If the interrogative is at the end of the sentence, the noun or noun phrase nearest to the interrogative is taken as the question focus.

(3) If the interrogative is followed by a verb (such as "is"), the last noun or noun phrase in the sentence is taken as the question focus.

Table 1 shows the correspondence between question types and answer requirements.

Table 1

Question type	Example interrogatives	Answer requirement
Time	when, what year	Answer is time information
Place	where, which country	Answer is location information
Person	who	Answer is a description of a person
Reason	why	Must contain cause information
Quantity	how many, how much	Must contain quantity information
General noun	what + general noun	Description of that noun
State	how + adjective	Description of the state
Action	how + verb	Description of the action
Definition or event	what	Must be in summary form
Yes/no	whether	Answer is yes or no
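The three focus rules and part of Table 1 can be sketched as a small classifier. The sketch assumes the question has already been segmented and part-of-speech tagged, with "n" marking nouns and "v" marking verbs; the tag set, the tiny interrogative lists and the function name are illustrative assumptions, and noun phrases are reduced to single nouns for brevity.

```python
# Special interrogatives determine the question type on their own.
SPECIAL = {
    "什么时候": "time",    # when
    "哪里": "place",       # where
    "谁": "person",        # who
    "为什么": "reason",    # why
}
GENERAL = {"什么"}          # what: needs a question focus

def classify(tagged):
    """Classify a question given as a list of (word, tag) pairs.

    Returns (question_type, focus), where focus is None unless a general
    interrogative was found and one of the three rules applied.
    """
    words = [w for w, _ in tagged]
    for w in words:
        if w in SPECIAL:
            return SPECIAL[w], None
    idx = next((i for i, w in enumerate(words) if w in GENERAL), None)
    if idx is None:
        return "unknown", None
    nouns = [i for i, (_, tag) in enumerate(tagged) if tag == "n"]
    next_tag = tagged[idx + 1][1] if idx + 1 < len(tagged) else None
    focus = None
    if next_tag == "n":                    # rule 1: noun right after the interrogative
        focus = idx + 1
    elif next_tag is None:                 # rule 2: interrogative ends the sentence
        before = [i for i in nouns if i < idx]
        focus = before[-1] if before else None
    elif next_tag == "v" and nouns:        # rule 3: verb follows, take the last noun
        focus = nouns[-1]
    return "general", (words[focus] if focus is not None else None)

# "What is the capital of China?" in Chinese word order: China / 's / capital / is / what
question = [("中国", "n"), ("的", "u"), ("首都", "n"), ("是", "v"), ("什么", "q")]
print(classify(question))
```

In a fuller system the focus word would then be looked up to decide, for example, whether a "what + general noun" question is really asking for a place, a person or a definition.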

Speech synthesis module: the block diagram of the speech synthesis module is shown in Fig. 4. Speech synthesis converts information that exists as text or in other forms into a speech signal so that people can obtain the information by listening. A text-to-speech (TTS) system is a speech synthesis system that takes a text string as input. Its input is an ordinary text string. The text analyzer in the system first uses a pronunciation dictionary to decompose the input string into words with attribute tags and their pronunciation symbols, and then, according to semantic and phonetic rules, determines the stress level of each word and syllable, the sentence structure and intonation, and the various pauses. The text string is thereby converted into a string of symbol codes. Based on the results of this analysis, the prosodic features of the target speech are generated and a synthesis technique is used to produce the output speech.

Depending on how the synthesis units are processed, synthesis algorithms fall into three classes: 1. articulatory parameter synthesis (Articulatory Parameter Synthesis); 2. parametric analysis synthesis (Parametric Analysis Synthesis); 3. waveform coding synthesis (Waveform Coding Synthesis). The first two are essentially built on the source-filter model of speech production established by Fant, and use different physical or mathematical models to represent the three parts of speech production: the sound source, vocal-tract filtering and radiation. The last is essentially a statistical model based on linguistic rules.

(1) Articulatory parameter synthesis. Research on speech synthesis began with articulatory parameter synthesis. The method analyzes the physiological mechanism of pronunciation, records with instruments the physiological parameters of the vocal organs as they produce different speech units, and derives from them the parameter series needed to control the synthesis model. In essence this is an approach that can reflect the true nature of speech synthesis, but because the physiology and physics of the human vocal organs and the nervous system that controls their movement are not yet well understood, synthesis systems based on articulatory parameters are still at an exploratory stage.

(2) Parametric analysis synthesis. Parametric analysis synthesis analyzes the natural speech of the synthesis units (mainly syllables, half-syllables or phonemes) by a certain method, obtains the characteristic parameters of each unit and stores them to form a sound library. During synthesis, the characteristic parameters of the required units are retrieved, transformed according to certain rules and sent to the synthesizer to produce the synthetic speech. Because it is effective and flexible, this class of methods is widely used in unlimited-vocabulary synthesis systems.

(3) Waveform coding synthesis. Waveform coding synthesis based on large corpora is attracting more and more attention. The speech units of the synthesized sentence are selected from a pre-recorded, compression-coded speech database. As long as the database is large enough to contain every speech unit in every possible context, it is in theory possible to splice together any sentence with high naturalness using an efficient search algorithm. Because the synthesis units all come from natural recordings, the clarity and naturalness of the synthesized sentence are very high. The drawbacks of the method are that the corpus is very large, so building the speech database is time-consuming, laborious and inflexible, the storage required is excessive, and the scope for prosody adjustment is extremely limited. Selecting the optimal synthesis units requires an efficient algorithm for the system to run smoothly.
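One common way to frame the search for optimal synthesis units in a concatenative system is dynamic programming over a target cost (how well a stored unit fits the desired prosody) and a join cost (how smoothly adjacent units connect). The sketch below is a generic illustration of that idea, not the method of the utility model; the unit tuples, the pitch-based cost functions and all names are assumptions.

```python
def select_units(targets, candidates, target_cost, join_cost):
    """Choose one candidate unit per target so that the total cost is minimal.

    targets:    desired specification for each position (here: a pitch value)
    candidates: for each position, a list of stored units that could fill it
    """
    # best holds, for each candidate of the current position, (cost, path so far)
    best = [(target_cost(targets[0], unit), [unit]) for unit in candidates[0]]
    for target, units in zip(targets[1:], candidates[1:]):
        new_best = []
        for unit in units:
            cost, path = min(
                ((prev_cost + join_cost(prev_path[-1], unit), prev_path)
                 for prev_cost, prev_path in best),
                key=lambda item: item[0],
            )
            new_best.append((cost + target_cost(target, unit), path + [unit]))
        best = new_best
    return min(best, key=lambda item: item[0])[1]

# Each stored unit is (name, start_pitch, end_pitch); the values are made up.
candidates = [
    [("ni_a", 100, 100), ("ni_b", 100, 130)],
    [("hao_a", 120, 120), ("hao_b", 130, 110)],
]
targets = [100, 120]   # desired average pitch for each syllable

def pitch_target_cost(target, unit):
    return abs(target - (unit[1] + unit[2]) / 2)

def pitch_join_cost(prev_unit, unit):
    return abs(prev_unit[2] - unit[1])

chosen = select_units(targets, candidates, pitch_target_cost, pitch_join_cost)
print([unit[0] for unit in chosen])
```

In this toy case the search picks "ni_b" and "hao_b": "ni_a" fits its own target better, but the smooth join between "ni_b" and "hao_b" gives a lower total cost, which is the trade-off such a search has to weigh.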

Microsoft Speech SDK 5.1 fully supports the development of Chinese speech applications. The SDK provides the speech recognition and synthesis engine components, application-level interfaces, technical documentation and help files. It is developed to the COM standard: the underlying protocols are implemented as COM components completely independent of the application layer, which shields application programmers from the complex speech technology and makes full use of the advantages of COM. All speech-related work is done by COM components: speech recognition is managed by the recognition engine (Recognition Engine), and speech synthesis is handled by the speech synthesis engine (Synthesis Engine). The programmer need only concentrate on the application itself and call the relevant Speech Application Programming Interface (SAPI) to implement speech functions.

Speech recognition is carried out through the coordination of a series of COM interfaces. The main speech recognition interfaces are:

(1) ISpRecognizer interface: used to create an instance of the speech recognition engine; the kind of engine is selected by a parameter at creation. There are two kinds of recognition engine: the exclusive engine (InProc Recognizer) and the shared engine (Shared Recognizer). An exclusive engine object can be used only by the application that created it, while a shared engine can be used by several applications at once.

(2) ISpRecoContext interface: mainly used to send and receive the event messages related to speech recognition, and to load and unload recognition grammar resources.

(3) ISpRecoGrammar interface: through this interface the application can load and activate grammar rules, which define the words, phrases and sentences expected to be recognized. There are usually two kinds of grammar: dictation grammar (Dictation Grammar) and command and control grammar (Command and Control Grammar).

(4) ISpPhrase interface: used to obtain the recognition result, including the recognized text and which grammar rule was matched.

The speech recognition function is carried out jointly by the COM interfaces above and follows a specific working procedure. In short, speech recognition follows the working principle of COM components and that of ordinary Windows applications (the message-driven mechanism), as follows. First, COM is initialized. Next, each speech interface is instantiated in a specific order, and the recognition grammar and recognition message are set so that the recognition engine is in its working state. When a grammar rule is recognized, the speech interface sends a speech recognition message to the application. In the handler for the recognition message, the recognition result is obtained through the ISpPhrase interface. When the application exits, COM is uninitialized.

Claims

1. A chat robot, characterized by comprising cameras, a camera driver module, a voice pickup module, and a computer for implementing speech recognition, knowledge query and speech generation; the voice pickup module is a microphone for picking up voice signals; the cameras are used for capturing face images; and the number of cameras is 2.

2. The chat robot according to claim 1, characterized in that the cameras have 5 degrees of freedom.