KR20140022320A - Method for operating an image display apparatus and a server - Google Patents
- Publication number: KR20140022320A
- Application number: KR1020120089061A
- Authority
- KR
- South Korea
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/18—Speech classification or search using natural language modelling
- G10L15/1815—Semantic context, e.g. disambiguation of the recognition hypotheses based on word meaning
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/18—Speech classification or search using natural language modelling
- G10L15/183—Speech classification or search using natural language modelling using context dependencies, e.g. language models
- G10L15/19—Grammatical context, e.g. disambiguation of the recognition hypotheses based on word sequence rules
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/40—Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
- H04N21/43—Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
- H04N21/441—Acquiring end-user identification, e.g. using personal code sent by the remote control or by inserting a card
- H04N21/4415—Acquiring end-user identification, e.g. using personal code sent by the remote control or by inserting a card using biometric characteristics of the user, e.g. by voice recognition or fingerprint scanning
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/40—Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
- H04N21/47—End-user applications
- H04N21/472—End-user interface for requesting content, additional data or services; End-user interface for interacting with content, e.g. for content reservation or setting reminders, for requesting event notification, for manipulating displayed content
Abstract
Description
The present invention relates to an image display apparatus, a server, and an operation method thereof, and more particularly, to an image display apparatus, a server, and an operation method capable of providing efficient voice recognition and resource management.
A video display device is a device having a function of displaying an image that a user can view. The user can view the broadcast through the video display device. A video display device displays a broadcast selected by a user among broadcast signals transmitted from a broadcast station on a display. Currently, broadcasting is shifting from analog broadcasting to digital broadcasting worldwide.
Digital broadcasting refers to broadcasting in which digital video and audio signals are transmitted. Digital broadcasting is more resistant to external noise than analog broadcasting, so it has less data loss, is advantageous for error correction, has a higher resolution, and provides a clearer picture. Also, unlike analog broadcasting, digital broadcasting is capable of bidirectional service.
On the other hand, as the functions and services of the image display apparatus diversify, research is increasing on voice recognition technology that can recognize user speech for operations such as menu selection, text input, command input, and channel switching.
SUMMARY OF THE INVENTION
An object of the present invention is to provide an image display apparatus, a server, and a method of operating the same, which can be operated accurately and conveniently through voice recognition technology and can efficiently manage internal resources.
Another object of the present invention is to provide an image display apparatus, a server, and an operation method thereof, which may improve user convenience.
According to an aspect of the present invention, there is provided a method of operating an image display device, the method comprising: requesting data related to a language model from an external electronic device connected through a network; receiving the data related to the language model from the external electronic device; and updating the language model stored in the voice database based on the received data.
In addition, to achieve the above object, an operation method of the server according to an embodiment of the present invention includes: receiving, from the image display device, a request for data related to a language model for speech recognition; transmitting response data to the image display device in accordance with the request; and updating the database based on the request and response data.
According to the present invention, it is possible to operate accurately and conveniently through the voice recognition technology and to efficiently manage internal resources, thereby improving user convenience.
1A and 1B illustrate an image display system according to an exemplary embodiment of the present invention.
2 is an internal block diagram of an image display apparatus according to an embodiment of the present invention.
3 is an example of an internal block diagram of the control unit of FIG.
4 is an example of an internal block diagram of the server of FIG.
5 is a diagram illustrating a control method of the remote controller of FIG. 1.
6 is a perspective view of a remote control apparatus according to an embodiment of the present invention.
7 is an internal block diagram of a remote control apparatus according to an embodiment of the present invention.
8 and 9 are views referred to for describing a speech recognition process according to an embodiment of the present invention.
10A to 10B illustrate various examples of a platform structure diagram of the image display apparatus of FIG. 1.
11 is a diagram showing a platform structure according to an embodiment of the present invention.
12 is a flowchart illustrating a method of operating an image display apparatus according to an exemplary embodiment of the present invention.
13 is a flowchart illustrating a method of operating a server according to an exemplary embodiment of the present invention.
14 is a flowchart illustrating a method of operating an image display apparatus system according to an exemplary embodiment of the present invention.
15 is a diagram referred to describe an example of an operating method of an image display device system according to an exemplary embodiment of the present invention.
16 is a flowchart illustrating a method of operating an image display apparatus system according to an exemplary embodiment of the present invention.
17 to 24 are views referred to for describing various examples of an operating method of an image display device system according to an embodiment of the present invention.
Hereinafter, the present invention will be described in more detail with reference to the drawings.
The suffixes "module", "engine" and "part" for components used in the following description are given merely for ease of drafting the present specification, and do not by themselves carry any particular meaning or role. Therefore, "module", "engine" and "unit" may be used interchangeably.
1A and 1B illustrate an image display system according to an exemplary embodiment of the present invention.
Referring to FIG. 1A, an
The
The
On the other hand, the
The voice database may include an acoustic model database and a language model database, which store an acoustic model and a language model, respectively.
The voice database may further include a pronunciation dictionary database for storing vocabularies and corresponding pronunciation symbols. According to an embodiment, the speech recognition engine may further comprise a pronunciation symbol generation module for generating a pronunciation symbol from the received text data.
The speech recognition engine processes the received voice signal, and outputs the voice recognition result data by comparing the data with data stored in the voice database. On the other hand, the
Meanwhile, the
The
In addition, data for content reproduction, data including data related to the content or the
Referring to FIG. 1B, the
Meanwhile, in the present specification, the
For example, the storage means stores and manages data related to speech recognition, and a computer, another video display device, a smart phone, a tablet PC, or the like may be connected through a network to the
2 is an internal block diagram of an image display apparatus according to an embodiment of the present invention.
Referring to FIG. 2, the
The
The
For example, if the selected RF broadcast signal is a digital broadcast signal, it is converted into a digital IF signal (DIF). If the selected RF broadcast signal is an analog broadcast signal, it is converted into an analog baseband image or voice signal (CVBS / SIF). That is, the
The
Meanwhile, the
On the other hand, the
The
The
The stream signal output from the
The external
The external
The A / V input / output unit can receive video and audio signals from an external device. Meanwhile, the wireless communication unit can perform short-range wireless communication with other electronic devices.
The
The
In addition, the
Meanwhile, the
In addition, the
Although the
The user
(Not shown), such as a power key, a channel key, a volume key, and a set value, from the
The
The video signal processed by the
The audio signal processed by the
Although not shown in FIG. 2, the
In addition, the
In addition, the
Meanwhile, the
Meanwhile, the
Such a 3D object may be processed to have a different depth than the image displayed on the
On the other hand, the
Although not shown in the drawing, a channel browsing processing unit for generating a thumbnail image corresponding to a channel signal or an external input signal may be further provided. The channel browsing processing unit receives the stream signal TS output from the
At this time, the thumbnail list may be displayed in a simple view mode displayed on a partial area in a state where a predetermined image is displayed on the
The
The
In order to view the three-dimensional image, the
The single display method can implement a 3D image only on the
Meanwhile, the additional display method may implement a 3D image by using an additional display as a viewing device in addition to the
On the other hand, the glasses type can be further divided into a passive type such as a polarizing glasses type and an active type such as a shutter glass type. Also, the head mount display type can be divided into a passive type and an active type.
Meanwhile, the
The
A photographing unit (not shown) photographs the user. The photographing unit (not shown) may be implemented by a single camera, but the present invention is not limited thereto, and may be implemented by a plurality of cameras. On the other hand, the photographing unit (not shown) may be embedded in the
The
The
The
Meanwhile, the
Meanwhile, a block diagram of the
On the other hand, the
In the following, an embodiment of the present invention will be described with reference to the
3 is an example of an internal block diagram of the control unit of FIG.
The
The
The
The video decoder 225 decodes the demultiplexed video signal and the scaler 235 performs scaling so that the resolution of the decoded video signal can be output from the
The video decoder 225 may include a decoder of various standards.
On the other hand, the image signal decoded by the
For example, when an external video signal input from the external device 190 or a broadcast video signal of a broadcast signal received from the
Meanwhile, the image signal decoded by the
The
In addition, the
In addition, the
In addition, the
The
The
The
The
A frame rate converter (FRC) 350 can convert the frame rate of an input image. On the other hand, the
The
The
In the present specification, a 3D video signal means a video signal including a 3D object. Examples of the 3D object include a picture-in-picture (PIP) image (still image or moving picture), an EPG indicating broadcast program information, icons, text, objects in an image, people, backgrounds, and web screens (newspapers, magazines, etc.).
On the other hand, the
Meanwhile, the
Although not shown in the drawing, it is also possible that a 3D processor (not shown) for 3-dimensional effect signal processing is further disposed after the
Meanwhile, the audio processing unit (not shown) in the
In addition, the audio processing unit (not shown) in the
The data processing unit (not shown) in the
In FIG. 3, the signals from the
Meanwhile, the block diagram of the
In particular, the
4 is an example of an internal block diagram of the server of FIG.
Referring to FIG. 4, the
The
The
The data related to speech recognition may be an acoustic model, a language model, or pronunciation dictionary data used in the speech recognition process, or a voice signal received by the
The broadcast program related data may include detailed information of the broadcast program and data including additional information, transport stream data for reproduction of the broadcast program, or may include data transcoded in another manner.
On the other hand, the
The
The
Meanwhile, the
In addition, the
The
Meanwhile, the
5 is a diagram illustrating a control method of the remote controller of FIG. 1.
5A illustrates that the
The user can move or rotate the
5B illustrates that when the user moves the
Information on the motion of the
5C illustrates a case in which the user moves the
On the other hand, when the specific button in the
On the other hand, the moving speed and moving direction of the
6 is a perspective view of a remote control apparatus according to an embodiment of the present invention, and FIG. 7 is an internal block diagram of the remote control apparatus.
Referring to FIG. 6, the spatial remote control 201 according to an embodiment of the present invention may include various input keys or input buttons.
For example, the spatial remote controller 201 may include an Okay key 291, a
For example, the Okay key 291 may be used to select a menu or item, the
In addition, the spatial remote controller 201 may further include a
On the other hand, as shown in the figure, the
In detail, when an image larger than the size of the display is displayed on the
This scroll function may be provided with a separate key other than the
On the other hand, the four
Referring to FIG. 7, the
The
Meanwhile, the coordinate
In the present embodiment, the
Further, the
The
Also, the
On the other hand, according to the present embodiment, the
On the other hand, the
The
The
For example, the
The
For example, the
The
The
In addition, the
The
The
Meanwhile, the voice signal and data may be transmitted to the
8 and 9 are views referred to for describing a speech recognition process according to an embodiment of the present invention.
Referring to FIG. 8A, when the user enters the voice channel switching mode, the
The
On the other hand, the voice channel switching mode may be entered in a variety of ways, such as pressing one of the hard keys provided on the
When the user inputs the voice "Kbc" through the microphone of the remote controller as shown in FIG. 8 (b), the apparatus may recognize the input voice signal and switch the channel to the matching "Kbc" channel.
9 briefly illustrates a configuration and operation of an example of a speech recognition engine.
Referring to FIG. 9, a preprocessing such as noise processing is performed on a first received
A commonly used method determines the speech interval and the silence interval by computing the energy value (or log energy value) of the input signal at every interval and comparing it with a statistically predetermined threshold value.
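As an illustration, the energy-threshold endpoint detection described above can be sketched as follows; the frame length and the −30 dB threshold are illustrative assumptions, not values from the patent:

```python
import math

def detect_speech_frames(samples, frame_len=160, threshold_db=-30.0):
    """Label each frame as speech (True) or silence (False) by log energy.

    Hypothetical sketch of the energy-threshold endpoint detection
    described in the text; frame length and threshold are example values.
    """
    flags = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        energy = sum(s * s for s in frame) / frame_len
        log_energy = 10.0 * math.log10(energy + 1e-12)  # avoid log(0)
        flags.append(log_energy > threshold_db)
    return flags
```

Consecutive speech-labeled frames would then be merged into the speech interval passed on to feature extraction.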
In addition, the preprocessing process may include noise processing to remove noise.
Meanwhile, the
Thereafter, a feature vector (parameter) effective for recognition is extracted from the input speech signal.
Here, a method based on LPC (Linear Predictive Coefficients) or an MFCC (Mel-Frequency Cepstral Coefficients) extraction method can be used.
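For illustration, a simplified MFCC computation for a single frame might look like the following sketch; the sample rate, FFT size, and filter counts are assumed example values, and a production implementation would add pre-emphasis, liftering, and delta features:

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc(frame, sample_rate=16000, n_filters=26, n_ceps=13):
    """Simplified MFCCs for one speech frame (illustrative sketch)."""
    n_fft = 512
    windowed = frame * np.hamming(len(frame))
    power = np.abs(np.fft.rfft(windowed, n_fft)) ** 2 / n_fft

    # Triangular mel filterbank between 0 Hz and the Nyquist frequency
    mel_points = np.linspace(hz_to_mel(0), hz_to_mel(sample_rate / 2),
                             n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sample_rate).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        for k in range(l, c):
            fbank[i - 1, k] = (k - l) / max(c - l, 1)
        for k in range(c, r):
            fbank[i - 1, k] = (r - k) / max(r - c, 1)

    log_energies = np.log(fbank @ power + 1e-10)
    # DCT-II decorrelates the log filterbank energies; keep n_ceps coefficients
    n = np.arange(n_filters)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), (2 * n + 1))
                 / (2 * n_filters))
    return dct @ log_energies
```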
Thereafter, the pattern of the voice signal and the extracted feature parameters may be recognized (940), and an
This is a method of modeling and comparing the signal characteristics of speech; a direct comparison method that sets the recognition object as a feature vector model and compares it with the feature vector of the input signal can be used. In the direct comparison method, a unit of the recognition target, such as a word or phoneme, is set as a feature vector model, and how similar the input speech is to that model can be measured.
Alternatively, a statistical method that statistically processes and uses the feature vectors of the recognition object can be used. This statistical method constructs the unit of the recognition target as a state sequence and uses the relations between the state sequences. Examples include the DTW (Dynamic Time Warping) method, which uses the temporal alignment relation, and the HMM (Hidden Markov Model) method, which uses probability values.
In more detail, DTW (Dynamic Time Warping) is a method of obtaining the distance between the reference speech signal and the input speech signal using dynamic programming, and is mainly used for constructing a speaker-dependent isolated word recognition system and has a high recognition rate.
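The DTW comparison described above can be sketched with a simple dynamic-programming table; the one-dimensional toy "templates" below are assumptions for illustration (a real recognizer compares sequences of feature vectors):

```python
def dtw_distance(a, b, dist=lambda x, y: abs(x - y)):
    """DTW distance between two sequences via dynamic programming."""
    inf = float("inf")
    n, m = len(a), len(b)
    D = [[inf] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = dist(a[i - 1], b[j - 1])
            # Best of insertion, deletion, and match steps
            D[i][j] = cost + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    return D[n][m]

def recognize(templates, query):
    """Return the label of the reference template closest to the query."""
    return min(templates, key=lambda label: dtw_distance(templates[label], query))
```

Because DTW warps the time axis, a query that repeats a value (a slower utterance) still aligns with the template at low cost.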
HMM (Hidden Markov Model) is a method of expressing a speech unit as transition probabilities from one state to the next, constructing representative models from training data by using the temporal statistical characteristics of the speech signal, and adopting the probability model with the highest similarity to the input as the recognition result.
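As a minimal sketch of the HMM evaluation step, the forward algorithm below computes the likelihood of an observation sequence under a discrete-emission model; the recognizer would adopt the word model with the highest likelihood. The tiny model used in the test is purely illustrative:

```python
def forward_likelihood(init, trans, emit, observations):
    """Forward-algorithm likelihood P(observations | model) for a discrete HMM.

    init[i]: initial probability of state i; trans[i][j]: transition
    probability from state i to state j; emit[i][o]: probability that
    state i emits symbol o.
    """
    n_states = len(init)
    # Initialization with the first observation
    alpha = [init[i] * emit[i][observations[0]] for i in range(n_states)]
    # Recursion over the remaining observations
    for obs in observations[1:]:
        alpha = [
            emit[j][obs] * sum(alpha[i] * trans[i][j] for i in range(n_states))
            for j in range(n_states)
        ]
    return sum(alpha)
```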
Thereafter, data corresponding to the received voice signal may be determined, and a recognition result (960) may be output.
On the other hand, the data determination process may include an operation of the
The language model is generally used to find probability values for all possible word sequences. A grammar-based method considers only word sequences that are grammatically correct in a given situation among the possible combinations of words, while a statistics-based scheme statistically estimates the probability value of a possible word sequence from a database of speech uttered in the given situation.
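The statistics-based scheme can be illustrated with a maximum-likelihood bigram model; the toy corpus and the `<s>`/`</s>` sentence-boundary markers are assumptions for the sketch:

```python
from collections import defaultdict

def train_bigram(sentences):
    """Estimate bigram probabilities P(w2 | w1) from a small corpus."""
    counts = defaultdict(lambda: defaultdict(int))
    for sentence in sentences:
        words = ["<s>"] + sentence.split() + ["</s>"]
        for w1, w2 in zip(words, words[1:]):
            counts[w1][w2] += 1
    return {
        w1: {w2: c / sum(nxt.values()) for w2, c in nxt.items()}
        for w1, nxt in counts.items()
    }

def sentence_prob(model, sentence):
    """Probability of a word sequence under the bigram model (0 if unseen)."""
    words = ["<s>"] + sentence.split() + ["</s>"]
    p = 1.0
    for w1, w2 in zip(words, words[1:]):
        p *= model.get(w1, {}).get(w2, 0.0)
    return p
```

In a recognizer, these sequence probabilities are combined with the acoustic scores so that likely word orders are preferred.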
10A to 10B illustrate various examples of a platform structure diagram of the image display apparatus of FIG. 1.
The platform of the
First, referring to FIG. 10A, the platform of the
The
The
The hardware driver in the
In addition, the hardware driver in the
The
The
In addition, the
The
Examples of the
The
In addition, the
Meanwhile, the
The
The virtual machine (VM) may have a plurality of instances; that is, it may be a virtual machine capable of multitasking. Meanwhile, each virtual machine (VM) may be allocated and executed according to each application in the
The binder driver and the runtime 1032 may connect a Java-based application and a C-based library.
The
The
The
In addition, the
Through the application in this
Next, referring to FIG. 10B, the platform of the
The platform of FIG. 10B is different from that of the
On the other hand, the
The
On the other hand, the
The platform of FIGS. 10A and 10B described above may be used in a variety of electronic devices as well as an image display device.
Meanwhile, the platform of FIGS. 10A and 10B may be loaded in the
11 is a diagram illustrating a platform structure according to an embodiment of the present invention.
In more detail, the configuration of the
The embedded voice engine 1110 receives the channel name and the channel information, and may be configured to include a channel table 1114 that stores information necessary for channel switching, for example, the physical and logical addresses of each channel (Physical Number, Major Number, Minor Number).
Meanwhile, the channel table 1114 may be provided inside the voice engine 1110 or separately provided.
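A channel table of the kind described can be sketched as a simple mapping from recognized channel names to tuning addresses; the class and field names here are illustrative assumptions, not identifiers from the patent:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ChannelEntry:
    physical: int  # Physical Number (RF channel actually tuned)
    major: int     # Major Number (logical channel shown to the user)
    minor: int     # Minor Number (sub-channel)

class ChannelTable:
    """Maps recognized channel names to physical/logical channel addresses."""

    def __init__(self):
        self._entries = {}

    def register(self, name: str, physical: int, major: int, minor: int):
        self._entries[name.lower()] = ChannelEntry(physical, major, minor)

    def lookup(self, recognized_name: str) -> Optional[ChannelEntry]:
        """Return the tuning address for a recognized channel name, or None."""
        return self._entries.get(recognized_name.lower())
```

On a successful recognition such as "Kbc", the engine would look up the entry and hand the physical/major/minor numbers to the tuner.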
The
Meanwhile, unlike FIG. 11, the
The
Meanwhile, upon entering the voice channel switching mode, the
Thereafter, the
12 is a flowchart illustrating a method of operating an image display apparatus according to an exemplary embodiment of the present invention, and FIG. 13 is a flowchart illustrating a method of operating a server according to an exemplary embodiment of the present invention.
Referring to FIG. 12, the
Meanwhile, the
As described above with reference to FIG. 9, the voice recognition compares the input voice with data of a stored acoustic model, a language model, and the like, and outputs the most similar data as the recognition result data.
For example, the embedded speech engine included in the
The
However, for accurate speech recognition, the amount of data in the speech database should be large. In this case, not only is storage space consumed, but the search area also grows, which unnecessarily consumes the resources of the image display device and may slow down the recognition speed.
In particular, with the increase of natural language search and the like, the language model database has a significant effect on speech recognition performance. Whether a statistics-based language model such as an n-gram model or a grammar-based language model such as a context-free grammar is used, a sufficiently rich language model is required for accurate and fast speech recognition.
Therefore, the present invention does not always store and maintain the voice database used for speech recognition at its maximum size, but requests necessary data from the server when needed and performs an update that replaces or adds a part of the voice database with the received data.
Accordingly, it is possible to efficiently manage memory and resources, minimize the database search area and time in the speech recognition process, and improve the speech recognition speed.
Meanwhile, the updating step (S1230) may be characterized in that grammar or context data is dynamically generated based on the received data. That is, the received data may be at least a part of the grammar or context data, or data required for its generation, and the
The grammar or context data is data included in a language model. The grammar may define the rules and ordering inherent in a sentence; for example, the grammar may be defined as a BNF (Backus-Naur Form) grammar. The context is a general term for the semantic and logical relations established between the components of a sentence, and can represent the preceding and following connections of the vocabulary.
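As an illustration of how a small BNF-style grammar might constrain the recognizable utterances for channel switching, consider the sketch below; the rule names, action phrases, and channel vocabulary are assumptions for the example, not content from the patent:

```python
# Hypothetical BNF-style grammar for voice channel switching, roughly:
#   <command> ::= <action> <channel> | <channel>
#   <action>  ::= "switch to" | "change to" | "go to"
#   <channel> ::= "kbc" | "abc" | "nbc"
GRAMMAR = {
    "<action>": ["switch to", "change to", "go to"],
    "<channel>": ["kbc", "abc", "nbc"],
}

def matches_command(utterance: str) -> bool:
    """Check whether an utterance fits <command> ::= [<action>] <channel>."""
    text = utterance.lower().strip()
    for action in GRAMMAR["<action>"]:
        if text.startswith(action + " "):
            text = text[len(action) + 1:]
            break
    return text in GRAMMAR["<channel>"]
```

Dynamically regenerating `GRAMMAR` from server-provided data corresponds to the context update the text describes: only utterances covered by the current grammar are candidates for recognition.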
Meanwhile, the speech recognition engine may analyze syntax and semantic analysis of the input speech using the grammar and context, and determine the most similar data as the speech recognition result.
Meanwhile, in the case of speech recognition using an embedded speech engine, a word or sentence that attempts speech recognition may not be recognized unless it is previously defined in a language model, in particular, a context. Thus, the embedded speech engine can use pre-generated (compiled) binary context data at speech mode entry or prior to system startup or boot.
According to the present invention, when the image display apparatus is driven, various kinds of information suitable for the characteristics of the image display apparatus may be taken to dynamically update a language model, particularly context data.
The context data may be generated in a language model or compiled into binary data based on the grammar data and the pronunciation dictionary data.
Meanwhile, the context data may be updated when the voice signal is input or when the voice mode is entered.
Alternatively, the updating of the context data may be performed at a predetermined time when the image display apparatus is turned on by setting.
The method of operating an image display device according to an exemplary embodiment of the present invention may include receiving a voice signal, extracting a feature vector based on the received voice signal, and comparing the feature vector with the voice database. The method may further include determining data corresponding to the received voice signal. Here, the voice signal may be received through a remote control device.
The data request step associated with the language model may include requesting a language model associated with data corresponding to the feature vector or the received voice signal. That is, when requesting data related to the language model to the server, the received voice signal may be transmitted to the server, or the data generated during the voice recognition process may be transmitted to the server to request a language model for speech recognition.
Alternatively, the data request step associated with the language model may be a request for a language model related to content currently in use.
Through the
The probability of the vocabulary related to the image or the content displayed on the screen is higher than the probability of the vocabulary not related to the screen. As such, by using a language model having a high probability of the vocabulary related to the content, it is possible to increase the accuracy of speech recognition and to improve the speech recognition speed.
For example, when a user accesses a shopping site using the image display apparatus, the image display apparatus transmits the URL information and the displayed image or text information of the shopping site to a server, and may receive grammar or context data including predefined vocabulary and sentences related to shopping, such as product search, ordering, or payment.
Alternatively, the
In this case, at least some of voice information such as a language model and a word dictionary related to the additional information may be received together and stored in the embedded voice engine. Thereafter, the user may input a command by voice while watching a broadcast.
Meanwhile, the present invention may grasp content by using various well-known ACR (Auto content recognition) technology and request voice information related to the content from the server.
The
That is, a language model for speech recognition may be dynamically generated according to the user's intention, and the speed of speech recognition and the accuracy of the speech recognition result may be improved.
The method of operating an image display device according to an exemplary embodiment of the present invention may further include selecting one or more external electronic devices to request the data from among a plurality of external electronic devices connected through the network.
That is, the
Meanwhile, if the image display apparatus fails to receive the necessary data, it may request the data again from another server.
In addition to the electronic devices connected through the network, the update may be performed through an external storage medium connected through the external
Meanwhile, the operation method of the image display apparatus according to an embodiment of the present invention may further include receiving a voice signal and recognizing the voice signal using the voice database, and the data request step associated with the language model may include requesting, from the external electronic device, a language model including data corresponding to the voice signal when the recognition of the voice signal fails or the confidence value of the recognition result is lower than a reference value.
That is, data related to the language model may be requested, and the language model updated, only when speech recognition by the embedded speech recognition engine fails or its result is judged unsatisfactory.
The confidence value may be determined, for example, as a difference or distance between a feature vector of the input voice signal and the nearest vector corresponding to the voice recognition result data.
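As an illustrative sketch only, the distance-based confidence value described above could be computed as follows. The mapping from distance to a 0..1 score, the threshold `REFERENCE_VALUE`, and all function names are assumptions, not details given in the text.

```python
import math

REFERENCE_VALUE = 0.5  # hypothetical reference value R from the text


def confidence(feature_vec, reference_vecs):
    """Score a recognition hypothesis via the Euclidean distance between
    the input feature vector and the nearest reference vector: a smaller
    distance means a closer match, i.e. higher confidence."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    best = min(dist(feature_vec, r) for r in reference_vecs)
    # Map distance to a 0..1 confidence; this scaling is illustrative.
    return 1.0 / (1.0 + best)


c = confidence([1.0, 2.0], [[1.0, 2.1], [5.0, 5.0]])
needs_server_fallback = c < REFERENCE_VALUE
```

A result whose confidence falls below the reference value would then trigger the server request described in the surrounding text.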
Alternatively, the operating method of the image display apparatus according to an embodiment of the present invention may include receiving a voice signal; recognizing the voice signal using the voice database; transmitting data based on the voice signal to the external electronic device when recognition fails or the confidence value of the recognition result is at or below a reference value; and receiving voice recognition result data from the external electronic device.
That is, when speech recognition by the embedded speech recognition engine fails or its result is determined to be unsatisfactory, recognition may be requested from the external electronic device and the result data received from it.
The data based on the voice signal may be the voice signal itself, a feature vector extracted from it, or a partial recognition result. That is, the received voice signal may be transmitted to the server as-is, or the feature vector or recognition result produced by the embedded speech recognition engine may be transmitted instead.
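To illustrate why a client might send feature vectors instead of raw audio, here is a toy per-frame log-energy extractor. A real engine would compute MFCCs or similar; the frame size and the log-energy feature are assumptions made purely for illustration.

```python
import math
import struct


def extract_features(pcm_bytes, frame_size=160):
    """Toy per-frame log-energy features from little-endian 16-bit PCM.
    Each frame of `frame_size` samples is reduced to a single number,
    so the transmitted data is far smaller than the raw signal."""
    samples = struct.unpack("<%dh" % (len(pcm_bytes) // 2), pcm_bytes)
    feats = []
    for i in range(0, len(samples) - frame_size + 1, frame_size):
        frame = samples[i:i + frame_size]
        energy = sum(s * s for s in frame) / frame_size
        feats.append(math.log(energy + 1.0))  # +1 avoids log(0) on silence
    return feats
```

The resulting list of floats would stand in for "data based on the voice signal" when the raw waveform is not sent.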
Alternatively, the operating method of the image display apparatus according to an embodiment of the present invention may include receiving a voice signal; recognizing it using the voice database while, in parallel, transmitting data based on the voice signal to the external electronic device; receiving voice recognition result data from the external electronic device; and using the received result data when recognition using the voice database fails or the confidence value of the recognition result is at or below a reference value.
In this embodiment, the received voice signal is processed by the embedded speech recognition engine while related data is transmitted to the external electronic device in parallel. If the confidence value of the embedded engine's result is at or below the reference value, the result data received from the external electronic device may be used instead.
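The parallel embedded/server flow above can be sketched with a thread pool. The two recognizer stubs and the threshold are placeholders, not actual components of the described system.

```python
import concurrent.futures

REFERENCE_VALUE = 0.7  # hypothetical confidence threshold


def embedded_recognize(signal):
    # Stand-in for the built-in engine: returns (text, confidence).
    return ("volume up", 0.4)


def server_recognize(signal):
    # Stand-in for the network round trip to the external server.
    return ("volume up please", 0.95)


def recognize(signal):
    """Start the server request, run the embedded engine meanwhile, and
    fall back to the server result only when the embedded confidence is
    at or below the reference value."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=1) as pool:
        server_future = pool.submit(server_recognize, signal)
        text, conf = embedded_recognize(signal)
        if conf > REFERENCE_VALUE:
            server_future.cancel()  # best-effort; may already be running
            return text
        return server_future.result()[0]
```

Because both paths start at once, the added latency of the fallback is only the remainder of the server round trip, not a fresh request.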
Meanwhile, in the data request step (S1210) related to the language model, data including sentence-pattern information of the voice database may be transmitted. Here, the sentence-pattern information classifies the arrangement and combination of language elements in a sentence into types, and may be part of the context data of the embedded voice database.
Meanwhile, the data received in the receiving step (S1220) may include words that are frequently used together with the words in the sentence-pattern information, or words with high semantic similarity to them.
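As a rough sketch of how a server might select such co-occurring words, consider counting, over a text corpus, which words appear in the same sentences as the seed words sent by the display device. The function name, the toy corpus, and the whitespace tokenization are all assumptions for illustration.

```python
from collections import Counter


def cooccurring_words(corpus_sentences, seed_words, top_n=3):
    """Return the words that most frequently co-occur with the given
    seed words in the server-side corpus (a toy stand-in for a real
    language-model training database)."""
    counts = Counter()
    seeds = set(seed_words)
    for sentence in corpus_sentences:
        tokens = sentence.lower().split()
        if seeds & set(tokens):  # sentence shares a word with the request
            counts.update(t for t in tokens if t not in seeds)
    return [w for w, _ in counts.most_common(top_n)]
```

Semantic similarity (as opposed to raw co-occurrence frequency) would require word embeddings or a thesaurus, which is beyond this sketch.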
Referring to FIG. 13, in the method of operating the external electronic device, in particular the server, a data request related to the language model is received from the image display apparatus (S1310), response data is transmitted to the image display apparatus according to the request, and a database is updated based on the request and response data (S1330).
Meanwhile, in the update step (S1330), the request details and response details may be stored in association with the identification information of the image display apparatus.
The response data may be a language model including data corresponding to the data included in the request. The request data may take various forms depending on the embodiment: the voice signal itself, a noise-removed signal, or a signal at various processing stages such as a feature vector.
Meanwhile, the method of operating the server according to an exemplary embodiment of the present invention may further include receiving data based on the voice signal from the image display apparatus, determining voice recognition result data corresponding to that data, and storing at least one of a confidence value, a frequency of use, and a retry rate of the voice recognition result data.
Meanwhile, in the request receiving step (S1310), data including sentence-pattern information of the voice database provided in the image display apparatus may be received; in this case, the server may search for words whose frequency of use with, or semantic similarity to, the words in the sentence-pattern information exceeds a reference value.
FIG. 14 is a flowchart illustrating a method of operating an image display device system according to an exemplary embodiment of the present invention, and FIG. 15 is a view referred to for describing an example of an operating method of an image display device system according to an exemplary embodiment of the present invention.
Referring to the drawings, the image display apparatus may request data related to the language model from the server and receive the corresponding data, for example a list of strings, in response.
In this case, the image display apparatus stores the received data, generates a pronunciation symbol and a pronunciation dictionary for the strings included in the received data, and then generates grammar or context data based on the generated pronunciation symbols and pronunciation dictionary.
FIG. 15A illustrates a part of a string received from a server, and FIG. 15B illustrates phonetic symbols and pronunciation dictionary data for the string illustrated in FIG. 15A.
Meanwhile, grammar or context data may be generated as binary data and updated as shown in FIG. 15C.
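The update steps above (string list in, pronunciation dictionary and grammar out) can be sketched as follows. The letter-to-sound table, the phoneme labels, and the alternation-style "grammar" are all hypothetical simplifications; a real system would use a full grapheme-to-phoneme model and a compiled binary format as suggested by FIG. 15(c).

```python
# Hypothetical letter-to-sound table standing in for a real
# grapheme-to-phoneme model for the target language.
G2P = {"m": "M", "b": "B", "c": "K", "n": "N", "e": "EH", "w": "W", "s": "S"}


def to_phonemes(word):
    return " ".join(G2P.get(ch, ch.upper()) for ch in word.lower())


def build_pronunciation_dict(strings):
    """Generate a pronunciation entry for every string received from
    the server, mirroring the update step described above."""
    return {w: to_phonemes(w) for w in strings}


def build_grammar(strings):
    # Toy "grammar": an alternation over the received vocabulary.
    return "(" + " | ".join(sorted(strings)) + ")"
```

The resulting dictionary and grammar would replace or extend part of the embedded voice database rather than rebuilding it wholesale.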
FIG. 16 is a flowchart illustrating a method of operating an image display apparatus system according to an exemplary embodiment of the present invention.
When the image display apparatus receives a voice signal, the embedded speech engine first performs recognition on it.
Meanwhile, when the confidence value of the recognition result output by the speech engine is greater than the predetermined reference value R, the corresponding operation, such as text input, channel switching, or execution of a predetermined command, can be performed using the output data of the speech engine as the speech recognition result (S1640).
Meanwhile, when speech recognition fails, or when the confidence value of the recognition result output by the speech engine is equal to or less than the predetermined reference value R (S1650), the image display apparatus may transmit data based on the voice signal to the external server.
The server may then perform speech recognition on the received data and return voice recognition result data, after which the image display apparatus performs the corresponding operation based on that result.
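The overall S1610-S1680 flow just described can be condensed into a single dispatch function. The engine, server, and executor callables, the `Result` tuple, and the threshold value are all assumptions used to make the control flow concrete.

```python
from collections import namedtuple

Result = namedtuple("Result", "text confidence")
REFERENCE_VALUE = 0.7  # hypothetical reference value R


def handle_voice_input(signal, embedded_engine, server_recognize, execute):
    """Try the embedded engine first; fall back to the external server
    when recognition fails or the confidence is at or below R."""
    result = embedded_engine(signal)                 # embedded recognition
    if result is not None and result.confidence > REFERENCE_VALUE:
        execute(result.text)                         # S1640: run the command
        return result.text
    # S1650-S1670: failed or low confidence -> ask the server
    text = server_recognize(signal)
    execute(text)                                    # S1680: run the command
    return text
```

Either path ends in `execute`, so the user-visible behavior (channel change, text input, etc.) is the same regardless of which engine produced the result.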
FIGS. 17 to 24 are views referred to for describing various examples of an operating method of an image display device system according to an embodiment of the present invention.
Referring to FIG. 17, the image display apparatus may use both the embedded speech engine and a server-based speech recognition process.
Meanwhile, when speech recognition by the embedded speech engine fails, the server speech recognition process may proceed. Alternatively, when the confidence value of the embedded recognition result is less than the reference value, the embedded process may be terminated and the result of the server speech recognition process awaited. The confidence value may be understood as the system's own evaluation of the speech recognition result.
Since the embedded voice engine avoids a network round trip to the server, its response time can be expected to be shorter.
For example, a language model related to control of the video display device, such as channel switching and volume control, and a language model related to broadcast content, such as channel names and popular program names, may be stored in the internal voice database, so that the corresponding voice inputs can be recognized and the commands executed quickly. Other voice inputs may be handled by the external server.
While watching a predetermined channel, for example, the user may switch channels by voice, and such an input can be handled quickly by the embedded engine.
Meanwhile, as shown in (b) of FIG. 18, when a search word is input, it is difficult to know in advance what text the user will enter, so a language model or voice recognition may be requested from the server after the user's voice input.
Meanwhile, natural language processing (NLP) is a technology for understanding and analyzing human language, typically applied to speech-to-text (STT) results, and it can require a substantial amount of computation.
Therefore, it is more effective to use an external server for such natural language processing.
Referring to FIGS. 19 to 21, the image display apparatus may transmit a speech recognition result to a natural language processing server and receive the analysis result, for example as an XML document.
Meanwhile, the XML parser of the speech engine parses only the portion of the received XML document that requires an operation. As shown in FIG. 20, channel switching may be performed after the TV determines that switching to the MBC channel corresponds to the user's voice input intention.
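The selective parsing described above might look like the following sketch. The XML schema shown is entirely hypothetical, since the text does not specify the NLP server's response format.

```python
import xml.etree.ElementTree as ET

# Hypothetical response format; the actual schema is not given in the text.
RESPONSE = """<result>
  <command type="channel_change">
    <channel name="MBC" number="11"/>
  </command>
</result>"""


def parse_command(xml_text):
    """Parse only the portion of the document the TV needs to act on,
    ignoring any other elements the server may include."""
    root = ET.fromstring(xml_text)
    cmd = root.find("command")
    if cmd is not None and cmd.get("type") == "channel_change":
        ch = cmd.find("channel")
        return ("channel_change", ch.get("name"), int(ch.get("number")))
    return None  # unrecognized or unsupported command type
```

A dispatcher in the display device could map the returned tuple to the actual channel-switching operation.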
Meanwhile, to schedule recording of a broadcast program, the user would normally have to select the program name, date, start time, end time, and so on; instead, the system can understand and analyze the meaning of a voice input requesting the recording and execute the operation the user intends at once.
When the user performs a voice input such as "Record this week's Infinite Challenge," the embedded speech engine alone may have difficulty interpreting the user's intention.
However, according to an embodiment of the present invention, as shown in FIG. 21, the voice recognition result processed by the internal speech engine or an external STT server may be transmitted to the natural language processing server, and the type of command determined by the natural language processing server, together with the broadcast program information retrieved in association with it, may be received.
The image display apparatus may then schedule the recording based on the received command type and broadcast program information.
In this case, at least some voice information related to the broadcast-related information, such as a language model and a word dictionary, may be received together and stored in the embedded voice engine. Thereafter, the user may input commands by voice while watching a broadcast, and the embedded voice engine may perform recognition more quickly and efficiently based on the received and updated data.
Alternatively, when the content currently being used through the image display apparatus is identified, additional information related to that content, together with the associated voice information, may be requested from the server.
For example, while receiving, as additional information, details about an actor appearing in the broadcast program being watched, such as the actor's other works, the user may also receive the related word dictionaries and language models.
Alternatively, when the user runs a specific program, such as a word processor, through the image display apparatus, a language model related to that program may be requested and received.
As described above, the content may also be identified using various known automatic content recognition (ACR) techniques, and voice information related to the content requested from the server.
Although FIGS. 19 to 21 illustrate an example of using a natural language processing server, the present invention is not limited thereto; as in the above-described embodiments, recognition may also be performed with the embedded engine alone or with a speech-to-text server.
According to an exemplary embodiment of the present invention, the pointer displayed on the screen may be moved using the remote control device.
Referring to FIG. 22A, the user may designate a predetermined region or object on the screen with the pointer.
As shown in (b) of FIG. 22, when the user then inputs the voice command "search," the image display apparatus may perform a search related to the designated region.
Alternatively, as shown in FIG. 23, a similar operation may be performed while the user is viewing a predetermined image.
Also in this case, similarly to the above-described embodiment, when the user is watching an image or designates a predetermined region, the image display apparatus may request and receive from the server a language model corresponding to the image attributes, for example one including vocabulary and sentences such as "edit," "save," "search," "upload," "cut," "paste," and "note."
In addition, as shown in FIG. 24, another program may be driven based on a voice input of a user.
If the user inputs the voice command "edit" while viewing the Internet screen 2410 through the image display apparatus, a word processor program may be launched.
Also in this case, similarly to the above-described embodiments, while running the word processor program the image display apparatus may request and receive from the server a language model related to that program.
According to the present invention, the voice database used for speech recognition need not always be stored and maintained in full. Instead, a dynamic update may be performed in which the apparatus requests the data it needs from the server when necessary and replaces or supplements part of the voice database with the received data.
Accordingly, memory and resources can be managed efficiently, the database search area and time in the speech recognition process can be minimized, and the speech recognition speed improved.
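One simple way to realize the bounded, dynamically updated voice database described above is a least-recently-used store for language-model fragments. The class, its capacity, and the eviction policy are illustrative assumptions; the text only requires that part of the database be replaced or supplemented on demand.

```python
from collections import OrderedDict


class VoiceDatabase:
    """Toy bounded store for language-model fragments: when a new
    fragment arrives from the server and the store is full, the least
    recently used entry is evicted instead of keeping everything
    resident in memory."""

    def __init__(self, capacity=3):
        self.capacity = capacity
        self.models = OrderedDict()  # domain -> model data, LRU order

    def update(self, domain, model_data):
        if domain in self.models:
            self.models.move_to_end(domain)
        self.models[domain] = model_data
        if len(self.models) > self.capacity:
            self.models.popitem(last=False)  # evict least recently used

    def lookup(self, domain):
        if domain in self.models:
            self.models.move_to_end(domain)  # mark as recently used
            return self.models[domain]
        return None
```

Keeping only the most recently useful fragments both bounds memory and shrinks the search space the recognizer must scan, matching the efficiency claims above.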
Therefore, it is possible to operate the video display device accurately and conveniently through the voice recognition technology, and to efficiently manage internal resources, thereby improving user convenience.
The method of operating the image display device and the server according to the embodiments of the present invention is not limited to the configurations and methods of the embodiments described above; rather, all or some of the embodiments may be selectively combined so that various modifications can be made.
Meanwhile, the operating method of the image display device and the server of the present invention can be implemented as processor-readable code on a processor-readable recording medium provided in the image display device and the server. The processor-readable recording medium includes all kinds of recording apparatuses in which data that can be read by the processor is stored. Examples include ROM, RAM, CD-ROM, magnetic tape, floppy disks, and optical data storage devices, and the code may also be implemented in the form of a carrier wave, such as transmission over the Internet. In addition, the processor-readable recording medium may be distributed over network-connected computer systems so that processor-readable code can be stored and executed in a distributed fashion.
While the present invention has been particularly shown and described with reference to exemplary embodiments thereof, it will be understood by those skilled in the art that various changes in form and detail may be made therein without departing from the spirit and scope of the present invention.
Claims (21)
Requesting data associated with the language model from an external electronic device connected through a network;
Receiving data related to the language model from the external electronic device; And
Updating the language model stored in the voice database based on the received data.
And wherein said updating step dynamically generates grammar or context data based on the received data.
Receiving a voice signal;
Extracting a feature vector based on the received speech signal; And
And comparing the feature vector with the speech database to determine data corresponding to the received speech signal.
And the voice signal is received through a remote control device.
The requesting data related to the language model may include requesting a language model related to data corresponding to the feature vector or the received voice signal.
The requesting data related to the language model may include requesting a language model related to content currently being used.
Storing the received data;
Generating a phonetic symbol and a phonetic dictionary for a string included in the received data;
And generating grammar or context data based on the generated phonetic symbols and phonetic dictionaries.
Selecting one or more external electronic devices to request the data from among a plurality of external electronic devices connected through the network.
Receiving a voice signal; and
Recognizing the voice signal using the voice database;
The data request step related to the language model may include requesting, from the external electronic device, a language model including data corresponding to the voice signal when recognition of the voice signal fails or the confidence value of the recognition result is equal to or less than a reference value.
Receiving a voice signal;
Recognizing the speech signal using the speech database;
Transmitting data based on the voice signal to the external electronic device when the recognition of the voice signal fails or the confidence value of the recognition result is equal to or less than a reference value; And
And receiving voice recognition result data from the external electronic device.
And the data based on the voice signal is the voice signal itself, a feature vector extracted from the voice signal, or the recognition result.
Receiving a voice signal;
Recognizing the speech signal using the speech database;
Transmitting data based on the voice signal to the external electronic device;
Receiving voice recognition result data from the external electronic device; And
Using the received voice recognition result data when recognition of the voice signal using the voice database fails or the confidence value of the recognition result is lower than a reference value.
And requesting data related to the language model comprises transmitting data including sentence structure information of the speech database.
And the received data is data including words frequently used together with the words included in the sentence information, or words having high semantic similarity to them.
Transmitting response data to the video display device according to the request; And
Updating a database based on the request and response data.
The updating may include storing request details and response details in association with identification information of the image display apparatus.
And the response data is a language model including data included in the request.
Receiving data based on the audio signal from the image display device;
And determining voice recognition result data corresponding to the data based on the voice signal.
And storing at least one of a confidence value, a frequency of use, and a retry rate of the speech recognition result data.
The request receiving step may include receiving data including sentence structure information of a voice database provided in the image display apparatus.
And searching for a word whose frequency of use with, or semantic similarity to, the words included in the sentence information is higher than a reference value.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
KR1020120089061A KR20140022320A (en) | 2012-08-14 | 2012-08-14 | Method for operating an image display apparatus and a server |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
KR1020120089061A KR20140022320A (en) | 2012-08-14 | 2012-08-14 | Method for operating an image display apparatus and a server |
Publications (1)
Publication Number | Publication Date |
---|---|
KR20140022320A true KR20140022320A (en) | 2014-02-24 |
Family
ID=50268347
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
KR1020120089061A KR20140022320A (en) | 2012-08-14 | 2012-08-14 | Method for operating an image display apparatus and a server |
Country Status (1)
Country | Link |
---|---|
KR (1) | KR20140022320A (en) |
Cited By (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106131692A (en) * | 2016-07-14 | 2016-11-16 | 广州华多网络科技有限公司 | Interactive control method based on net cast, device and server |
US10134387B2 (en) | 2014-11-12 | 2018-11-20 | Samsung Electronics Co., Ltd. | Image display apparatus, method for driving the same, and computer readable recording medium |
KR20190096856A (en) | 2019-07-30 | 2019-08-20 | 엘지전자 주식회사 | Method and apparatus for recognizing a voice |
WO2020096073A1 (en) * | 2018-11-05 | 2020-05-14 | 주식회사 시스트란인터내셔널 | Method and device for generating optimal language model using big data |
WO2020122274A1 (en) * | 2018-12-11 | 2020-06-18 | 엘지전자 주식회사 | Display device |
WO2020122271A1 (en) * | 2018-12-11 | 2020-06-18 | 엘지전자 주식회사 | Display device |
US11205415B2 (en) | 2018-11-15 | 2021-12-21 | Samsung Electronics Co., Ltd. | Electronic apparatus and controlling method thereof |
WO2023167399A1 (en) * | 2022-03-04 | 2023-09-07 | 삼성전자주식회사 | Electronic device and control method therefor |
2012-08-14: KR application KR1020120089061A filed; published as KR20140022320A (status: not active, Application Discontinuation).
Cited By (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US10134387B2 (en) | 2014-11-12 | 2018-11-20 | Samsung Electronics Co., Ltd. | Image display apparatus, method for driving the same, and computer readable recording medium |
CN106131692A (en) * | 2016-07-14 | 2016-11-16 | 广州华多网络科技有限公司 | Interactive control method based on net cast, device and server |
WO2020096073A1 (en) * | 2018-11-05 | 2020-05-14 | 주식회사 시스트란인터내셔널 | Method and device for generating optimal language model using big data |
CN112997247A (en) * | 2018-11-05 | 2021-06-18 | 株式会社赛斯特安国际 | Method for generating optimal language model using big data and apparatus therefor |
US11205415B2 (en) | 2018-11-15 | 2021-12-21 | Samsung Electronics Co., Ltd. | Electronic apparatus and controlling method thereof |
US11615780B2 (en) | 2018-11-15 | 2023-03-28 | Samsung Electronics Co., Ltd. | Electronic apparatus and controlling method thereof |
US11961506B2 (en) | 2018-11-15 | 2024-04-16 | Samsung Electronics Co., Ltd. | Electronic apparatus and controlling method thereof |
WO2020122274A1 (en) * | 2018-12-11 | 2020-06-18 | 엘지전자 주식회사 | Display device |
WO2020122271A1 (en) * | 2018-12-11 | 2020-06-18 | 엘지전자 주식회사 | Display device |
KR20190096856A (en) | 2019-07-30 | 2019-08-20 | 엘지전자 주식회사 | Method and apparatus for recognizing a voice |
US11250843B2 (en) | 2019-07-30 | 2022-02-15 | Lg Electronics Inc. | Speech recognition method and speech recognition device |
WO2023167399A1 (en) * | 2022-03-04 | 2023-09-07 | 삼성전자주식회사 | Electronic device and control method therefor |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
KR20140022320A (en) | Method for operating an image display apparatus and a server | |
CN115145529B (en) | Voice control device method and electronic device | |
JP5746111B2 (en) | Electronic device and control method thereof | |
JP5819269B2 (en) | Electronic device and control method thereof | |
JP6111030B2 (en) | Electronic device and control method thereof | |
JP6603754B2 (en) | Information processing device | |
KR102527082B1 (en) | Display apparatus and the control method thereof | |
US20130041665A1 (en) | Electronic Device and Method of Controlling the Same | |
US11449307B2 (en) | Remote controller for controlling an external device using voice recognition and method thereof | |
CN110737840A (en) | Voice control method and display device | |
US20130169524A1 (en) | Electronic apparatus and method for controlling the same | |
JP2013037689A (en) | Electronic equipment and control method thereof | |
JP2014532933A (en) | Electronic device and control method thereof | |
KR20130018464A (en) | Electronic apparatus and method for controlling electronic apparatus thereof | |
CN112163086B (en) | Multi-intention recognition method and display device | |
CN112000820A (en) | Media asset recommendation method and display device | |
CN112885354B (en) | Display device, server and display control method based on voice | |
CN111625716B (en) | Media asset recommendation method, server and display device | |
CN111866568B (en) | Display device, server and video collection acquisition method based on voice | |
CN112511882A (en) | Display device and voice call-up method | |
CN112182196A (en) | Service equipment applied to multi-turn conversation and multi-turn conversation method | |
US20230282209A1 (en) | Display device and artificial intelligence server | |
EP3660841B1 (en) | Multimedia device for processing voice command | |
US10755475B2 (en) | Display apparatus and method of displaying content including shadows based on light source position | |
CN112256232B (en) | Display device and natural language generation post-processing method |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
WITN | Withdrawal due to no request for examination |