CN113228165A - Method, apparatus and application for generating answer output in response to speech input information - Google Patents


Info

Publication number
CN113228165A
Authority
CN
China
Prior art keywords
information
output
response
entity
speech
Prior art date
Legal status
Pending
Application number
CN201980084669.4A
Other languages
Chinese (zh)
Inventor
F. Galetzka
J. Rose
S. Jordan
Current Assignee
Volkswagen Automotive Co., Ltd.
Original Assignee
Volkswagen Automotive Co., Ltd.
Priority date
Filing date
Publication date
Application filed by Volkswagen Automotive Co., Ltd.
Publication of CN113228165A

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/22 - Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/20 - Natural language analysis
    • G06F40/279 - Recognition of textual entities
    • G06F40/289 - Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 - Named entity recognition
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/30 - Semantic analysis
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/08 - Speech classification or search
    • G10L15/18 - Speech classification or search using natural language modelling
    • G10L15/1822 - Parsing for meaning understanding
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/08 - Speech classification or search
    • G10L15/16 - Speech classification or search using artificial neural networks
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/22 - Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L2015/223 - Execution procedure of a spoken command

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Theoretical Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Multimedia (AREA)
  • Acoustics & Sound (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Human Computer Interaction (AREA)
  • Machine Translation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to a method for generating a response output (A) in response to speech input information (E), the method having: a) determining whether the speech input information (E) comprises at least one predefined entity; and, when this is the case: b) generating, on the basis of the speech input information (E), a speech information set in which the entity is not comprised; c) generating an information set, wherein the information set comprises information assigned to the entity in a knowledge database (14); d) with a first analysis unit (22): determining, from the information set, at least one piece of target information that is important with respect to the entity; e) with a second analysis unit (24): generating a response portion output based on the determined target information; and f) generating a response output (A) based on the response portion output. The invention also relates to a system (10) and to the use of such a system (10).

Description

Method, apparatus and application for generating answer output in response to speech input information
Technical Field
The present invention relates to a method and apparatus for generating an answer output in response to speech input information. More precisely, the invention relates to the field of computer-aided speech analysis and in particular of voice conversation systems, wherein speech input information is detected and a response output is generated on the basis thereof.
Background
It is known to analyze speech input information by means of a computer-aided system, in particular an analysis unit comprised therein, and to generate a response output on the basis thereof. In this case, a distinction can be made between pure speech recognition and the analysis of a reasonable response (answer output) to the recognized speech input information, wherein the analysis can comprise, for example, detection of the semantic content of the speech input information. In this context, speech input information may be understood as information in text or audio format which is generated on the basis of the speech input and/or which describes the content of the speech input.
Within the framework of so-called goal-oriented conversations, the user can, for example, ask questions in order to obtain desired information. In general, a conversation may be referred to as goal-oriented if it is directed toward a defined purpose, as is the case, for example, when querying information; the purpose would then be to obtain that information. Likewise, there are non-goal-oriented conversations, in which a user has a dialogue with a computer-aided system (conversation system) without a previously defined purpose, for example one specified by desired information. This may also be referred to as chatting with such a system.
Handling non-goal-oriented conversations is technically more demanding, since their course, e.g. in the form of so-called conversation states, is more difficult to estimate. In order to be able to respond appropriately to the many possible conversation states and to produce a suitable answer output, the conversation system must be able to analyze, or understand, the vast amount of information contained in the speech input information. This means that the system, and in particular its analysis unit, has to be trained by means of a correspondingly rich dictionary or a general database. The latter applies mainly to systems based on machine learning methods and in particular so-called deep learning methods. For example, in such systems, a corresponding word must be explicitly registered in a dictionary for every conceivable proper name (such as the names of persons and cities) or for any numerical value (for example a year or a temperature value), so that the analysis unit can be trained with the aid of the dictionary and later recognize the corresponding words in order to correctly detect the semantic content. This increases the set-up effort and makes the learning process of the corresponding analysis unit difficult.
To date, attempts have been made to solve this so-called out-of-vocabulary problem as follows: the analysis unit can identify not only the words learned in a preliminary learning process, but also other words that have appeared in the conversation history so far (i.e., during prior use of the analysis unit). However, in this case it is also necessary to constantly record new words appearing in the conversation history and to register them in the corresponding dictionary. Examples of such methods are found in the following scientific publications:
Raghu, D. & Gupta, N. (2018). Hierarchical Pointer Memory Network for Task Oriented Dialogue. arXiv preprint arXiv:1805.01216
Eric, M. & Manning, C. D. (2017). A copy-augmented sequence-to-sequence architecture gives good performance on task-oriented dialogue. arXiv preprint arXiv:1701.04024
On the other hand, there are also systems that attempt to follow non-goal-oriented conversations in a rule-based fashion rather than on the basis of machine learning (see, e.g., https://www.pandorabots.com/mitsuku/). However, the achievable quality is greatly limited by the states and rules that have to be predefined, for example in the form of so-called expert knowledge.
Thus, with neither of these known approaches is it possible to respond appropriately to unforeseeable speech input information, especially in the context of non-goal-oriented conversations, and especially when terms are used within the speech input information that have not yet appeared in the conversation history and/or in training. There is also usually a one-dimensional information flow in the known solutions: the analysis unit can query a dictionary for speech-recognition purposes, and the dictionary can also be dynamically supplemented on the basis of the conversation history in the manner set forth above. However, the (in particular dynamically supplemented) entries of the dictionary cannot be used to generate a response output.
Another disadvantage, at least of the above-mentioned methods that are not based on machine learning, is that a large amount of so-called expert knowledge, or expert knowledge models, is still used as an analysis component. However, this deviates from the true goal of handling non-goal-oriented conversations, since ever more predefined relationships in the form of expert knowledge have to be defined and registered in advance. Expert knowledge in this context is generally understood as the definition of logical relationships or rules with which a suitable response (in particular a response output) can be determined for an existing state.
Disclosure of Invention
The object of the invention is therefore to improve the computer-based generation of an answer output in response to a speech input, especially in the case of non-goal-oriented conversations.
This object is achieved by a method having the features of claim 1, a device having the features of claim 9 and a use having the features of claim 10. Advantageous embodiments are specified in the dependent claims. It goes without saying that all the features and definitions mentioned at the outset (alone or in any combination, and unless stated or evident otherwise) may also be provided in, or applied equally to, this solution.
The basic idea of the invention is to divide the analysis of the speech input information between two analysis units, which are preferably designed separately from one another (for example as separate software programs, software modules or computer program products), but which can advantageously also exchange information with one another. One of the analysis units can be set up to generate a response portion output, which can be generated in particular in response to frequently recurring and/or familiar content of the speech input. The other analysis unit can be set up to determine target information that is assigned to a predefined entity within the speech input, wherein this target information can then become a component of the final answer output. The target information can also serve as an input variable for the first-mentioned analysis unit in order to generate the response output.
This has the following advantage: content of the speech input, that is to say of the speech input information, can be predefined as entities which, according to the invention, can have a high variability or be associated with a large amount of other information that cannot be predicted unambiguously. Further entities can also be supplemented during operation and registered, for example, in a knowledge database, without requiring a new learning or training process of the system. This is important, for example, for supplementing new information that appears at run time (e.g., newly released movies, etc.). The entities may in particular be proper names or values. Examples of information associated with an entity in the form of a movie title are the director, the participating actors, the genre, movie awards, and so forth. The associated information itself may also be an entity.
At least one piece of this associated information can then be determined by one of the analysis units as the target information that is important in the current context. The analysis unit can be trained for this purpose (for example by means of a machine learning process), in particular in such a way that it determines the target information on the basis of the current speech context and a selection of associated information (the information set, or in other words a subset of the knowledge database). This can be understood in particular as follows: the analysis unit evaluates all associated information with respect to its presumed importance (for example in the form of a probability specification) and then selects the associated information with the highest importance rating as target information.
This makes it possible to reasonably embed additional, targeted information matching the entity into the final answer output. At most, therefore, the further analysis unit must be trained to process information associated with the predefined and diverse entities, but not the analysis unit that produces the above-described response portion output. The associated information does not have to be generated by the analysis unit itself, but can be retrieved from the knowledge database. This reduces the required functional range of the analysis units and thus also eases their learning process.
In other words, according to the invention, a distinction can be made between content of the speech input that may have increased variability (referred to herein as entities) and/or be associated with a large amount of other information, and content of the speech input that recurs frequently and is not, for example, a proper name. For these two types of information/content, separate analysis units can then be provided, each optimized for the analysis task assigned to it and with a correspondingly targeted learning process. Furthermore, the embedding of a knowledge database is advantageous, from which one of these analysis units can then select important target information for embedding into the final answer output. As described above, the knowledge database can also be continuously supplemented with new entities without the analysis units having to undergo new learning processes.
In detail, a method for generating a response output in response to speech input information (e.g. speech input converted into a text format) is proposed, the method having:
a) determining whether the speech input information includes at least one predefined entity;
and when this is the case:
b) generating a set of speech information in which the entity is not included based on the speech input information;
c) generating an information set (entity information set), wherein the information set comprises information in a knowledge database assigned to the entity;
d) with the first analysis unit: determining from the information set at least one piece of target information important with respect to the (determined, predefined) entity;
e) with the second analysis unit: generating a response portion output based on the determined target information; and
f) generating a response output based on the response portion output.
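Steps a) to f) can be sketched as a minimal toy pipeline. All names here (generate_answer, KNOWLEDGE_DB, the placeholder string, the entity and facts) are hypothetical illustrations and not the patent's actual implementation; in particular, steps d) and e) are reduced to trivial stand-ins for the trained analysis units.

```python
# Hypothetical knowledge database: entity -> associated information (step c)).
KNOWLEDGE_DB = {
    "Inception": {"director": "Christopher Nolan", "year": "2010"},
}

def generate_answer(speech_input: str) -> str:
    # a) determine whether the input contains a predefined entity
    entity = next((e for e in KNOWLEDGE_DB if e in speech_input), None)
    if entity is None:
        return "Sorry, I did not catch that."
    # b) speech information set: the input with the entity replaced by a placeholder
    speech_info_set = speech_input.replace(entity, "<MOVIE_TITLE>")
    # c) information set: facts assigned to the entity in the knowledge database
    info_set = KNOWLEDGE_DB[entity]
    # d) first analysis unit: select target information (trivially chosen here)
    target = info_set["director"]
    # e) second analysis unit: response portion output, still containing the placeholder
    part_output = f"<MOVIE_TITLE> was directed by {target}."
    # f) final answer output: the placeholder is replaced by the entity again
    return part_output.replace("<MOVIE_TITLE>", entity)
```

For the input "Who made Inception?" this sketch yields "Inception was directed by Christopher Nolan."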
In general, the method can be implemented in a computer-assisted manner and/or by means of at least one (preferably digitally and/or electronically operating) computing unit (for example a graphics card, a microprocessor or a general-purpose computer processor). For example, the method may be performed on a conventional PC.
At least one of the speech input information and the answer output may be present or generated in the form of an audio file, or as audio information. Additionally or alternatively, the speech input information and/or the answer output may be in text form. In principle, provision can be made for the solution according to the invention to be embedded in a system that performs conversion between audio information and text information. For example, a so-called ASR unit (Automatic Speech Recognition) can convert an audio speech input into text (which then corresponds to the exemplary speech input information according to the invention), which is then processed by the solution according to the invention. In addition, a downstream TTS unit (text-to-speech) can convert the text-based answer output generated according to the invention into an audio output. However, the solution according to the invention may also itself comprise such ASR and/or TTS units, or audio-text conversion functions in general. Furthermore, unless otherwise stated or apparent, references herein to speech input information may include not only input in audio format but also (audio) input converted into text format (that is to say, the speech input information may generally be present in audio or text form, wherein conversion into text form is preferably carried out, at least temporarily, in order to provide the measures according to the invention).
As described above, for example, item names, place names, object names, blocking names, job names or topic-related names, and in particular proper names or values, can be predefined as entities. To identify such entities, existing solutions may be used, such as so-called "named entity recognition" or "named entity resolution" algorithms. In this way, for example, entities in the form of predefined proper names can be identified, as used, for example, for naming sports teams, movies, places or people.
The speech information set may be generated in the form of a data record or, in general, as digital information. The speech information set may comprise the speech input information, present at least partially in text form as a first recognition result, but without the entity. The entity may be deleted or, preferably, replaced by a so-called placeholder (or, in other words, by a template). The placeholder may contain, for example, a title, a generic term, or an identifier that names the entity. It is generally preferred that the second analysis unit also considers the placeholder when determining the response portion output.
If one of these predefined entities is recognized in the speech input information (in particular in its text form), for example by means of a "named entity resolution" algorithm, the corresponding placeholder can be embedded into the recognition result, in particular into the text, in place of the entity. This can be achieved by means of a so-called constructor.
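The placeholder substitution described above can be illustrated with a simple gazetteer lookup. This is only a sketch: a real system would use a trained named-entity-recognition model, and the entity table and placeholder names are invented for illustration.

```python
# Hypothetical gazetteer mapping known entities to their placeholder classes.
ENTITY_CLASSES = {
    "Titanic": "<MOVIE_TITLE>",
    "Berlin": "<CITY>",
}

def substitute_entities(text: str):
    """Return (text with placeholders embedded, list of (entity, placeholder) found)."""
    found = []
    for entity, placeholder in ENTITY_CLASSES.items():
        if entity in text:
            # embed the placeholder into the recognition result in place of the entity
            text = text.replace(entity, placeholder)
            found.append((entity, placeholder))
    return text, found
```

Applied to "I watched Titanic in Berlin", this produces the speech information set "I watched <MOVIE_TITLE> in <CITY>" together with the recognized entity/placeholder pairs.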
Information can be registered in the knowledge database and can be associated and called up, for example, by means of so-called tags, identifiers, hashes or so-called keyed relations. Through the features listed above, relationships with recognized entities (in the speech input information) can be established. If, for example, a movie title is identified as an entity, in this way it is possible to determine association information in the form of actors performing therein, the year to which it belongs, the place of shooting, etc., which is likewise registered in the knowledge database and preferably linked to the entity by one of the above-mentioned relationships.
That is, in general, hash requests may be directed to a knowledge database to combine information associated in the knowledge database with the entity into an information set.
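Such a keyed request can be sketched with a plain dictionary standing in for the knowledge database; the entity, the relation labels and the facts are invented for illustration, and a real knowledge database would use proper keys, tags or hashes as described above.

```python
# Hypothetical knowledge database: facts linked to an entity by labelled relations.
KNOWLEDGE_DB = {
    "Titanic": [
        ("actor", "Leonardo DiCaprio"),
        ("actor", "Kate Winslet"),
        ("year", "1997"),
        ("director", "James Cameron"),
    ],
}

def information_set(entity: str):
    # combine all information associated with the entity into one information set
    return KNOWLEDGE_DB.get(entity, [])
```

A request for "Titanic" thus combines the linked actor, year and director entries into the information set that the first analysis unit later evaluates.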
In the information set, all entries found in the knowledge database for the entity may, for example, be collected numerically and/or vectorially. Preferably, these values or vectors are stored in a matrix, which is then used by the first analysis unit as input information in order to identify the most important target information (preferably taking into account the analysis result of the second analysis unit as a further input parameter).
The first analysis unit and the second analysis unit may each be provided as a computer program product, in particular as a computer program or a software module. The first and second analysis units may, however, have different input and/or output variables (compared with one another), or be set up to process and generate correspondingly different input and/or output variables.
In general, it should be noted that, unless otherwise stated or apparent, the term analysis in relation to these analysis units may refer to analysis in the sense of determining an appropriate answer output (or at least a component of an answer output), determining target information, and/or detecting the semantic content of at least part of the speech input information. The target information may be the information from the information set with the highest (presumed) importance with respect to the entity. The importance can be determined (by the first analysis unit) in particular as a function of the currently existing conversation state or of the conversation history. More precisely, the first analysis unit may obtain the conversation history (also synonymously referred to herein as the speech context) and the information set as input variables. Within the framework of a machine learning process, the first analysis unit may have been trained to select one entry of the information set as target information on the basis of these input variables, or to rate that entry as particularly important. The information may include one or more entities that are deemed important for generating the response portion output. The speech context may be obtained, for example, from the second analysis unit. The first analysis unit then outputs the target information as an output variable, preferably to the second analysis unit.
The second analysis unit, in turn, can obtain the speech information set (including, in particular, possible placeholders) and the determined target information as input parameters. The second analysis unit may have been trained (e.g. within the framework of a machine learning process) to generate and output a response portion output based thereon.
In other words, at least one, and preferably exactly one, piece of the information contained in the information set and associated with the entity may be selected as target information. Preferably, the target information has the currently highest importance and/or is most suitable to become part of the answer output with respect to the current speech context. It is generally preferred that the second analysis unit determines the response portion output only once the determined target information is known, and in particular as a function of it. The target information may, but need not, become part of the response portion output and/or the response output. In general, the target information may include at least one entity or a group of entities. For example, the target information may correspond to a row of the matrix discussed herein, in which the entries of the information set are arranged and which forms the information set.
If, for example, a movie title is currently mentioned within the framework of the speech input information and the quality of the cast is highlighted, it may be important to select actor names as the target information associated with the movie title identified as the entity, and to make these actor names part of the response output. If, on the other hand, it is clear from the speech input information that the user still wishes to watch the movie first, it may be more important to discuss possible viewing options for the movie (i.e., for example, the terms DVD or cinema become part of the answer output).
The second analysis unit may also generate the response portion output (e.g., as a text or audio output) based on the speech information set. In particular, the second analysis unit may determine the current conversation state and/or the speech context (synonymous herein with conversation history). The conversation state may be, for example, a predefined state, such as posing a question about a defined subject area.
The final answer output may be generated by means of an optional reconstruction unit. In general, the answer output may be distinguished from the response portion output in that the answer output contains the response portion output, but possibly also target information and/or entities embedded therein and/or additionally supplemented, which replace placeholders that may be contained in the response portion output.
In other words, the answer output may be a combination of the response portion output and the target information, and/or may include one or more pieces of target information inserted into the response portion output. In particular, possible placeholders in the response portion output may be replaced by target information, but also by other entities. The technical contribution of the solution according to the invention can be seen in particular in that the important target information can be determined at limited cost, and a sentence structure (response portion output) matching the found target information can be formed with the second analysis unit. These contents can then be combined as described above to produce the response output.
Thus, according to one embodiment, it may be provided that, in the speech information set, the entity is replaced by a placeholder, and this placeholder is then replaced again by the entity if it is also included in the response portion output. This can be done, for example, within the framework of step f). If no corresponding placeholder is included, the response portion output may itself correspond to the final answer output. Instead of "response portion output", one could therefore also speak of response output information, which is either further processed by replacing the placeholder or already corresponds to the final answer output.

One embodiment of the invention provides that the second analysis unit obtains the analysis result of the first analysis unit, e.g. in the form of the determined target information or a (sub)set of entities forming the target information, in order to determine the response portion output. More precisely, these analysis results may form input variables of the second analysis unit. In general, the second analysis unit may be trained or taught to determine the appropriate response portion output based on these analysis results.
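The reconstruction step, replacing placeholders in the response portion output by the entity or the determined target information, amounts to a simple substitution. The function name, placeholder strings and values below are hypothetical illustrations of this step, not the patent's reconstruction unit itself.

```python
def reconstruct(part_output: str, bindings: dict) -> str:
    """Replace each placeholder in the response portion output by its bound value
    (the entity or a piece of target information)."""
    for placeholder, value in bindings.items():
        part_output = part_output.replace(placeholder, value)
    return part_output
```

For example, reconstructing "<MOVIE_TITLE> stars <ACTOR>." with the bindings {"<MOVIE_TITLE>": "Titanic", "<ACTOR>": "Kate Winslet"} yields the final answer output "Titanic stars Kate Winslet." A response portion output without placeholders passes through unchanged, matching the case where it already corresponds to the final answer output.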
According to one embodiment, the first analysis unit and/or the second analysis unit comprises an (artificial) neural network. In particular, at least the second analysis unit can comprise a (so-called deep) neural network with a plurality of layers, wherein the so-called input and output layers are linked by a plurality of intermediate layers. Each layer of the neural network may contain nodes in a known manner, and these nodes are linked by weighted connections with one or more nodes in adjacent layers. The weights and links of the nodes may be taught within the training or learning process of the neural network in a manner known per se. In general, a neural network may define a non-linear relationship between input parameters and output parameters, and thus may determine suitable output parameters depending on the provided input parameters. This nonlinearity can be caused, or taken into account, by so-called activation functions of the neural network. Such networks may represent the mathematical relationship of the input and output variables by means of corresponding nodes and layers.
In this context, it may also be provided that the second analysis unit comprises a neural network in the form of a (preferably hierarchical) sequence-to-sequence model (Seq2Seq model). The neural network may comprise, in a manner known per se, at least one encoder which obtains input parameters (e.g. the speech information set). Furthermore, the encoder may determine a state based on the input parameters and output this state to a decoder of the network. The decoder may generate a response portion output of the type set forth herein and output it as an output parameter.
Alternatively or additionally, the first analysis unit may comprise a neural network in the form of a feedforward network and/or one with an attention mechanism. In general, feedforward networks may have no feedback and may therefore be simple networks, in particular with only two layers. The network may obtain the information set and the speech context (conversation history) as input parameters, and in particular an information set encoded or scaled into a vector representation.
In general, the first analysis unit may determine a (sub)set of potentially important information from the knowledge database and then determine the target information therefrom. The latter is preferably achieved by means of an attention mechanism. In principle, with the aid of the attention mechanism, a value (score) can be calculated for each piece of information of the subset on the basis of the conversation history, which value specifies the presumed importance of the information, for example as a probability. Information with a low rating is not selected as target information and has no effect on the response portion output, since it is not taken into account by the second analysis unit.
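The scoring idea behind such an attention mechanism can be sketched as follows: each candidate information vector is scored against a context vector (here by a plain dot product), the scores are softmax-normalized into probabilities, and the highest-scoring entry is selected as target information. The vectors, labels and scoring function are invented for illustration; a trained network would learn these weights.

```python
import math

def attention_select(context, candidates):
    """candidates: list of (label, vector). Returns the best label and the
    softmax-normalized importance scores over all candidates."""
    # dot-product score of each candidate vector against the context vector
    scores = [sum(c * q for c, q in zip(vec, context)) for _, vec in candidates]
    # numerically stable softmax turns scores into probabilities
    exp = [math.exp(s - max(scores)) for s in scores]
    total = sum(exp)
    probs = [e / total for e in exp]
    # the candidate with the highest probability becomes the target information
    best = max(range(len(candidates)), key=lambda i: probs[i])
    return candidates[best][0], probs
```

With a context vector pointing toward the "actor" direction, an actor entry would receive a higher probability than, say, a year entry, and would therefore be selected as target information.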
According to one embodiment, the first analysis unit obtains a vector representation of the information set as input variable. This can be achieved by so-called embedding, in which the information registered in the knowledge database (that assigned to the entity) is scaled to a vector representation. The result may be the information set evaluated or examined by the first analysis unit. Alternatively, the collection of information existing prior to the corresponding vector conversion may be regarded as the information set, and a vector representation of it may be generated on the basis of the embedding.
In general, mapping into a vector representation provides the advantage that the analysis unit can be based on a mathematical model or a mathematical function and can therefore process input variables in the form of vector representations more readily. Advantageously, speech information can also be embedded using known methods such as "Word2Vec" or "GloVe".
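A minimal embedding sketch, assuming a precomputed word-vector table: each word of a knowledge-base entry is looked up and the entry is represented by the average of its word vectors. The two-dimensional vectors in `WORD_VECS` are invented; Word2Vec or GloVe would supply real, higher-dimensional ones.

```python
# Illustrative word-vector table; real embeddings would be learned.
WORD_VECS = {"actor": [0.8, 0.2], "x": [0.6, 0.4], "director": [0.3, 0.7]}

def embed_entry(text, dim=2):
    """Map a knowledge-base entry to a vector by averaging its word vectors.
    Unknown words fall back to the zero vector."""
    words = text.lower().split()
    vecs = [WORD_VECS.get(w, [0.0] * dim) for w in words]
    if not vecs:
        return [0.0] * dim
    return [sum(col) / len(vecs) for col in zip(*vecs)]
```

The resulting vectors are what the first analysis unit can then score against the speech context.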
According to a further embodiment, the second analysis unit is not set up to determine the target information (or target entity) for the determined entity itself on the basis of the information set in the knowledge database. In particular, the second analysis unit may not be set up to determine the relevant and/or assigned target information (or target entities), in particular not on the basis of entries of the knowledge database and/or the above-mentioned information set. In particular, this analysis unit may not be trained for this purpose and/or may not model or recognise the corresponding relationships. The second analysis unit may, however, be set up and in particular trained to process the target information and/or to generate a matching response part output on the basis of it. The determination of the target information itself, though, is preferably effected by means of the first analysis unit. As a result, the second analysis unit can instead be dedicated to other tasks and the training effort can be adapted accordingly. For example, while the second analysis unit should, if necessary, be able to handle placeholders or general entity classes such as "movie title", it does not have to learn all of the entities or proper names specifically associated with them. Alternatively, these specific entities (apart, preferably, from the placeholders used for them) can be deliberately excluded from the analysis scope of the second analysis unit, which instead preferably generates the response part output at most on the basis of the placeholders or entity classes, the target information and the speech context, and is trained accordingly.
Additionally or alternatively, provision may be made for the first analysis unit not to receive the set of speech information as an input variable and/or not to be set up to generate a response part output (or the response part output) on its own.
The invention also relates to a system for generating a response output in response to speech input information, the system having:
a) a determination unit which is set up to determine whether the speech input information comprises at least one predefined entity;
b) a speech information set generation unit which is set up to generate, on the basis of the speech input information, a speech information set in which the determined entity is not contained;
c) an information set generation unit which is set up to generate an information set relating to the entity, wherein the information set contains information in a knowledge database which is assigned to the entity;
d) a first evaluation unit which is set up to determine at least one item of target information which is relevant in relation to the entity from the information set;
e) a second analysis unit which is set up to generate a response portion output based on the determined target information; and
f) a response output unit, which is set up to generate a response output based on the response portion output.
Finally, the invention also relates to the use of a system according to the above-described aspect for generating a response output in a non-goal-oriented conversation. It has been shown that providing the two analysis units according to the invention, and thus dividing the respective analysis tasks, is particularly advantageous precisely in such non-goal-oriented conversations (between the user and the computer-aided system).
In general, the system may be set up to carry out a method according to any of the aspects above and below. All embodiments discussed in connection with corresponding method features may likewise be provided with the analogous system features. In general, the system may be computer-based or may be referred to as a computer-based voice conversation system. The units of the system may be implemented as software modules and/or as components of a computer program product (e.g. in the form of respective software functions). Optionally, the system may comprise a computing unit (e.g. a processor) on which the units are preferably executed, and/or a storage device on which the knowledge database is preferably stored.
In the following, embodiments of the invention are explained with reference to the attached drawings, in which:
fig. 1 shows an overview of a system according to an embodiment of the invention;
FIG. 2 shows a detailed view of the system of FIG. 1; and
fig. 3 shows a flow chart of a method according to the invention implemented with the system of fig. 1 and 2.
In fig. 1, a system 10 according to an embodiment of the invention is schematically shown. The system 10 is implemented as a computer program product in which the functions or functional blocks described below can be realised by individual program modules in a manner known per se. The system 10 may be implemented on a computing unit of a conventional PC. Only the knowledge database 14 explained below is to be understood not primarily as a software module or a function executable by a computing unit, but as a data collection registered in the system 10. All units of the system 10 mentioned herein may accordingly be realised as software modules, computer program products or software functions. Alternatively, however, the system 10 may comprise hardware components in the form of computing units (processors and/or graphics cards) and/or a storage device for the knowledge database.
The system 10 receives as an input variable speech input information E in the form of an audio input that has been converted into text form. In the determination unit 12, which is a named entity recognizer (NER) or named entity resolver based on a known algorithm, predefined entities contained in the speech input information can be recognised. For this purpose, the determination unit 12 can use the knowledge database 14 described below, in which the corresponding entities are registered. The recognition of entities in speech input information, especially when it is already present in text form, is known per se and can be implemented in practice using conventional named entity recognizers, which are usually based on neural networks.
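The determination unit's role can be sketched with a simple gazetteer-style lookup against the entities registered in the knowledge database. The database contents and the entity class name are invented for illustration; production named entity recognizers are, as noted above, typically neural.

```python
# Illustrative knowledge database: entity name -> entity class.
KNOWLEDGE_DB = {"james bond": {"class": "movie"}}

def find_entities(utterance):
    """Return all registered entities found in the utterance, as
    (name, class) pairs, by case-insensitive substring matching."""
    found = []
    low = utterance.lower()
    for name, meta in KNOWLEDGE_DB.items():
        if name in low:
            found.append((name, meta["class"]))
    return found
```

The entity class returned here is what later becomes the placeholder text.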
The determination unit 12 may output the corresponding entities contained in the speech input information E as output variables. The output signal of the determination unit 12 is fed to a speech information set generation unit 16, which may also be referred to as a constructor. The speech information set generation unit 16 generates a speech information set (for example in the form of a text file) in which the entities are each replaced by placeholders or, more generally, in which no recognised entity is present. The speech information set is supplied as an input variable to an analysis block 18.
This has the following advantage: the second analysis unit 24 (see fig. 2), which is set forth below and is contained in the analysis block 18 only schematically outlined in fig. 1, does not have to be set up to analyse or operate on specific entities, but at most to process the corresponding placeholders. It is easy to see that the entities have a significantly higher variance than the placeholders, each of which can be understood as a generic term covering a large number of different entities. This reduces the learning or training effort. Instead, the second analysis unit 24 can concentrate on (and in particular be limited to) determining a general speech context, with possible entities excluded entirely or considered only as placeholders (e.g. "[movie]" (placeholder) instead of a specific movie title (entity) within the text file representing the speech input information). The analysis block 18 may also access the knowledge database 14, as outlined by the corresponding arrows in fig. 1.
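The constructor step can be sketched as a straightforward substitution: each recognised entity is replaced by its class placeholder so that downstream units never see the raw proper name. The bracketed placeholder syntax is an assumption for illustration.

```python
import re

def build_speech_info_set(utterance, entities):
    """Replace each recognised entity with its class placeholder,
    e.g. 'James Bond' -> '[movie]'. entities: list of (name, class)."""
    out = utterance
    for name, ent_class in entities:
        out = re.sub(re.escape(name), f"[{ent_class}]", out,
                     flags=re.IGNORECASE)
    return out
```

The result is the text-file-like speech information set handed to the analysis block 18.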
The analysis block 18 emits the response part output described below as an output variable. If the response part output still contains placeholders, these are replaced by matching entities in the response output unit 21, which may also be referred to as a reconstructor, and are thus combined with the response part output into a response output A. The response output A may in turn be or comprise a text file and may, for example, be output to the user via an audio output device after conversion into an audio file.
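The reconstructor's substitution is the mirror image of the constructor step and can be sketched as follows. The placeholder-to-value bindings would in practice come from the target information determined by the first analysis unit; the names here are hypothetical.

```python
def reconstruct(answer_part, bindings):
    """Fill any placeholders left in the response part output with the
    matching target information, yielding the final response output A."""
    out = answer_part
    for placeholder, value in bindings.items():
        out = out.replace(placeholder, value)
    return out
```
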
In fig. 2, a detailed view of the system 10 of fig. 1 is shown, in which, among other things, further details of the analysis block 18 can be seen. Also shown are the speech input information E in text form, the determination unit 12, which can recognise entities within the input variable E by means of the knowledge database 14, and the speech information set generation unit 16.
It should be noted that the access to the knowledge database 14 by the determination unit 12 is entirely optional and may serve, for example, to improve recognition accuracy. In general, and without being limited to this example and its other details, the determination unit 12 may also evaluate information from the speech input information E in order to improve recognition accuracy. If, for example, the input contains an actor's name, the probability that another entity in the input is a movie title increases.
The analysis block 18 is outlined with dashed lines and its individual functional blocks are shown. First, a vector generation unit 20 (embedder) is shown, with which the information set described below can be mapped into a vector representation. For this purpose, the information set generation unit 25 integrated into the vector generation unit 20 can access the knowledge database 14 and first determine from it the information associated with or assigned to the determined entity. This can be achieved by a so-called hash request. The information can then be aggregated into an information (sub)set and mapped into a vector representation.
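The "hash request" described above maps naturally onto a dictionary lookup: the entity name keys a record of assigned facts, which are gathered into the information (sub)set. The database schema and its contents are illustrative assumptions.

```python
# Illustrative knowledge database: entity -> (class, assigned facts).
KNOWLEDGE_DB = {
    "james bond": {"class": "movie",
                   "facts": ["actor: X", "actor: Y", "year: 1962"]},
}

def information_set(entity):
    """Hash-style lookup of all facts assigned to an entity; the result
    is the information (sub)set later embedded by the vector unit."""
    record = KNOWLEDGE_DB.get(entity.lower())
    return list(record["facts"]) if record else []
```
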
The vector representation of the information set then forms the input variable of the first analysis unit 22. This unit is designed as a "feed-forward" neural network and can use the information set represented as vectors to select information contained therein as target information, on the basis of which the second analysis unit 24 explained below determines the response part output. The speech context set forth below is also taken into account for this purpose; it results from the analysis by the second analysis unit 24 (see the corresponding arrow-shaped connections between these units in fig. 2). The neural network of the first analysis unit 22 thus determines, as target information, that information from the entity's information set which is relevant, in particular with regard to the current speech context. Preferably, the target information is the item of the information set that the first analysis unit 22 classifies as relevant with the highest probability, or to which it assigns the highest relevance score. This target information forms the output variable of the first analysis unit 22 and is output to the second analysis unit 24.
It can also be seen that the analysis block 18 comprises the second analysis unit 24, which receives the speech information set of the generation unit 16 and the target information of the first analysis unit 22 as input variables.
The second analysis unit 24 determines the response part output on the basis of these input variables, taking into account the placeholders inserted by the speech information set generation unit 16 and the continuously determined and updated speech context. The speech context indicates, among other things, which type of response output A is likely to be relevant and suitable from the user's point of view. The speech context is also output to the first analysis unit 22, which then determines the target information on that basis.
The response part output as determined contains neither the entity of the speech input information E nor the associated target information. The latter is determined by the first analysis unit 22 in the manner described above and is only inserted by the response output unit 21, specifically in such a way that any placeholders still present in the response part output are replaced, for example, by the corresponding target information.
In the following, an exemplary flow of the method according to the invention is described with reference to fig. 3. In step S1, the speech input information E is obtained in text form and the determination unit 12 checks whether an entity is present. If no entity is found, the analysis block 18 determines the response output A by means of the second analysis unit 24 alone; this case is not separately shown in fig. 3.
If an entity is determined to be present, the speech information set generation unit 16 generates, in step S2, a speech information set in which the entities of the speech input information E are each replaced by placeholders.
Upstream of, downstream of (as in fig. 3), or at least partially in parallel with step S2, the information set generation unit 25 determines the information assigned to the entity from the knowledge database 14 in step S5. This information is combined into the above-described information set and converted into a vector representation by the vector generation unit 20 in step S6.
Next, in step S7, the first analysis unit 22 determines one item of the information set as the target information, taking into account the speech context available from the second analysis unit 24. Then, in step S8, the second analysis unit 24 determines a response part output from the speech information set and the target information of the first analysis unit 22. If placeholders are still present in the response part output, these can be replaced by entities by means of the reconstruction unit 21, with optional access to the knowledge database 14 (see the corresponding arrows in figs. 1 and 2).
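Steps S1 through S8 can be wired together in a compact sketch with stubbed-out analysis units. The stub logic (first-fact selection, template answer) stands in for the trained networks and is purely illustrative, as are the database contents and the answer template.

```python
# End-to-end wiring of steps S1-S8 with illustrative stub units.
KNOWLEDGE_DB = {"james bond": ("movie", ["actor: X", "actor: Y"])}

def respond(utterance):
    low = utterance.lower()
    # S1: determine whether a predefined entity is present.
    entity = next((e for e in KNOWLEDGE_DB if e in low), None)
    if entity is None:
        return "Tell me more."                      # no-entity branch
    ent_class, facts = KNOWLEDGE_DB[entity]
    # S2: speech information set with the entity replaced by a placeholder.
    speech_set = low.replace(entity, f"[{ent_class}]")
    # speech_set would feed the second analysis unit; the stub ignores it.
    # S5-S7: information set and target-information selection (stub: first fact).
    target = facts[0]
    # S8: answer-part output at placeholder level, then reconstruction.
    answer_part = f"Speaking of that [{ent_class}]: did you know? [fact]"
    return (answer_part
            .replace("[fact]", target)
            .replace(f"[{ent_class}]", ent_class))
```

Note how the entity name never reaches the response generation step; only the placeholder and the selected target information do.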
An example of speech input information is "I liked the first James Bond movie very much". Here the name "James Bond" can be recognised as an entity, namely as a movie title. "[movie]" can then serve as a placeholder, and all information matching the movie can be determined from the knowledge database 14 as an information set. From this information set, the first analysis unit 22 determines target information that is plausible in terms of the speech context, for example the names of actors (here X and Y by way of example). The second analysis unit 24 then generates a matching response part output in view of the one or more entities determined as target information and in view of the speech context, for example the counter-question "Do you prefer X or Y as an actor?".
According to an embodiment of the invention which is not separately shown and which is independent of aspects of the illustrated embodiment, at least the first analysis unit (but optionally also the second analysis unit) may be based on an expert knowledge model instead of a neural network. Advantageously, no training data are then required for the corresponding analysis unit.
List of reference numerals
10 system
12 determination unit
14 knowledge database
16 speech information set generating unit
18 analysis block
20 vector generation unit
22 first analysis unit
24 second analysis unit
25 information set generating unit
A response output
E speech input information

Claims (10)

1. A method for generating a response output (a) in response to speech input information (E), the method having:
a) determining whether the speech input information (E) comprises at least one predefined entity;
and when this is the case:
b) generating a set of speech information, in which the entity is not comprised, based on the speech input information (E);
c) generating an information set, wherein the information set comprises information in a knowledge database (14) assigned to the entity;
d) with a first analysis unit (22): determining from the set of information at least one piece of target information important in respect of the entity;
e) with a second analysis unit (24): generating a response portion output based on the determined target information; and is
f) Generating a response output (A) based on the response portion output.
2. Method according to claim 1, characterized in that, in the speech information set, the entity is replaced by a placeholder, and the placeholder is then replaced again by the entity if it is also contained in the response part output.
3. The method according to claim 1 or 2, characterized in that the first and/or the second analysis unit (22, 24) comprises a neural network.
4. The method according to claim 3, characterized in that the second analysis unit (24) comprises a neural network in the form of a sequence-to-sequence model.
5. A method according to claim 3, characterized in that the first analysis unit (22) comprises a neural network in the form of a feedforward network and/or a neural network with attention mechanism.
6. Method according to any of the preceding claims, characterized in that the first analysis unit (22) obtains a vector representation of the information set as input quantities.
7. The method according to any one of the preceding claims, characterized in that the second analysis unit (24) is not set up to determine target information for the determined entity based on the set of information in the knowledge database (14).
8. Method according to one of the preceding claims, characterized in that the first analysis unit (22) does not obtain the set of speech information as input variables and/or is not set up to produce a response part output on the basis thereof.
9. A system (10) for generating a response output (a) in response to speech input information (E), the system having:
a) a determination unit (12) which is set up to determine whether the speech input information (E) comprises at least one predefined entity;
b) a speech information set generation unit (16) which is set up to generate a speech information set in which the determined entity is not contained, on the basis of the speech input information;
c) an information set generating unit (25) which is set up to generate an information set relating to the entity, wherein the information set contains information in a knowledge database (14) which is assigned to the entity;
d) a first evaluation unit (22) which is set up to determine at least one item of target information which is relevant in relation to the entity from the information set;
e) a second evaluation unit (24) which is set up to generate a response component output on the basis of the determined target information; and
f) a response output unit (21) which is designed to generate a response output on the basis of the response component output.
10. Use of the system (10) according to claim 9 for generating an answer output in case of a non-target oriented session.
CN201980084669.4A 2018-12-18 2019-11-11 Method, apparatus and application for generating answer output in response to speech input information Pending CN113228165A (en)

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
DE102018222156.1A DE102018222156A1 (en) 2018-12-18 2018-12-18 Method, arrangement and use for generating a response in response to a voice input information
DE102018222156.1 2018-12-18
PCT/EP2019/080901 WO2020126217A1 (en) 2018-12-18 2019-11-11 Method, arrangement and use for producing a response output in reply to voice input information

Publications (1)

Publication Number Publication Date
CN113228165A true CN113228165A (en) 2021-08-06

Family

ID=68581776

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201980084669.4A Pending CN113228165A (en) 2018-12-18 2019-11-11 Method, apparatus and application for generating answer output in response to speech input information

Country Status (3)

Country Link
CN (1) CN113228165A (en)
DE (1) DE102018222156A1 (en)
WO (1) WO2020126217A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113192498A (en) * 2021-05-26 2021-07-30 北京捷通华声科技股份有限公司 Audio data processing method and device, processor and nonvolatile storage medium

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1325527A (en) * 1998-09-09 2001-12-05 单一声音技术公司 Interactive user interface using speech recognition and natural language
CN101075435A (en) * 2007-04-19 2007-11-21 深圳先进技术研究院 Intelligent chatting system and its realizing method
DE102007042583A1 (en) * 2007-09-07 2009-03-12 Audi Ag Method for communication between natural person and artificial language system, involves issuing response depending on recognition of input, and controlling movement of avatar, design of avatar and visually displayed environment of avatar
CN104239343A (en) * 2013-06-20 2014-12-24 腾讯科技(深圳)有限公司 User input information processing method and device
US20150120288A1 (en) * 2013-10-29 2015-04-30 At&T Intellectual Property I, L.P. System and method of performing automatic speech recognition using local private data
CN108255934A (en) * 2017-12-07 2018-07-06 北京奇艺世纪科技有限公司 A kind of sound control method and device
CN108351893A (en) * 2015-11-09 2018-07-31 苹果公司 Unconventional virtual assistant interactions
US20180314689A1 (en) * 2015-12-22 2018-11-01 Sri International Multi-lingual virtual personal assistant

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20160027640A (en) * 2014-09-02 2016-03-10 삼성전자주식회사 Electronic device and method for recognizing named entities in electronic device

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1325527A (en) * 1998-09-09 2001-12-05 单一声音技术公司 Interactive user interface using speech recognition and natural language
CN101075435A (en) * 2007-04-19 2007-11-21 深圳先进技术研究院 Intelligent chatting system and its realizing method
DE102007042583A1 (en) * 2007-09-07 2009-03-12 Audi Ag Method for communication between natural person and artificial language system, involves issuing response depending on recognition of input, and controlling movement of avatar, design of avatar and visually displayed environment of avatar
CN104239343A (en) * 2013-06-20 2014-12-24 腾讯科技(深圳)有限公司 User input information processing method and device
US20140379738A1 (en) * 2013-06-20 2014-12-25 Tencent Technology (Shenzhen) Company Limited Processing method and device of the user input information
US20150120288A1 (en) * 2013-10-29 2015-04-30 At&T Intellectual Property I, L.P. System and method of performing automatic speech recognition using local private data
CN108351893A (en) * 2015-11-09 2018-07-31 苹果公司 Unconventional virtual assistant interactions
US20180314689A1 (en) * 2015-12-22 2018-11-01 Sri International Multi-lingual virtual personal assistant
CN108255934A (en) * 2017-12-07 2018-07-06 北京奇艺世纪科技有限公司 A kind of sound control method and device

Also Published As

Publication number Publication date
DE102018222156A1 (en) 2020-06-18
WO2020126217A1 (en) 2020-06-25

Similar Documents

Publication Publication Date Title
Zadeh et al. Memory fusion network for multi-view sequential learning
CN111144127B (en) Text semantic recognition method, text semantic recognition model acquisition method and related device
Bejani et al. Audiovisual emotion recognition using ANOVA feature selection method and multi-classifier neural networks
US11862145B2 (en) Deep hierarchical fusion for machine intelligence applications
CN112100383A (en) Meta-knowledge fine tuning method and platform for multitask language model
CN110728298A (en) Multi-task classification model training method, multi-task classification method and device
Aafaq et al. Dense video captioning with early linguistic information fusion
Kong et al. Symmetrical enhanced fusion network for skeleton-based action recognition
Dang et al. Dynamic multi-rater gaussian mixture regression incorporating temporal dependencies of emotion uncertainty using kalman filters
Anantha Rao et al. Selfie continuous sign language recognition with neural network classifier
Toor et al. Question action relevance and editing for visual question answering
CN117834780B (en) Intelligent outbound customer intention prediction analysis system
JP6408729B1 (en) Image evaluation apparatus, image evaluation method, and program
Hasan et al. TextMI: Textualize multimodal information for integrating non-verbal cues in pre-trained language models
Bielski et al. Pay Attention to Virality: understanding popularity of social media videos with the attention mechanism
Kankanhalli et al. Experiential sampling in multimedia systems
CN113228165A (en) Method, apparatus and application for generating answer output in response to speech input information
Hou et al. Confidence-guided self refinement for action prediction in untrimmed videos
Manousaki et al. Vlmah: Visual-linguistic modeling of action history for effective action anticipation
CN109885668A (en) A kind of expansible field interactive system status tracking method and apparatus
WO2020054822A1 (en) Sound analysis device, processing method thereof, and program
Pham et al. Speech emotion recognition: A brief review of multi-modal multi-task learning approaches
KR102574434B1 (en) Method and apparatus for realtime construction of specialized and lightweight neural networks for queried tasks
CN114970467A (en) Composition initial draft generation method, device, equipment and medium based on artificial intelligence
CN111462893B (en) Chinese medical record auxiliary diagnosis method and system for providing diagnosis basis

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination