CN113971212A - Multilingual question and answer method and device, electronic equipment and storage medium


Info

Publication number
CN113971212A
Authority
CN
China
Prior art keywords: text, question, paragraph, answer, multilingual
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010728828.4A
Other languages
Chinese (zh)
Inventor
李省平
肖达
莫兆全
钱胜杰
袁行远
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangzhou Caicheng Ming Technology Co ltd
Beijing Caiyun Ring Pacific Technology Co ltd
Original Assignee
Guangzhou Caicheng Ming Technology Co ltd
Beijing Caiyun Ring Pacific Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangzhou Caicheng Ming Technology Co ltd, Beijing Caiyun Ring Pacific Technology Co ltd filed Critical Guangzhou Caicheng Ming Technology Co ltd
Priority to CN202010728828.4A
Publication of CN113971212A
Legal status: Pending

Classifications

    • G06F16/3329 Natural language query formulation or dialogue systems
    • G06F16/3337 Translation of the query language, e.g. Chinese to English
    • G06F16/3344 Query execution using natural language analysis
    • G06F16/3346 Query execution using probabilistic model
    • G06F16/338 Presentation of query results
    • G06F16/951 Indexing; Web crawling techniques
    • G06F40/126 Character encoding
    • G06F40/216 Parsing using statistical methods
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates
    • G06F40/58 Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Databases & Information Systems (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Health & Medical Sciences (AREA)
  • Probability & Statistics with Applications (AREA)
  • Mathematical Physics (AREA)
  • Human Computer Interaction (AREA)
  • Machine Translation (AREA)

Abstract

The application provides a multilingual question-answering method and apparatus, an electronic device, and a storage medium. A question text input by a user is first obtained; a paragraph selection model and a multilingual word list are then used to determine, from a preset resource library, the paragraphs to be selected that correspond to the question text; finally, an answer generation model and the multilingual word list are used to determine the answer to the question text from the paragraphs to be selected. By using the multilingual word list both when retrieving the paragraphs to be selected and when generating the answer, repeated translation is avoided, which solves the problems that multiple translations take a long time and may distort the semantics, degrading answer quality and therefore the user experience. The method and apparatus can quickly generate multilingual answers, provide a convenient and efficient solution for the international deployment of intelligent question-answering systems and for use in multilingual environments, and spare users from having to set a language mode or choose a separate question-answering product for each language.

Description

Multilingual question and answer method and device, electronic equipment and storage medium
Technical Field
The present application relates to the field of natural language processing, and in particular, to a multilingual question-answering method and apparatus, an electronic device, and a storage medium.
Background
With the development of computer technology, natural language processing is more and more widely applied in the field of intelligent question-answering robots, such as Microsoft's XiaoIce, Baidu's XiaoDu, Apple's Siri, and Xiaomi's Xiao AI, most of which adopt an open-domain question-answering approach.
At present, an open-domain question-answering system mainly consists of three parts: question text analysis, paragraph extraction, and answer extraction. Each module can adopt a different algorithm to obtain better precision. The answer extraction module almost always adopts one of four structures, CNN (Convolutional Neural Network), RNN (Recurrent Neural Network), LSTM (Long Short-Term Memory model), and Transformer (self-attention model), or a combination of them, to extract or generate answers from a database or semantic library. However, as intelligent application products are promoted internationally, some users may interact with the intelligent question-answering robot in different natural languages at different times, or even mix multiple natural languages within one session; for example, a user may pose a question text composed of Cantonese, English, and Mandarin. Existing multilingual question-answering systems can only switch between two languages, or must translate the user's input into the language of their natural language processing model through a translator and, after an answer is generated, translate it back into the natural language the user used.
As a result, an existing multilingual question-answering system must go through at least two translation passes, which is time-consuming, and translation also causes the original semantics to deviate: a mistranslated question may lead to a poor-quality answer, and the translation of the answer itself may also deviate. Either case degrades the user experience and increases the time the user has to wait for the desired answer.
Disclosure of Invention
The application provides a multilingual question-answering method and apparatus, an electronic device, and a storage medium, to solve the technical problems in the prior art that a question-answering result can only be obtained through multiple translations, which takes a long time, easily introduces semantic deviation, yields answers of poor quality, and degrades the user experience.
In a first aspect, the present application provides a multilingual question-answering method, including:
acquiring a question text input by a user, wherein the question text comprises text in at least one language;
determining a paragraph to be selected corresponding to the question text from a preset resource library by using a paragraph selection model and a multilingual word list, wherein the multilingual word list is used for parsing the vocabularies of all languages in the question text, and the paragraph to be selected output by the paragraph selection model matches the language features of the input question text;
and determining an answer to the question text from the paragraph to be selected by using an answer generation model and the multilingual word list, wherein the answer output by the answer generation model matches the language features of the input paragraph to be selected.
In a possible design, the determining, from a preset resource library, a to-be-selected paragraph corresponding to the question text by using the paragraph selection model and the multilingual vocabulary, includes:
determining a retrieval result corresponding to the question text from the preset resource library by using a retrieval model;
determining a paragraph text according to the retrieval result and the multilingual word list by using a preset word segmentation algorithm;
determining a first text vector according to the paragraph text and a combination algorithm;
determining a second text vector according to the first text vector and a paragraph selection model;
and determining the paragraph to be selected according to the second text vector and a preset screening algorithm.
In a possible design, the determining the paragraph to be selected according to the second text vector and a preset filtering algorithm includes:
extracting a first word segmentation vector of the second text vector;
determining the selection probability of the paragraph text according to the first word segmentation vector and a preset decoding model;
and if the selection probability is greater than or equal to a preset probability threshold, taking the paragraph text as the paragraph to be selected.
Optionally, the determining, by using the retrieval model, a retrieval result corresponding to the question text from the preset resource library includes:
cleaning the content of the question text by using a question text cleaning algorithm to determine a cleaned question text meeting a preset format;
determining a retrieval result to be cleaned according to the cleaned question text and the retrieval model;
and determining the retrieval result according to the retrieval result to be cleaned and a result cleaning algorithm.
In one possible design, the determining a retrieval result to be cleaned according to the cleaned question text and the retrieval model includes:
inputting the cleaned question text into a search engine to determine a webpage address;
and capturing the content in the webpage corresponding to the webpage address by using a content capturing model as the retrieval result to be cleaned.
In one possible design, the determining a retrieval result to be cleaned according to the cleaned question text and the retrieval model includes:
screening an adapted answer file corresponding to the cleaned question text in a preset question and answer file library by using a matching algorithm;
and determining the retrieval result to be cleaned according to the adapted answer file and the content capturing model.
Optionally, the determining the answer to the question text from the paragraph to be selected by using the answer generation model and the multilingual vocabulary includes:
combining the question text and the paragraph to be selected to generate a third text vector;
determining sentence coding vectors according to the third text vectors and a language representation model, wherein the language representation model is matched with the multilingual word list;
determining the score of the sentence coding vector according to the sentence coding vector and a scoring decoding model;
if the score is larger than or equal to a preset score threshold value, determining the answer corresponding to the sentence coding vector by using the answer generation model and the multilingual word list;
and if the score is smaller than the preset score threshold value, generating the answer according to a preset reply template.
Optionally, the sentence coding vector includes: paragraph word segmentation units, multilingual word list word segmentation units, first attribute information and second attribute information, wherein the first attribute information is attribute information corresponding to the paragraph word segmentation units, and the second attribute information is attribute information corresponding to the multilingual word list word segmentation units.
In one possible design, the determining a score for the sentence-coding vector based on the sentence-coding vector and a scoring decoding model includes:
extracting a second word segmentation vector positioned at the head of the sentence coding vector;
and determining the score of the sentence coding vector according to the second word segmentation vector and a scoring decoding model.
Optionally, the determining the answer corresponding to the sentence coding vector by using the answer generation model and the multilingual word list includes:
decoding the sentence coding vector by using a conversion decoding model to determine a decoded word vector;
determining a paragraph score matrix according to a first multilayer perception network and the decoded word vector;
determining a word list score matrix according to a second multilayer perception network and the decoded word vector;
determining a comprehensive score matrix according to the paragraph score matrix and the word list score matrix by using a superposition algorithm;
and determining the answer according to the comprehensive score matrix by using the answer generation model, the sentence coding vector and the multilingual word list.
Optionally, after the acquiring the question text input by the user, the method further includes:
performing word segmentation on the question text by using a word segmentation algorithm to determine a question text word vector;
determining the languages corresponding to the question text according to the question text word vector and a language matching algorithm;
and determining the multilingual word list from multilingual word lists to be selected according to the languages, wherein the multilingual word list comprises at least two languages.
In a second aspect, the present application provides a multilingual question-answering apparatus, comprising:
the acquisition module is used for acquiring a question text input by a user;
the paragraph selection module is used for determining a paragraph to be selected according to the question text by using a paragraph selection model and a multilingual word list, wherein the multilingual word list comprises the vocabularies of all languages involved in the question text;
and the answer generating module is used for determining the answer of the question text from the paragraph to be selected by utilizing an answer generating model and the multi-language word list, and the answer output by the answer generating model is matched with the language characteristics of the input paragraph to be selected.
In one possible design, the paragraph selection module is configured to determine a paragraph to be selected according to the question text by using a paragraph selection model and a multilingual vocabulary, where the multilingual vocabulary includes vocabularies of all languages related to the question text, and includes:
the paragraph selection module is used for determining a retrieval result corresponding to the question text from the preset resource library by using a retrieval model;
the paragraph selection module is further used for determining a paragraph text according to the retrieval result and the multilingual word list by using a preset word segmentation algorithm;
the paragraph selection module is further configured to determine a first text vector according to the paragraph text and a combination algorithm;
the paragraph selection module is further configured to determine a second text vector according to the first text vector and a paragraph selection model;
the paragraph selection module is further configured to determine the paragraph to be selected according to the second text vector and a preset screening algorithm.
In a possible design, the paragraph selection module is further configured to determine the paragraph to be selected according to the second text vector and a preset filtering algorithm, and includes:
the paragraph selection module is further configured to extract a first segmentation vector of the second text vector;
the paragraph selection module is further configured to determine a selection probability of the paragraph text according to the first segmentation vector and a preset decoding model;
the paragraph selection module is further configured to take the paragraph text as the to-be-selected paragraph if the selection probability is greater than or equal to a preset probability threshold.
Optionally, the paragraph selection module is configured to determine, by using a retrieval model, a retrieval result corresponding to the question text from the preset resource library, and includes:
the paragraph selection module is used for cleaning the question text by using a question text cleaning algorithm so as to determine a cleaned question text meeting the preset format;
the paragraph selection module is further used for determining a retrieval result to be cleaned according to the cleaned question text and the retrieval model;
the paragraph selection module is further configured to determine the retrieval result according to the retrieval result to be cleaned and a result cleaning algorithm.
In one possible design, the paragraph selection module is further configured to determine a retrieval result to be cleaned according to the cleaned question text and the retrieval model, and includes:
the paragraph selection module is further used for inputting the cleaned question text into a search engine to determine a webpage address;
the paragraph selection module is further configured to capture the content in the webpage corresponding to the webpage address as the retrieval result to be cleaned by using a content capturing model.
In one possible design, the paragraph selection module is further configured to determine a retrieval result to be cleaned according to the cleaned question text and the retrieval model, and includes:
the paragraph selection module is further used for screening an adapted answer file corresponding to the cleaned question text in a preset question and answer file library by using a matching algorithm;
the paragraph selection module is further configured to determine the retrieval result to be cleaned according to the adapted answer file and the content capturing model.
Optionally, the answer generating module is configured to determine an answer to the question text from the paragraph to be selected by using an answer generating model and the multilingual word list, and includes:
the answer generation module is used for combining the question text and the paragraph to be selected to generate a third text vector;
the answer generation module is further used for determining sentence coding vectors according to the third text vectors and a language representation model, and the language representation model is matched with the multilingual word list;
the answer generation module is further used for determining scores of the sentence coding vectors according to the sentence coding vectors and a score decoding model;
the answer generation module is further configured to determine the answer corresponding to the sentence coding vector by using the answer generation model and the multi-language vocabulary if the score is greater than or equal to a preset score threshold;
the answer generation module is further configured to generate the answer according to a preset reply template if the score is smaller than the preset score threshold.
In one possible design, the answer generation module is further configured to determine a score of the sentence-coding vector according to the sentence-coding vector and a score-decoding model, and includes:
the answer generating module is also used for extracting a second word segmentation vector positioned at the head of the sentence coding vector;
and the answer generation module is also used for determining the score of the sentence coding vector according to the second word segmentation vector and a scoring decoding model.
Optionally, the answer generating module is further configured to determine the answer corresponding to the sentence coding vector by using the answer generating model and the multi-language vocabulary if the score is greater than or equal to a preset score threshold, and the determining includes:
if the score is greater than or equal to a preset score threshold, the answer generation module is further configured to decode the sentence coding vector by using a conversion decoding model to determine a decoded word vector;
the answer generation module is further used for determining a paragraph score matrix according to the first multilayer perception network and the decoded word vector;
the answer generation module is further used for determining a word list score matrix according to a second multilayer perception network and the decoded word vector;
the answer generation module is further used for determining a comprehensive score matrix according to the paragraph score matrix and the vocabulary score matrix by using a superposition algorithm;
the answer generation module is further configured to determine the answer according to the comprehensive score matrix by using the answer generation model, the sentence coding vector and the multilingual word list.
Optionally, the apparatus further includes a preprocessing module, and after the obtaining module obtains the question text input by the user:
the preprocessing module is used for performing word segmentation on the question text by using a word segmentation algorithm so as to determine a question text word vector;
the preprocessing module is further used for determining the languages corresponding to the question text according to the question text word vector and a language matching algorithm;
the preprocessing module is further used for determining the multilingual word list from multilingual word lists to be selected according to the languages, and the multilingual word list comprises at least two languages.
In a third aspect, the present application further provides an electronic device, including:
a processor; and,
a memory for storing executable instructions of the processor;
wherein the processor is configured to execute any one of the possible multilingual question-answering methods provided in the first aspect via execution of the executable instructions.
In a fourth aspect, the present application provides a storage medium having stored thereon a computer program which, when executed by a processor, implements any one of the possible multilingual question-answering methods provided in the first aspect.
The application provides a multilingual question-answering method and apparatus, an electronic device, and a storage medium. A question text input by a user is first obtained; a paragraph selection model and a multilingual word list are then used to determine, from a preset resource library, the paragraphs to be selected that correspond to the question text; finally, an answer generation model and the multilingual word list are used to determine the answer to the question text from the paragraphs to be selected. By using the multilingual word list both when retrieving the paragraphs to be selected and when generating the answer, repeated translation is avoided, which solves the problems that multiple translations take a long time and may distort the semantics, degrading answer quality and therefore the user experience. The method and apparatus can quickly generate multilingual answers, provide a convenient and efficient solution for the international deployment of intelligent question-answering systems and for use in multilingual environments, and spare users from having to set a language mode or choose a separate question-answering product for each language.
Drawings
In order to illustrate the embodiments of the present application or the technical solutions in the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. It is obvious that the drawings in the following description show some embodiments of the present application, and that other drawings can be obtained from them by those skilled in the art without inventive effort.
FIG. 1 is a schematic diagram illustrating the operation of a conventional multilingual question-answering system according to the present application;
FIG. 2 is a flow chart of a multilingual question-answering method according to an embodiment of the present application;
FIG. 3 is a schematic flowchart illustrating a multilingual question-answering method for determining paragraphs to be selected according to an embodiment of the present application;
FIG. 4 is a schematic diagram of a to-be-selected paragraph determination process according to an embodiment of the present application;
FIG. 5 is a flowchart illustrating an embodiment of a multi-lingual question and answer method for generating answers according to the present application;
FIG. 6 is a schematic diagram illustrating the principle of generating answers by a multilingual question-answering method according to an embodiment of the present application;
FIG. 7 is a schematic diagram illustrating a change of a vector dimension when an answer scoring decoding layer decodes an input vector according to an embodiment of the present application;
FIG. 8 is a schematic diagram illustrating an operating principle of a pointer network according to an embodiment of the present application;
FIG. 9 is a schematic structural diagram of a multilingual question-answering apparatus according to an embodiment of the present application;
FIG. 10 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some embodiments of the present application, but not all embodiments. All other embodiments, such as simple combination of the embodiments in the present application, or combination of partial steps in the embodiments, or order change, which can be obtained by a person skilled in the art without inventive efforts, are within the scope of the present application.
The terms "first," "second," "third," "fourth," and the like in the description and in the claims of the present application and in the above-described drawings (if any) are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the application described herein are, for example, capable of operation in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
With the popularization of intelligent question-answering systems in various application products, the application scenarios they face are increasingly complicated and diversified. One of these scenarios is multilingual question answering. China is a country of many ethnic groups, with different dialects used in different regions; after many years of promoting Mandarin, many users mix several languages within a single question-and-answer exchange, for example using the Bashu (Sichuan) dialect for some words or sentences, English for some modern vocabulary, and Mandarin for the rest, forming a multilingual question. In addition, as the economy develops, joint ventures and other foreign-funded enterprises become more common, so employees of such enterprises may ask the intelligent question-answering system questions in several languages at once, or different employees may ask questions in different languages. Therefore, multilingual question answering has become an important development direction for intelligent question-answering systems.
Fig. 1 is a schematic diagram illustrating the working principle of a conventional multilingual question-answering system. As shown in Fig. 1, the conventional multilingual question-answering system receives a question input by a user and then translates the question into the language used by the developer of the question-answering system, that is, the core language of its natural language processing model. For example, if a user inputs a question expressed in English and the core language of the natural language processing model is Chinese, the English question must first be translated into Chinese by the translation module. The natural language processing model then analyzes the translated question, searches the database for semantic paragraphs that can be used to organize an answer, extracts the paragraphs to be selected, and then screens them with an algorithm and organizes an answer that conforms to human grammar and natural language habits. Following the above example, if the answer is in Chinese, it has to be translated into English again by the translation module before being output.
Most conventional multilingual question-answering systems of the kind shown in Fig. 1 can handle only two languages; a few can handle more, but as the number of languages grows the required translation time grows greatly, and existing translation algorithms cannot fully guarantee that the translated semantics do not deviate, so the quality of the generated answer depends to a large extent on the quality of the translation. For example, if the question translation deviates, the generated answer will certainly deviate from the user's expectation; or the question translation may be fine but the translation of the generated answer may deviate. In either case the number of dialogue turns inevitably increases, and even then the user may not obtain a satisfactory answer, which seriously affects the user experience; translation also increases processing time and thus the user's waiting time, which likewise affects the experience. There is also the case in which a user mixes several languages in one question: existing translation modules find it difficult to grasp the exact semantics, and the final answer is output in only one language, which may not match the user's language habits and makes the system feel insufficiently intelligent. For example, in many foreign-funded or joint-venture enterprises the habit of mixing Chinese and English phrases is ubiquitous; in some research institutions and enterprises, Chinese and English terms are constantly interleaved; in some regions, such as the Pearl River Delta, Chinese, English, and Cantonese are often combined and alternated, while in the Yangtze River Delta, Chinese, English, and the Wu dialect are often combined and alternated. In addition, in some international organizations with participants from many countries, more than three languages may be used at the same time; it is not necessarily the same user who uses several languages, as different users may converse with the intelligent question-answering system in different languages. Existing intelligent question-answering systems can handle only a small number of languages, support only a limited set of common languages in order to save development and operating cost, or improve processing efficiency by switching language modes. Moreover, when a high-quality answer cannot be obtained, an existing system still outputs the answer with the highest score, even though that answer may deviate seriously from the range of answers the user expects.
The present application aims to solve the above problems. The inventors found that the core of these problems is excessive reliance on translation, and further analysis shows that the root of this reliance is that the natural language processing model is trained on, and fixed to, one language or a limited number of languages during development, so that the model is strongly coupled to those languages. The inventors therefore propose the inventive idea of decoupling the natural language processing algorithm from the specific natural language: a multilingual word list that can be changed dynamically according to actual needs is introduced, the processing logic is separated from the languages involved, and the multilingual word list is introduced in the training stage of the natural language processing model, so that the influence of stripping out the language on the model's processing capability is weakened as much as possible. Multiple translations are thus avoided, and the generated answers can mix several languages, so that the intelligent question-answering system feels more intelligent and more professional to the user, who experiences something closer to conversing with a technical expert or a colleague, improving the user experience.
The technical solution of the present application will be described in detail below with specific examples. The following several specific embodiments may be combined with each other, and details of the same or similar concepts or processes may not be repeated in some embodiments.
Fig. 2 is a flowchart of a multilingual question-answering method according to an embodiment of the present disclosure. As shown in fig. 2, the multilingual question-answering method provided in this embodiment specifically includes:
s201, acquiring a question text input by a user.
In this step, the question text includes text in at least one language.
Specifically, the user inputs a question in text form on the intelligent question-answering input interface through a keyboard or a touch screen, or inputs speech through a microphone, which the intelligent question-answering system converts into a question in text form through a speech recognition module; in either case, the question text input by the user is obtained. The question text input by the user can contain several languages: words of several languages may alternate within one sentence, or the languages may alternate between short clauses. For example, the user may enter a single question in which Cantonese, English, and Mandarin words are mixed.
S202, determining a paragraph to be selected corresponding to the question text from a preset resource library by using the paragraph selection model and the multilingual word list.
In this step, the multilingual vocabulary is used to parse the vocabularies of all languages in the question text, and the paragraph to be selected output by the paragraph selection model matches with the language features of the input question text.
Specifically, the question text may be input into a search engine, retrieved from a local area network or the Internet, all text paragraphs in the web pages related to the question text may be extracted, and word segmentation may be performed to obtain a plurality of paragraphs to be selected. Optionally, the question text may also be input into a database retrieval engine, document files corresponding to the question text may be searched in a preset database, and the document files may be subjected to word segmentation and split into a plurality of paragraphs to be selected.
It should be noted that, because the input question text contains multiple languages, the question text first needs to be segmented and split. A segment in one language can be expanded into a complete sentence by using the multilingual word list, or replaced by its translation in another language taken from the word list, so as to form a question sentence in a single language; alternatively, the multilingual word list can be searched directly for the segmented words or for conventional phrases that have no comparable multilingual combination. Examples of such mixed-language combination phrases are "Walmart supermarket", "CNN model", and "EV (electric vehicle) new energy vehicle". The question texts processed in this way are input into a web search engine or a database retrieval engine to obtain paragraphs to be selected that contain only one language or that contain several languages, and the language features of the paragraphs to be selected match the language features of the question text.
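For illustration only, the following sketch shows the basic idea of looking up mixed-language word units in one shared multilingual word list, so that a Chinese/English mixed phrase can be handled without any translation step. The vocabulary contents, index values, and segmentation below are invented for the example and are not taken from the patent.

```python
MULTILINGUAL_VOCAB = {
    "walmart": 101, "沃尔玛": 101,      # the same concept in two languages shares one index
    "supermarket": 102, "超市": 102,
    "cnn": 103, "model": 104, "模型": 104,
    "ev": 105, "新能源": 106, "汽车": 107,
}
UNK_ID = 0

def lookup_mixed_tokens(tokens):
    """Map every word unit of a mixed-language question to a word list index."""
    return [MULTILINGUAL_VOCAB.get(token.lower(), UNK_ID) for token in tokens]

print(lookup_mixed_tokens(["Walmart", "超市"]))        # [101, 102] - no translation needed
print(lookup_mixed_tokens(["EV", "新能源", "汽车"]))    # [105, 106, 107]
```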
S203, determining an answer to the question text from the paragraphs to be selected by using the answer generation model and the multilingual word list.
In this step, the answer output by the answer generation model matches the language features of the input paragraphs to be selected. Specifically, the output answer may be an answer text sentence composed from paragraphs to be selected in a single language, or an answer text sentence composed from paragraphs to be selected in several languages. The multilingual question-answering system can display the answer text on a display screen, or convert it into speech through a text-to-speech module and play it.
It should be noted that those skilled in the art can choose the answer generation model according to actual needs, for example a CNN model, an RNN model, or another model capable of natural language dialogue processing. When the answer is generated, the multilingual word list is called as necessary to connect the selected paragraphs, or used to split and recombine paragraphs to be selected in different languages, so as to obtain an answer text that contains several languages and can be understood by the user.
It should be further noted that the answer generation model may generate several answers, after which the answer with the highest quality is selected by a scoring module or a scoring algorithm in the answer generation model.
Optionally, when the scores of all answers are too low, i.e. lower than a preset threshold, a reply is selected from preset reply sentences and the generated answer text is not output, because a very low score means the generated answer has no practical meaning, cannot be understood by the user, or deviates too far from what the user expects. In this case, phrases along the lines of "Sorry, I didn't understand what you said", "Could you please refine your question", or "I'm afraid I can't answer that" are output to guide the user's next question, so as to obtain more useful information for generating the desired answer, instead of directly outputting a low-quality answer that would make the user feel the question-answering system is answering at random, which would harm the user experience.
The application provides a multilingual question-answering method in which a question text input by a user is first obtained, a paragraph selection model and a multilingual word list are then used to determine the paragraphs to be selected corresponding to the question text from a preset resource library, and an answer generation model and the multilingual word list are finally used to determine the answer to the question text from the paragraphs to be selected. By using the multilingual word list both when retrieving the paragraphs to be selected and when generating the answer, repeated translation is avoided, which solves the problems that multiple translations take a long time and may distort the semantics, degrading answer quality and therefore the user experience. The method can quickly generate multilingual answers, provides a convenient and efficient solution for the international deployment of intelligent question-answering systems and for use in multilingual environments, and spares users from having to set a language mode or choose a separate question-answering product for each language.
For ease of understanding, a specific implementation of determining the paragraph to be selected is described below in conjunction with fig. 3.
Fig. 3 is a schematic flowchart illustrating a process of determining a to-be-selected paragraph by a multilingual question-and-answer method according to an embodiment of the present application. As shown in fig. 3, the specific implementation step of determining the paragraph to be selected provided in this embodiment specifically includes:
s301, determining a retrieval result corresponding to the question text from a preset resource library by using a retrieval model.
In this step, the preset resource library includes a document library containing several languages, a local area network web page resource library, an Internet web page resource library, and the like. The retrieval model includes search engines such as the Baidu, Google, and Sogou search engines, as well as various database retrieval models and data retrieval algorithms.
The retrieval result comprises: the web page or the web address corresponding to the web page, and the document or the storage position corresponding to the document.
In one possible design, the specific implementation of this step can be described in detail as follows:
firstly, a problem text cleaning algorithm is utilized to clean the content of the problem text so as to determine the cleaning problem text meeting the preset format. The function of this step is to screen out some wrong or meaningless symbols, including special symbols and blank symbols such as "," there is no other content between two commas, and one of the commas can be deleted. Or when the input content is incomplete, the input content can be complemented by a cleaning algorithm, and if the input problem text is the 'nearest supermarket', the cleaning algorithm can complement the input problem text into the 'nearest supermarket' or the 'cheapest supermarket', and the like.
Then, the retrieval result to be cleaned is determined according to the cleaned question text and the retrieval model. There are at least two implementations of this step:
One implementation is: firstly, the cleaned question text is input into a search engine to determine a webpage address; then a content capturing model is used to capture the content of the webpage corresponding to that address as the retrieval result to be cleaned. The other implementation is: firstly, a matching algorithm is used to screen, in a preset question and answer file library, an adapted answer file corresponding to the cleaned question text; then the retrieval result to be cleaned is determined according to the adapted answer file and the content capturing model. It can be understood that the two modes can also be combined, that is, the retrieval result is not only captured from web pages but also screened from the preset question and answer file library.
Finally, the retrieval result is determined according to the retrieval result to be cleaned and a result cleaning algorithm. Specifically, the result cleaning algorithm performs semantic screening on the text data of the retrieval result to be cleaned and deletes text that has no practical meaning or has low relevance to the question text. Optionally, the retrieval result to be cleaned may be organized into a preset format, for example the data captured from a web page is divided into K paragraphs according to the paragraph marker "\n" (the size of K varies with the data of the web page) and returned. For example, if the preset format of the retrieval result comprises 30 word segmentation vectors but the retrieval result to be cleaned contains no more than 30 word segments, the sequence is padded to 30 positions with 0 (empty word vectors); it is understood that if the number exceeds 30, the retrieval result data is split into two pieces.
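A minimal sketch of the formatting step described above, assuming a caller-supplied tokenizer: the grabbed text is split into K paragraphs on the paragraph marker "\n", long paragraphs are split into pieces of at most 30 token positions, and short pieces are padded with 0. The function and parameter names are illustrative.

```python
def format_retrieval_result(raw_text, tokenize, max_len=30, pad_id=0):
    """Split grabbed page text into paragraphs and fixed-length id sequences."""
    paragraphs = [p for p in raw_text.split("\n") if p.strip()]   # K varies with the page
    pieces = []
    for paragraph in paragraphs:
        ids = tokenize(paragraph)                                 # word list indices
        for start in range(0, len(ids), max_len):                 # split overlong paragraphs
            chunk = ids[start:start + max_len]
            chunk += [pad_id] * (max_len - len(chunk))            # pad short pieces with 0
            pieces.append(chunk)
    return pieces
```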
Cleaning the data of the retrieval result to be cleaned yields retrieval data with a higher degree of relevance to the question text. Because the question contains several languages, more irrelevant data is obtained during retrieval, so the cleaning must be performed jointly across the languages rather than for a single language as in the prior art; only then can the cleaned retrieval result meet the requirements of a multilingual question-answering system, since otherwise the quality of the multilingual answers generated from it would be poor.
S302, determining a paragraph text according to the retrieval result and the multilingual word list by using a preset word segmentation algorithm.
In this embodiment, the preset word segmentation algorithm is a FullTokenizer model. When the FullTokenizer model performs word segmentation on a text, it splits the text into word segmentation units (Tokens): if a corresponding word can be found in the basic semantic library, the word segmentation unit is an effective unit; if no corresponding word can be found, the unit is labelled UNK, meaning it cannot be recognized by the model. In the present application, the multilingual word list is used as the basic semantic library, which greatly improves the proportion of effective word segmentation units produced by the FullTokenizer model, because a single-language basic semantic library cannot recognize the word segments of other languages. The multilingual word list therefore improves the effective rate of word segmentation, which in turn improves the expressive capability of the multilingual question-answering system's models and yields more reasonable and effective answers.
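The following simplified stand-in (not the actual FullTokenizer implementation) illustrates the behaviour described above: greedy longest-match segmentation against the multilingual word list, with any unmatched word labelled [UNK]. Real BERT tokenizers additionally use "##" continuation marks and per-character pre-splitting for Chinese, which are omitted here for brevity.

```python
def wordpiece_tokenize(text, vocab, unk_token="[UNK]"):
    """Greedy longest-match segmentation against the multilingual word list."""
    pieces = []
    for word in text.split():                    # whitespace pre-split for brevity only
        start, word_pieces = 0, []
        while start < len(word):
            end, match = len(word), None
            while end > start:                   # longest substring present in the vocab
                if word[start:end] in vocab:
                    match = word[start:end]
                    break
                end -= 1
            if match is None:                    # unmatched -> the whole word is labelled UNK
                word_pieces = [unk_token]
                break
            word_pieces.append(match)
            start = end
        pieces.extend(word_pieces)
    return pieces
```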
Fig. 4 is a schematic diagram of the process of determining the paragraphs to be selected according to an embodiment of the present application. As shown in Fig. 4, the blocks "question, paragraph 1" through "question, paragraph k" at the bottom are the k paragraph texts; [CLS] is the first word segmentation unit of a paragraph text and can be understood as its start tag; [SEP] denotes the separator between sentences in the paragraph text and can be understood as a sentence separation mark; and Tok1 to Tokk or Tokm denote the individual word segmentation units (Tokens) in the paragraph text.
For ease of understanding, the following steps are described in conjunction with fig. 4.
S303, determining a first text vector according to the paragraph text and the combination algorithm.
In this step, each word segmentation unit in the paragraph text has a corresponding index number in the multilingual word list; it can be understood that if a word segmentation unit has no corresponding word in the multilingual word list, it is assigned an unrecognizable identifier, for example 0. All word segmentation units in the paragraph text are replaced by their index numbers. As shown in Fig. 4, the dimension of each paragraph text output by the FullTokenizer model is kept consistent; in this embodiment the length is 768, and, as described above, a single paragraph text shorter than 768 is padded with 0. The row vectors obtained after replacement are then combined position by position across all paragraph texts to obtain the first text vector, which has the same dimension of 768. For example, "E(CLS), E1, …, En, E(SEP), E1, …, Em" in Fig. 4 is the first text vector.
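Under one reading of Fig. 4, the construction of the first text vector can be sketched as follows: every word segmentation unit is replaced by its index in the multilingual word list (0 when no entry exists), each "question + paragraph_i" row is padded with 0 to the fixed length of 768 used in this embodiment, and the K rows are stacked into one input matrix for the paragraph selection model. The exact combination rule is an interpretation, not a quotation of the patent.

```python
import numpy as np

SEQ_LEN = 768   # fixed row length used in this embodiment

def to_first_text_vector(tokenized_rows, vocab, unk_id=0):
    """tokenized_rows: K token lists, each of the form "[CLS] question [SEP] paragraph_i ..."."""
    matrix = np.zeros((len(tokenized_rows), SEQ_LEN), dtype=np.int64)
    for i, row in enumerate(tokenized_rows):
        ids = [vocab.get(tok, unk_id) for tok in row][:SEQ_LEN]   # word list index, 0 if unknown
        matrix[i, :len(ids)] = ids                                # remaining positions stay 0
    return matrix
```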
S304, determining a second text vector according to the first text vector and the paragraph selection model.
In this embodiment, the paragraph selection model is a BERT (Bidirectional Encoder Representations from Transformers) model; it should be noted that the BERT model needs to be pre-trained with the multilingual word list. As shown in Fig. 4, after the first text vector "E(CLS), E1, …, En, E(SEP), E1, …, Em" is input into the BERT model, the second text vector "C, T1, …, Tk, T(SEP), T1, …, Tm" of the same dimension is obtained.
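As a hedged example of this encoding step, the snippet below runs a multilingual BERT encoder over question/paragraph pairs with the Hugging Face transformers library; the patent does not name a specific checkpoint, so "bert-base-multilingual-cased" is only an illustrative stand-in for a BERT model pre-trained with a multilingual word list.

```python
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-multilingual-cased")
encoder = BertModel.from_pretrained("bert-base-multilingual-cased")

question = "最近的 supermarket 在哪里"                      # mixed-language question text
paragraphs = ["Walmart 超市 distance 500 m", "最近的超市今天不营业"]

# One "question + paragraph_i" row per paragraph, padded to a common length.
inputs = tokenizer([question] * len(paragraphs), paragraphs,
                   padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    second_text_vector = encoder(**inputs).last_hidden_state   # shape (K, seq_len, 768)
```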
The function of the paragraph selection model is to recombine and expand the retrieval results on the basis of the multilingual word list, so as to obtain more semantic material that better fits the question and to support the later generation of more accurate multilingual answers.
It should be noted that, a person skilled in the art may select a specific implementation manner of the paragraph selection model according to practical situations, and is not limited to the BERT model described in this embodiment.
S305, determining the paragraph to be selected according to the second text vector and a preset screening algorithm.
In this embodiment, the steps specifically include: extracting a first word segmentation vector of the second text vector; determining the selection probability of the paragraph text according to the first word segmentation vector and a preset decoding model; and if the selection probability is greater than or equal to a preset probability threshold, taking the paragraph text as the paragraph to be selected.
As shown in Fig. 4, the first word segmentation unit of the second text vector, i.e. the first word segmentation vector "C", is extracted and decoded by the preset screening algorithm. Specifically, the first word segmentation vector "C" is passed in sequence through: a fully-connected layer (a 768 × 768 linear layer), an activation layer, a Dropout layer (coefficient 0.2), a paragraph linear layer, and a Softmax layer (which normalizes the probabilities of the K paragraphs so that they sum to 1). The probability array of length K is then sorted from large to small, and the paragraphs whose probability rank meets a preset condition are taken as the paragraphs to be selected; for example, the N paragraphs with the highest probabilities may be selected, or the paragraphs whose probability exceeds a preset threshold. Those skilled in the art can choose the specific screening form according to the actual situation, which is not limited in the present application.
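One possible implementation of this screening head is sketched below in PyTorch. The layer sizes and the Dropout coefficient follow the description above; the ReLU activation and the 1-dimensional paragraph scoring layer are assumptions, since the patent only names the layer types.

```python
import torch
import torch.nn as nn

class ParagraphScreeningHead(nn.Module):
    """[CLS] vector of each "question + paragraph" row -> paragraph selection probability."""
    def __init__(self, hidden=768, dropout=0.2):
        super().__init__()
        self.fc = nn.Linear(hidden, hidden)     # fully-connected 768 x 768 layer
        self.act = nn.ReLU()                    # activation layer (type assumed)
        self.drop = nn.Dropout(dropout)         # Dropout layer, coefficient 0.2
        self.score = nn.Linear(hidden, 1)       # "paragraph linear layer" (assumed 1-dim)

    def forward(self, cls_vectors):             # cls_vectors: (K, 768)
        h = self.drop(self.act(self.fc(cls_vectors)))
        return torch.softmax(self.score(h).squeeze(-1), dim=0)   # length K, sums to 1

# probs = ParagraphScreeningHead()(second_text_vector[:, 0, :])
# selected = probs.argsort(descending=True)[:N]      # top-N, or probs >= preset threshold
```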
For ease of understanding, one possible implementation of step S203 is specifically described below in conjunction with fig. 5.
Fig. 5 is a schematic flowchart of an answer generation method with multiple languages according to an embodiment of the present application. As shown in fig. 5, the specific steps of determining the answer of the question text from the paragraph to be selected by using the answer generation model and the multi-language vocabulary in the embodiment of the present application are as follows:
s501, combining the question text and the paragraph to be selected to generate a third text vector.
In this step, the paragraph vector corresponding to the problem text after the cleaning and word segmentation is combined with the paragraph vector to be selected to form a third text vector. The combination mode can be that the word segmentation unit of the question text is placed in front of the third text vector, the paragraph to be selected is placed behind the third text vector, and the sentence separation mark SEP is added in the middle of the third text vector. It can be understood that there are many combinations, and the participle of the question text may be placed at the tail of the third text vector, or may be interspersed between the paragraphs to be selected, and those skilled in the art may select the combination form according to a specific application scenario, which is not limited in the present application.
Fig. 6 is a schematic diagram illustrating the principle of answer generation in a multilingual question-answering method according to an embodiment of the present application. As shown in fig. 6, the word segmentation units corresponding to the question text are placed at the front of the third text vector, the first word segmentation unit of the third text vector is the start mark [CLS], and the paragraphs to be selected are then arranged in sequence, separated by the sentence separation mark [SEP].
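The layout of fig. 6 can be sketched in a few lines of Python; the helper name and the toy tokens below are purely illustrative assumptions.

def build_third_sequence(question_tokens, candidate_paragraphs):
    # Question tokens first, [CLS] at the start, then each candidate paragraph
    # in order, separated by the [SEP] sentence separation mark
    # (one of the combination forms described above).
    sequence = ["[CLS]"] + list(question_tokens)
    for paragraph_tokens in candidate_paragraphs:
        sequence += ["[SEP]"] + list(paragraph_tokens)
    return sequence

third_sequence = build_third_sequence(
    ["which", "year", "was", "the", "tower", "built"],
    [["built", "between", "1887", "and", "1889"],
     ["it", "is", "324", "metres", "tall"]],
)
# ['[CLS]', 'which', ..., '[SEP]', 'built', ..., '[SEP]', 'it', ...]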
And S502, determining a sentence coding vector according to the third text vector and the language representation model.
In this step, the language representation model is matched with the multilingual vocabulary. Specifically, as shown in fig. 6, a BERT model is selected as the language representation model in this embodiment. The third text vector is input into the BERT model and, after processing, is combined with the multilingual vocabulary to generate multilingual answer text material, which expands the resources available for subsequently generating multilingual answers and thus yields higher-quality multilingual answers. The output of the BERT model is the sentence coding vector.
It should be noted that the BERT model needs to be pre-trained with the multilingual vocabulary, so that the paragraphs to be selected can be expanded and enriched in a more targeted manner in combination with the multilingual vocabulary.
Optionally, the sentence coding vector includes: paragraph participles, multilingual vocabulary participles, first attribute information, and second attribute information, where the first attribute information is the attribute information corresponding to the paragraph participles and the second attribute information is the attribute information corresponding to the multilingual vocabulary participles. Specifically, the attribute information may include the index position of a paragraph participle or a multilingual vocabulary participle in the paragraph text or the multilingual vocabulary, the part of speech of the participle (e.g., whether it denotes a person name, a place name, or an action), a characteristic attribute of the participle (e.g., the application scenario it represents, such as "supermarket" having the characteristic attribute "shopping"), and so on.
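One way to picture this attribute information is as a small record attached to each participle; the field names in the sketch below are assumptions chosen only to mirror the examples just given.

from dataclasses import dataclass

@dataclass
class ParticipleAttributes:
    # Illustrative container for the first/second attribute information.
    index: int            # index position in the paragraph text or the multilingual vocabulary
    part_of_speech: str   # e.g. person name, place name, action
    characteristic: str   # e.g. "supermarket" -> characteristic attribute "shopping"

attrs = ParticipleAttributes(index=17, part_of_speech="place name", characteristic="shopping")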
And S503, determining the score of the sentence coding vector according to the sentence coding vector and the scoring decoding model.
In this step, the concrete steps include:
S503_1, extracting a second word segmentation vector positioned at the head of the sentence coding vector.
As shown in fig. 6, the first word segmentation unit "C" of the sentence coding vector is taken as the second word segmentation vector, extracted, and input into the answer scoring decoding layer shown in fig. 6.
S503_2, determining the score of the sentence coding vector according to the second word segmentation vector and the score decoding model.
As shown in fig. 6, the second word segmentation vector "C" is passed in sequence through: a fully connected layer (a 768 × 768 linear layer), an activation layer, a Dropout layer (coefficient 0.2), a paragraph linear layer, and a Softmax layer (so that the score for having a valid answer and the score for having no valid answer sum to 1).
Fig. 7 is a schematic diagram illustrating how the vector dimensions change when the answer scoring decoding layer decodes an input vector, according to an embodiment of the present disclosure. As shown in fig. 7, [CLS] is the second word segmentation vector, a 1 × 768-dimensional vector. Its dimension is unchanged after the fully connected layer, the activation layer, and the Dropout layer; after the paragraph linear layer the dimension becomes 1 × 2; and the Softmax layer finally outputs an array of length 2 whose two numbers occupy index position 0 and index position 1 respectively. The index 0 position holds the answer score for there being no valid answer, and the index 1 position holds the answer score for there being a valid answer. That is, for a sentence coding vector, the scoring decoding model predicts from both directions whether the answer it would generate is valid, subject to the constraint that the valid score and the invalid score sum to 1. Whether an answer needs to be generated is then screened by setting a preset scoring threshold. The preset scoring threshold may be a fixed value, such as 0.5, or a dynamic value generated from the input question and/or intermediate parameters of the BERT model; a person skilled in the art may choose how to set the preset scoring threshold according to the specific application scenario.
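A minimal PyTorch sketch of this scoring path follows; the Tanh activation and the fixed 0.5 threshold are assumptions used only for illustration.

import torch
import torch.nn as nn

class AnswerScoringHead(nn.Module):
    # Sketch of the answer scoring decoding layer: 1 x 768 [CLS] vector ->
    # fully connected 768 x 768 layer -> activation -> Dropout(0.2) ->
    # linear layer to 1 x 2 -> Softmax, where index 0 = "no valid answer"
    # and index 1 = "valid answer".
    def __init__(self, hidden: int = 768, dropout: float = 0.2):
        super().__init__()
        self.fc = nn.Linear(hidden, hidden)
        self.act = nn.Tanh()             # activation layer (assumed Tanh)
        self.drop = nn.Dropout(dropout)
        self.out = nn.Linear(hidden, 2)  # reduces the 1 x 768 vector to 1 x 2

    def forward(self, cls_vector: torch.Tensor) -> torch.Tensor:
        scores = self.out(self.drop(self.act(self.fc(cls_vector))))
        return torch.softmax(scores, dim=-1)  # the two scores sum to 1

head = AnswerScoringHead()
scores = head(torch.randn(1, 768))
has_valid_answer = scores[0, 1].item() >= 0.5  # compare the index-1 score to the threshold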
S504, if the score is larger than or equal to the preset score threshold, the answer generation model and the multilingual word list are used for determining the answer corresponding to the sentence coding vector.
In this step, the concrete steps include:
S5041, decoding the sentence coding vector by using a conversion decoding model to determine the decoded word vectors.
As shown in fig. 6, after scoring by the scoring decoding model, if the valid-answer score of the sentence coding vector is greater than or equal to the preset score threshold, the sentence coding vector is input into the conversion decoding model for decoding to obtain the decoded word vectors, which are used to extract the answer text material from the sentence coding vector. The decoded word vectors are then input into the pointer network, which combines them with the multilingual vocabulary to organize and generate the final answer. The working principle of the pointer network is explained below with reference to fig. 8 and the following steps.
S5042, determining a paragraph score matrix according to the first multi-layer perception network and the decoded word vector.
Fig. 8 is a schematic diagram of the working principle of a pointer network according to an embodiment of the present application. As shown in fig. 8, the decoded word vectors output by the conversion decoding model are input into two different multilayer perception networks, namely a first multilayer perception network and a second multilayer perception network. A multilayer perception network is a network composed of a series of linear layers and activation layers. The first multilayer perception network is used to obtain the paragraph score matrix, which is a T_len × S_len score matrix, where T_len denotes the length of the answer sequence and S_len denotes the length of the decoded word vector sequence. It can be understood that each row of the score matrix corresponds to one answer word segmentation unit and holds its score against each decoded word vector; that is, the paragraph score matrix represents the score of each word segmentation unit of the answer over the decoded word vectors. For example, if one row of the paragraph score matrix is [0.3, 0.1, 0, 0.8], then 0.3 is the score obtained by a certain answer word segmentation unit (Token) in the first decoded word vector, and 0.8 is the score it obtains in the fourth decoded word vector. The role of the paragraph score matrix is to allow the corresponding word to be selected from the decoded word vectors when generating the answer; for example, a higher score indicates a higher probability of being selected.
And S5043, determining a vocabulary score matrix according to the second multilayer perception network and the decoded word vectors.
The second multilayer perception network is used to obtain the vocabulary score matrix, which is a T_len × V_len score matrix, where T_len denotes the length of the answer sequence and V_len denotes the length of the multilingual vocabulary.
The role of the vocabulary score matrix is to enable the selection of corresponding words from the multi-lingual vocabulary when generating answers.
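The interfaces of the two perception networks are described only at the level of fig. 8, so the sketch below adopts one simple reading: the conversion decoding model yields one 768-dimensional state per answer position, and each perception network maps it to scores over the S_len decoded word vectors and the V_len vocabulary entries respectively. The sizes, the two-layer depth, and the ReLU activation are all assumptions.

import torch
import torch.nn as nn

d_model, T_len, S_len, V_len = 768, 20, 128, 120_000

def perception_network(out_dim: int) -> nn.Sequential:
    # "a series of linear layers and activation layers"
    return nn.Sequential(nn.Linear(d_model, d_model), nn.ReLU(), nn.Linear(d_model, out_dim))

first_network = perception_network(S_len)    # -> paragraph score matrix, shape (T_len, S_len)
second_network = perception_network(V_len)   # -> vocabulary score matrix, shape (T_len, V_len)

decoder_states = torch.randn(T_len, d_model)        # one state per answer position (assumed shape)
paragraph_scores = first_network(decoder_states)    # row t: scores of answer position t over the decoded word vectors
vocabulary_scores = second_network(decoder_states)  # row t: scores of answer position t over the multilingual vocabulary
print(paragraph_scores.shape, vocabulary_scores.shape)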
S5044, determining a comprehensive score matrix according to the paragraph score matrix and the vocabulary score matrix by using a superposition algorithm.
In this step, as shown in fig. 8, each position of each matrix has its own index number, and the scores at corresponding index numbers of the paragraph score matrix and the vocabulary score matrix can be added or subtracted according to a preset rule; for example, the scores with the same index number may be added or subtracted, i.e. the first position of the paragraph score matrix is added to or subtracted from the first position of the vocabulary score matrix. A person skilled in the art can select different superposition forms according to the specific situation, which is not limited in the present application.
S5045, determining an answer according to the comprehensive score matrix by using an answer generation model, the sentence coding vector and the multi-language word list.
According to the score of each answer word segmentation unit given in the comprehensive score matrix, the answer generation model determines, for each position, whether to select a word segmentation unit from the multilingual vocabulary or a word segmentation unit from the decoded word vectors, and combines the selections to obtain the final answer.
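The superposition and the final word choice can be illustrated as follows. Mapping each decoded word vector to the vocabulary index of its source word, adding the two matrices with a scatter-add, and greedy argmax selection are assumptions made for this sketch; the embodiment leaves the preset rule and the combination strategy open.

import torch

T_len, S_len, V_len = 20, 128, 120_000
paragraph_scores = torch.rand(T_len, S_len)            # from the first perception network
vocabulary_scores = torch.rand(T_len, V_len)           # from the second perception network
source_token_ids = torch.randint(0, V_len, (S_len,))   # vocabulary id of each decoded word vector (assumed mapping)

# Superposition: add each paragraph score onto the vocabulary entry of the
# corresponding source word (one possible preset rule).
comprehensive_scores = vocabulary_scores.clone()
index = source_token_ids.unsqueeze(0).repeat(T_len, 1)  # shape (T_len, S_len)
comprehensive_scores.scatter_add_(1, index, paragraph_scores)

# Answer generation: for each answer position, pick the highest-scoring entry,
# which may be a multilingual vocabulary word or a word copied from the
# decoded word vectors (whose entry was boosted by the paragraph score).
answer_token_ids = comprehensive_scores.argmax(dim=-1)  # (T_len,) vocabulary indices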
And S505, if the score is smaller than a preset score threshold value, generating an answer according to a preset reply template.
In this step, after scoring by the scoring decoding model, if the valid-answer score of the sentence coding vector is smaller than the preset score threshold, the sentence coding vector is considered unable to produce an answer that would satisfy the user. In this case a reply is selected directly from the preset reply template, for example: "The question you asked is beyond my understanding; please rephrase it or ask a different question." This is returned rather than giving the user a meaningless answer.
The present application provides a multilingual question-answering method: first, a question text input by a user is obtained; then a paragraph selection model and a multilingual vocabulary are used to determine the paragraphs to be selected corresponding to the question text from a preset resource library; and finally, an answer generation model and the multilingual vocabulary are used to determine the answer to the question text from the paragraphs to be selected. By using the multilingual vocabulary in both the paragraph retrieval stage and the answer generation stage, the method and apparatus avoid repeated translation, and thereby avoid the longer processing time and possible semantic drift that repeated translation introduces, both of which degrade answer quality and user experience. The method and apparatus achieve the technical effects of quickly generating multilingual answers, providing a convenient and efficient solution for the international deployment of intelligent question-answering systems or for use in multilingual environments, and sparing users from having to set a language mode or choose among intelligent question-answering products for different languages.
Fig. 9 is a schematic structural diagram of a multilingual question-answering apparatus according to an embodiment of the present application. As shown in fig. 9, the multilingual question-answering apparatus 900 according to the present embodiment includes:
an obtaining module 901, configured to obtain a question text input by a user;
a paragraph selection module 902, configured to determine a paragraph to be selected according to the question text by using a paragraph selection model and a multilingual vocabulary, where the multilingual vocabulary includes vocabularies of all languages related to the question text;
an answer generating module 903, configured to determine an answer to the question text from the to-be-selected paragraph by using an answer generating model and the multilingual vocabulary, where the answer output by the answer generating model matches with the language feature of the input to-be-selected paragraph.
In one possible design, the paragraph selection module 902 is configured to determine a paragraph to be selected according to the question text by using a paragraph selection model and a multilingual vocabulary, where the multilingual vocabulary includes vocabularies of all languages related to the question text; specifically:
the paragraph selection module 902 is configured to determine, by using a search model, a search result corresponding to the question text from the preset resource library;
the paragraph selection module 902 is further configured to determine a paragraph text according to the retrieval result and the multilingual vocabulary by using a preset word segmentation algorithm;
the paragraph selection module 902 is further configured to determine a first text vector according to the paragraph text and a combination algorithm;
the paragraph selection module 902 is further configured to determine a second text vector according to the first text vector and a paragraph selection model;
the paragraph selecting module 902 is further configured to determine the paragraph to be selected according to the second text vector and a preset filtering algorithm.
In a possible design, the paragraph selecting module 902 is further configured to determine the paragraph to be selected according to the second text vector and a preset filtering algorithm, where the determining includes:
the paragraph selection module 902 is further configured to extract a first word segmentation vector of the second text vector;
the paragraph selection module 902 is further configured to determine a selection probability of the paragraph text according to the first word segmentation vector and a preset decoding model;
the paragraph selecting module 902 is further configured to, if the selection probability is greater than or equal to a preset probability threshold, take the paragraph text as the to-be-selected paragraph.
Optionally, the paragraph selection module 902 is configured to determine, by using a search model, a search result corresponding to the question text from the preset resource library; specifically:
the paragraph selection module 902 is configured to perform content cleaning on the question text by using a question text cleaning algorithm to determine a cleaning question text meeting a preset format;
the paragraph selection module 902 is further configured to determine a retrieval result to be cleaned according to the cleaning question text and the retrieval model;
the paragraph selecting module 902 is further configured to determine the search result according to the search result to be cleaned and a result cleaning algorithm.
In one possible design, the paragraph selection module 902 is further configured to determine a search result to be cleaned according to the cleaning question text and the search model, and includes:
the paragraph selection module 902 is further configured to input the cleaning question text into a search engine to determine a web page address;
the paragraph selecting module 902 is further configured to use a content grabbing model to grab the content in the webpage corresponding to the webpage address as the to-be-cleaned retrieval result.
In one possible design, the paragraph selection module 902 is further configured to determine a search result to be cleaned according to the cleaning question text and the search model, and includes:
the paragraph selecting module 902 is further configured to screen an adapted answer file corresponding to the cleaning question text in a preset question and answer file library by using a matching algorithm;
the paragraph selecting module 902 is further configured to determine the search result to be cleaned according to the adapted answer file and the content grabbing model.
Optionally, the answer generating module 903 is configured to determine an answer to the question text from the paragraph to be selected by using an answer generating model and the multilingual vocabulary, and includes:
the answer generating module 903 is configured to combine the question text and the paragraph to be selected to generate a third text vector;
the answer generating module 903 is further configured to determine a sentence encoding vector according to the third text vector and a language representation model, where the language representation model is matched with the multilingual vocabulary;
the answer generating module 903 is further configured to determine scores of the sentence coding vectors according to the sentence coding vectors and a score decoding model;
the answer generating module 903 is further configured to determine the answer corresponding to the sentence coding vector by using the answer generating model and the multi-language vocabulary if the score is greater than or equal to a preset score threshold;
the answer generating module 903 is further configured to generate the answer according to a preset reply template if the score is smaller than the preset score threshold.
In one possible design, the answer generation module 903 is further configured to determine scores of the sentence coding vectors according to the sentence coding vectors and a score decoding model, and includes:
the answer generating module 903 is further configured to extract a second word segmentation vector located at the head of the sentence coding vector;
the answer generating module 903 is further configured to determine scores of the sentence coding vectors according to the second segmentation vectors and a score decoding model.
Optionally, the answer generating module 903 is further configured to determine the answer corresponding to the sentence coding vector by using the answer generating model and the multi-language vocabulary if the score is greater than or equal to a preset score threshold, where the determining includes:
if the score is greater than or equal to a preset score threshold, the answer generation module 903 is further configured to decode the sentence coding vector by using a conversion decoding model to determine a decoded word vector;
the answer generating module 903 is further configured to determine a paragraph score matrix according to the first multilayer perceptual network and the decoded word vector;
the answer generating module 903 is further configured to determine a word list score matrix according to a second multilayer perceptual network and the decoded word vector;
the answer generating module 903 is further configured to determine a comprehensive score matrix according to the paragraph score matrix and the vocabulary score matrix by using a superposition algorithm;
the answer generating module 903 is further configured to determine the answer according to the comprehensive score matrix by using the answer generating model, the sentence encoding vector, and the multilingual word list.
Optionally, after the obtaining module 901 obtains the question text input by the user, the apparatus further includes:
a preprocessing module 904, configured to perform word segmentation on the question text by using a word segmentation algorithm to determine a question text word vector;
the preprocessing module 904 is further configured to determine a language corresponding to the question text according to the vector of the question text and a language matching algorithm;
the preprocessing module 904 is further configured to determine the multilingual vocabulary from a multilingual vocabulary to be selected according to the language, where the multilingual vocabulary includes at least two languages.
It should be noted that the multilingual question-answering apparatus provided in the embodiment shown in fig. 9 can be used to execute the multilingual question-answering method provided in any of the above embodiments, and the specific implementation manner and technical effect are similar, and are not described herein again.
Fig. 10 is a schematic structural diagram of an electronic device according to an embodiment of the present application. As shown in fig. 10, the electronic device 1000 provided in this embodiment includes:
a processor 1001; and the number of the first and second groups,
a memory 1002 for storing executable instructions of the processor, where the memory may be, for example, a flash memory;
wherein the processor 1001 is configured to perform the steps of the above-described method via execution of the executable instructions. Reference may be made in particular to the description relating to the preceding method embodiment.
Alternatively, the memory 1002 may be separate or integrated with the processor 1001.
When the memory 1002 is a device independent of the processor 1001, the electronic device 1000 may further include:
a bus 1003 is used to connect the processor 1001 and the memory 1002.
The present embodiment also provides a readable storage medium, in which a computer program is stored, and when at least one processor of the electronic device executes the computer program, the electronic device executes the methods provided by the above various embodiments.
The present embodiment also provides a program product comprising a computer program stored in a readable storage medium. The computer program can be read from a readable storage medium by at least one processor of the electronic device, and the execution of the computer program by the at least one processor causes the electronic device to implement the methods provided by the various embodiments described above.
Those of ordinary skill in the art will understand that: all or a portion of the steps of implementing the above-described method embodiments may be performed by hardware associated with program instructions. The program may be stored in a computer-readable storage medium. When executed, the program performs steps comprising the method embodiments described above; and the aforementioned storage medium includes: various media that can store program codes, such as ROM, RAM, magnetic or optical disks.
Finally, it should be noted that: the above embodiments are only used for illustrating the technical solutions of the present application, and not for limiting the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; and the modifications or the substitutions do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present application.

Claims (14)

1. A multilingual question-answering method, comprising:
acquiring a question text input by a user, wherein the question text comprises at least one language text;
determining a paragraph to be selected corresponding to the question text from a preset resource library by using a paragraph selection model and a multilingual vocabulary, wherein the multilingual vocabulary is used for parsing the words of all languages in the question text, and the paragraph to be selected output by the paragraph selection model is matched with the language characteristics of the input question text;
and determining an answer of the question text from the paragraph to be selected by using an answer generation model and the multi-language word list, wherein the answer output by the answer generation model is matched with the language feature of the input paragraph to be selected.
2. The multilingual question-answering method of claim 1, wherein the determining, from a predetermined resource library, the to-be-selected paragraph corresponding to the question text using the paragraph selection model and the multilingual vocabulary includes:
determining a retrieval result corresponding to the question text from the preset resource library by using a retrieval model;
determining a paragraph text according to the retrieval result and the multilingual word list by using a preset word segmentation algorithm;
determining a first text vector according to the paragraph text and a combination algorithm;
determining a second text vector according to the first text vector and a paragraph selection model;
and determining the paragraph to be selected according to the second text vector and a preset screening algorithm.
3. The multilingual question-answering method of claim 2, wherein the determining the to-be-selected paragraph according to the second text vector and a preset filtering algorithm comprises:
extracting a first word segmentation vector of the second text vector;
determining the selection probability of the paragraph text according to the first word segmentation vector and a preset decoding model;
and if the selection probability is greater than or equal to a preset probability threshold, taking the paragraph text as the paragraph to be selected.
4. The multilingual question-answering method according to claim 2 or 3, wherein the determining, by using a retrieval model, a retrieval result corresponding to the question text from the preset resource library comprises:
cleaning the content of the question text by using a question text cleaning algorithm to determine the cleaning question text meeting a preset format;
determining a retrieval result to be cleaned according to the cleaning question text and the retrieval model;
and determining the retrieval result according to the retrieval result to be cleaned and a result cleaning algorithm.
5. The multilingual question-answering method of claim 4, wherein the determining a retrieval result to be cleaned according to the cleaning question text and the retrieval model comprises:
inputting the cleaning question text into a search engine to determine a webpage address;
and grabbing the content in the webpage corresponding to the webpage address by using a content grabbing model as the retrieval result to be cleaned.
6. The multilingual question-answering method of claim 4, wherein the determining a retrieval result to be cleaned according to the cleaning question text and the retrieval model comprises:
screening an adaptive answer file corresponding to the cleaning question text in a preset question and answer file library by using a matching algorithm;
and determining the retrieval result to be cleaned according to the adaptive answer file and the content grabbing model.
7. The multilingual question-answering method of claim 1, wherein the determining the answer to the question text from the to-be-selected passage using an answer generation model and the multilingual vocabulary comprises:
combining the question text and the paragraph to be selected to generate a third text vector;
determining sentence coding vectors according to the third text vectors and a language representation model, wherein the language representation model is matched with the multilingual word list;
determining the score of the sentence coding vector according to the sentence coding vector and a score decoding model;
if the score is larger than or equal to a preset score threshold value, determining the answer corresponding to the sentence coding vector by using the answer generation model and the multilingual word list;
and if the score is smaller than the preset score threshold value, generating the answer according to a preset reply template.
8. The multilingual question-answering method of claim 7, wherein the sentence coding vector comprises: paragraph participles, multilingual vocabulary participles, first attribute information, and second attribute information, wherein the first attribute information is attribute information corresponding to the paragraph participles, and the second attribute information is attribute information corresponding to the multilingual vocabulary participles.
9. The multilingual question-answering method of claim 8, wherein the determining the score of the sentence-coding vector according to the sentence-coding vector and a score-decoding model comprises:
extracting a second word segmentation vector positioned at the head of the sentence coding vector;
and determining the grade of the sentence coding vector according to the second word segmentation vector and a grading decoding model.
10. The multilingual question-answering method according to claim 8 or 9, wherein the determining the answer corresponding to the sentence-coding vector using the answer generation model and the multilingual vocabulary comprises:
decoding the sentence encoding vector by using a conversion decoding model to determine a decoding word vector;
determining a paragraph score matrix according to the first multilayer perception network and the decoding word vector;
determining a vocabulary score matrix according to a second multilayer perception network and the decoding word vector;
determining a comprehensive score matrix according to the paragraph score matrix and the vocabulary score matrix by using a superposition algorithm;
and determining the answer according to the comprehensive score matrix by using the answer generation model, the sentence coding vector and the multilingual word list.
11. The multilingual question-answering method according to any one of claims 1 to 3 or 7 to 9, further comprising, after the obtaining of the user-entered question text:
performing word segmentation on the question text by using a word segmentation algorithm to determine a question text word vector;
determining the language corresponding to the question text according to the vector of the question text and a language matching algorithm;
and determining the multilingual word list from the multilingual word list to be selected according to the languages, wherein the multilingual word list at least comprises two languages.
12. A multilingual question-answering apparatus, comprising:
the acquisition module is used for acquiring a question text input by a user;
a paragraph selection module, configured to determine a to-be-selected paragraph corresponding to the question text from a preset resource library by using a paragraph selection model and a multilingual vocabulary, where the multilingual vocabulary is used to parse words of all languages in the question text, and the to-be-selected paragraph output by the paragraph selection model matches with the language features of the input question text;
and the answer generating module is used for determining the answer of the question text from the paragraph to be selected by utilizing an answer generating model and the multi-language word list, and the answer output by the answer generating model is matched with the language characteristics of the input paragraph to be selected.
13. A multilingual question-answering apparatus, comprising:
a processor; and the number of the first and second groups,
a memory for storing executable instructions of the processor;
wherein the processor is configured to perform the multilingual question-answering method of any of claims 1-11 via execution of the executable instructions.
14. A storage medium on which a computer program is stored, which program, when being executed by a processor, implements the multilingual question-answering method according to any one of claims 1 to 11.
CN202010728828.4A 2020-07-23 2020-07-23 Multilingual question and answer method and device, electronic equipment and storage medium Pending CN113971212A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010728828.4A CN113971212A (en) 2020-07-23 2020-07-23 Multilingual question and answer method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010728828.4A CN113971212A (en) 2020-07-23 2020-07-23 Multilingual question and answer method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN113971212A true CN113971212A (en) 2022-01-25

Family

ID=79584597

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010728828.4A Pending CN113971212A (en) 2020-07-23 2020-07-23 Multilingual question and answer method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113971212A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117093696A (en) * 2023-10-16 2023-11-21 浙江同花顺智能科技有限公司 Question text generation method, device, equipment and medium of large language model
CN117093696B (en) * 2023-10-16 2024-02-02 浙江同花顺智能科技有限公司 Question text generation method, device, equipment and medium of large language model

Similar Documents

Publication Publication Date Title
JP6618735B2 (en) Question answering system training apparatus and computer program therefor
JP6819990B2 (en) Dialogue system and computer programs for it
KR102256240B1 (en) Non-factoid question-and-answer system and method
CN107798140B (en) Dialog system construction method, semantic controlled response method and device
CN110096567B (en) QA knowledge base reasoning-based multi-round dialogue reply selection method and system
CN114116994A (en) Welcome robot dialogue method
CN111090727B (en) Language conversion processing method and device and dialect voice interaction system
WO2008107305A2 (en) Search-based word segmentation method and device for language without word boundary tag
JP2000353161A (en) Method and device for controlling style in generation of natural language
CN110674378A (en) Chinese semantic recognition method based on cosine similarity and minimum editing distance
CN100454294C (en) Apparatus and method for translating Japanese into Chinese and computer program product
JP2015045833A (en) Speech sentence generation device, and method and program for the same
CN113705237A (en) Relation extraction method and device fusing relation phrase knowledge and electronic equipment
CN112380866A (en) Text topic label generation method, terminal device and storage medium
CN111339772B (en) Russian text emotion analysis method, electronic device and storage medium
CN113239666A (en) Text similarity calculation method and system
JP5718405B2 (en) Utterance selection apparatus, method and program, dialogue apparatus and method
CN111859950A (en) Method for automatically generating lecture notes
CN113971212A (en) Multilingual question and answer method and device, electronic equipment and storage medium
JP6126965B2 (en) Utterance generation apparatus, method, and program
CN116595970A (en) Sentence synonymous rewriting method and device and electronic equipment
Chanda et al. Is Meta Embedding better than pre-trained word embedding to perform Sentiment Analysis for Dravidian Languages in Code-Mixed Text?
CN114661864A (en) Psychological consultation method and device based on controlled text generation and terminal equipment
JP4153843B2 (en) Natural sentence search device, natural sentence search method, natural sentence search program, and natural sentence search program storage medium
CN112885338A (en) Speech recognition method, apparatus, computer-readable storage medium, and program product

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination