CN108959387B - Information acquisition method and device - Google Patents

Information acquisition method and device Download PDF

Info

Publication number
CN108959387B
CN108959387B CN201810550859.8A CN201810550859A CN108959387B CN 108959387 B CN108959387 B CN 108959387B CN 201810550859 A CN201810550859 A CN 201810550859A CN 108959387 B CN108959387 B CN 108959387B
Authority
CN
China
Prior art keywords
text
structured
query
query text
matching
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810550859.8A
Other languages
Chinese (zh)
Other versions
CN108959387A (en
Inventor
马文涛
崔一鸣
齐乐
何苏
陈致鹏
王士进
胡国平
张宇
刘挺
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
iFlytek Co Ltd
Original Assignee
iFlytek Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by iFlytek Co Ltd filed Critical iFlytek Co Ltd
Priority to CN201810550859.8A priority Critical patent/CN108959387B/en
Publication of CN108959387A publication Critical patent/CN108959387A/en
Application granted granted Critical
Publication of CN108959387B publication Critical patent/CN108959387B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The embodiment of the invention provides an information acquisition method and device, and belongs to the technical field of computers. The method comprises the following steps: respectively acquiring the matching probability between each structured text and the query text, wherein the structured text is obtained by disassembling the explanatory document and is used for describing the information queried by the query text; and sequencing the matching probabilities from large to small, and selecting the structured texts corresponding to the matching probabilities in the preset number as the structured texts matched with the query text. Because the description document can be disassembled to obtain the structured text, the structured text matched with the query text can be selected according to the matching probability between the query text and each structured text and used as a query result, and manual query is not needed, the efficiency of obtaining information is improved. In addition, the query basis is the description document associated with the product function, so that the accuracy and reliability of obtaining the message can be improved.

Description

Information acquisition method and device
Technical Field
The embodiment of the invention relates to the technical field of computers, in particular to an information acquisition method and device.
Background
In daily life, when a product is used, the product may not be immediately skilled for use. Usually, a user needs to refer to the specification of the product to obtain related information of the product function, so as to know how to use a certain function of the product. In the related art, two information acquisition modes are generally provided, the first mode is that a user searches an electronic specification by himself, namely, the user refers to each product function one by one in the electronic specification until finding out the product function information required by the user. Because the number of items of the product functions is usually more, the efficiency of looking up one by one is lower.
Disclosure of Invention
In order to solve the above problems, embodiments of the present invention provide an information acquisition method and apparatus that overcome the above problems or at least partially solve the above problems.
According to a first aspect of the embodiments of the present invention, there is provided an information acquisition method, including:
respectively acquiring the matching probability between each structured text and the query text, wherein the structured text is obtained by disassembling the explanatory document and is used for describing the information queried by the query text;
and sequencing the matching probabilities from large to small, and selecting the structured texts corresponding to the matching probabilities in the preset number as the structured texts matched with the query text.
According to the method provided by the embodiment of the invention, the matching probability between each structured text and the query text is respectively obtained. And sequencing the matching probabilities from large to small, and selecting the structured texts corresponding to the matching probabilities in the preset number as the structured texts matched with the query text. Because the description document can be disassembled to obtain the structured text, the structured text matched with the query text can be selected according to the matching probability between the query text and each structured text and used as a query result, and manual query is not needed, the efficiency of obtaining information is improved. In addition, the query basis is the description document associated with the product function, so that the accuracy and reliability of obtaining the message can be improved.
According to a second aspect of the embodiments of the present invention, there is provided an information acquisition apparatus including:
the first acquisition module is used for respectively acquiring the matching probability between each structured text and the query text, wherein the structured text is obtained by disassembling the description document and is used for describing the information queried by the query text;
and the selecting module is used for sequencing the matching probabilities from large to small, and selecting the structured texts corresponding to the preset number of matching probabilities as the structured texts matched with the query text.
According to a third aspect of embodiments of the present invention, there is provided an information acquisition apparatus including:
at least one processor; and
at least one memory communicatively coupled to the processor, wherein:
the memory stores program instructions executable by the processor, the processor calling the program instructions being capable of performing the information obtaining method provided by any of the various possible implementations of the first aspect.
According to a fourth aspect of the present invention, there is provided a non-transitory computer readable storage medium storing computer instructions for causing a computer to perform the information acquisition method provided in any one of the various possible implementations of the first aspect.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory and are not restrictive of embodiments of the invention.
Drawings
Fig. 1 is a schematic flow chart of an information acquisition method according to an embodiment of the present invention;
FIG. 2 is a diagram illustrating contents of a structured document according to an embodiment of the present invention;
FIG. 3 is a schematic structural diagram of a keyword prediction model according to an embodiment of the present invention;
FIG. 4 is a schematic structural diagram of a problem recognition model according to an embodiment of the present invention;
fig. 5 is a schematic flowchart of an information obtaining method according to an embodiment of the present invention;
FIG. 6 is a schematic structural diagram of a probability calculation model according to an embodiment of the present invention;
FIG. 7 is a block diagram of an information acquisition apparatus according to an embodiment of the present invention;
fig. 8 is a block diagram of an information acquisition apparatus according to an embodiment of the present invention.
Detailed Description
The following describes embodiments of the present invention in further detail with reference to the drawings and examples. The following examples are intended to illustrate the examples of the present invention, but are not intended to limit the scope of the examples of the present invention.
At present, in the field of automobiles or electronic products, when a user uses a certain function, the user generally refers to a specification of how the function is used to obtain product function information. In view of the above situation, an embodiment of the present invention provides an information acquisition method. The method can be applied to various products, and the product forms can be electronic products, living goods, vehicles and the like, which are not particularly limited in the embodiment of the invention. In order to facilitate understanding, the method provided by the embodiment of the invention is explained by taking a product as an automobile as an example. Accordingly, the execution subject of the method may be a processor or the like on an on-board system or on-board equipment, and this is not particularly limited in the embodiment of the present invention. Referring to fig. 1, the method includes: 101. respectively acquiring the matching probability between each structured text and the query text, wherein the structured text is obtained by disassembling the explanatory document and is used for describing the information queried by the query text; 102. and sequencing the matching probabilities from large to small, and selecting the structured texts corresponding to the matching probabilities in the preset number as the structured texts matched with the query text.
In the step 101, the query text is obtained by performing speech recognition on a speech signal acquired when the user asks a question; the description document is used to describe the functions of the product, and the description document may be an electronic document or an electronic document obtained by digitizing a paper specification, which is not specifically limited in this embodiment of the present invention. For example, in the specification document of the automobile, generally, a large number of functions are provided in the automobile, such as a driving-related function and an air conditioning function, and the specification document also describes a specific description of each function. In addition, before step 101 is executed, the description document may be broken down into a plurality of structured texts for each function described in the description document. Each structured document may correspond to a function of the product and is used to record a specific instruction corresponding to the function.
It should be noted that, in order to enable the user to accurately locate the product function information of the product function to be queried subsequently, when the description document is disassembled, the function with the smallest function granularity in the description document can be disassembled according to the division of the function granularity. For example, the air conditioning function in the automobile can be further divided into two functions, namely an air conditioning refrigeration adjusting function and an air conditioning heating adjusting function, which have the smallest granularity. Accordingly, each structured text corresponds to a function with the smallest granularity of the function. Meanwhile, if each structured text corresponds to one function title, the titles in each structured text are the titles at the lowest level. It should be noted that, if the method provided by the embodiment of the present invention is implemented by a product function guidance system, the function of disassembling the product description document may be implemented by a description document structured disassembling module under the product function guidance system.
In addition, the structured text corresponding to each function may contain a plurality of fields. For example, for an air conditioning and cooling adjustment function on an automobile, the structured text corresponding to the function may include a plurality of fields, such as a document title, main content (operation or function description), a prompt or warning, a notice, a corresponding picture list, and the like. As shown in fig. 2, fig. 2 is a structural text corresponding to the air-conditioning cooling adjustment function, and the structural text includes four fields, which are "air-conditioning cooling operation" (i.e. corresponding to the document title), "air quality assurance with ignition key … …" (i.e. corresponding to the main content under the document title), "important prompt: at … … air conditioning "(i.e., corresponding to a reminder or warning that the air conditioning refrigeration conditioning function is in use)," note: 1. natural ventilation at tepid … … "(i.e., corresponding to the attention of the air conditioning refrigeration conditioning function when in use).
It should be noted that, in addition to the five types of fields divided as described above, other types of fields may be included in the structured text, and this is not specifically limited in the embodiment of the present invention. In addition, if the structured text in the document only contains several fields of the five types of fields, the structured text can also contain the five types of fields by filling a default value of 'NULL', so that the formats of all the structured texts are uniform, and the structured text is convenient to be uniformly processed subsequently. Wherein the content in the padding field is empty.
As can be seen from the foregoing, the explanatory document can be disassembled into a plurality of structured texts. In order to find structured text associated with the query text that is being asked by the user, the probability of a match between each structured text and the query text may be obtained separately. The structured text related to the query text of the user query refers to the structured text containing the information that the user query wants to query. For any structural text, the embodiment of the present invention does not specifically limit the manner of obtaining the matching probability between the structural text and the query text, including but not limited to: the structured text and the query text are input into a probability calculation model, thereby outputting a matching probability between the structured text and the query text.
Wherein, the probability calculation model can be obtained by pre-training. Specifically, the sample structured text and the sample query text may be used as input of an initial model, an actual matching result between the sample structured text and the sample query text may be used as output of the initial model, and the initial model is trained to obtain a probability calculation model. The probability of a match between each structured text in the explanatory document and the query text can be obtained by a probabilistic calculation model. After obtaining the matching probability between each structured text and the query text, the structured text corresponding to the maximum matching probability can be used as the structured text matched with the query text. Where the maximum match probability indicates that the structured text is most likely to match the query text.
In actual implementation, of course, not only the structured text corresponding to the maximum matching probability may be taken as the structured text matched with the query text, but also each matching probability may be sorted from large to small, and the structured texts corresponding to the preset number of matching probabilities are taken as the structured text matched with the query text. At this time, a plurality of structured texts may be selected as the structured texts matched with the query text. It should be noted that the function of determining the structured text matching the query text may be implemented by a function calculation module under the product function instruction system.
After determining the structured text that matches the query text, keywords in the query text can be determined, so that usage guidance information can be subsequently extracted from the structured text based on the keywords for reference by the user when using the product. The keyword prediction model may be trained in advance before determining keywords in the query text. Specifically, the training text may be pre-segmented to obtain a corresponding training segmentation sequence, a weight is labeled for each segmentation in the training segmentation sequence, the training segmentation sequence is used as an input of an initial model, the weight of each segmentation is used as an output of the initial model, and a keyword prediction model may be obtained through training. The structure of the keyword prediction model can refer to fig. 3. After a keyword prediction model is obtained through training, segmenting a query text, inputting a text sequence obtained after segmenting into the keyword prediction model, outputting the weight corresponding to each segmentation in the query text, and selecting the segmentation with the weight larger than a preset threshold value as a keyword; or, the weights corresponding to the participles are ranked from large to small, and then the top preset number of the participles are selected as the keywords. The function of determining the keywords in the query text can be realized by a keyword calculation module under the product function guidance system.
It should be noted that, when parameters are input to the keyword prediction model, the number of the input participles may be a preset number. Specifically, when the number of the participles in the input text sequence is smaller than the preset number, the number of the participles in the participle sequence can be filled to the preset number in a filling mode, so that the filled participle sequence is input to the keyword prediction model to output the keywords in the participle sequence. The filling mode may be NULL filling, and the filled word segmentation content is empty. When the number of the participles in the input text sequence is larger than a preset threshold value, if one participle is added, the last participle in the participle sequence can be cut off in a truncation mode, so that the participle sequence with the last participle cut off is input into the keyword prediction model, and the keywords in the participle sequence are output.
The preset number can be expressed by max-len, and the value of the max-len can be continuously adjusted in the training process, so that the number of participles in a participle sequence input to the keyword prediction model is always close to the value of the max-len. After determining the keywords in the query text, the usage guidance information may be selected from the structured text based on the keywords in the query text. The function of selecting the use guidance information from the structured text can be realized by a guidance information conversion module under the product function guidance system.
For example, the query text is "what air conditioning needs to be attended to" as an example. If the keyword of the query text is determined to be "attention" and the content of the structured text is as shown in fig. 2, the following use guidance information can be selected from the structured text: "Note: "content in field: "1, under warm conditions, air … … should be selected to be naturally ventilated with fresh air circulating outside. After the use guidance information is selected, a guidance process can be initiated based on the use guidance information, and the use guidance information is broadcasted to the user in a voice broadcast mode. It should be noted that the function of selecting the use guidance information from the structured text may be implemented by a guidance information conversion module under the product function guidance system.
According to the method provided by the embodiment of the invention, the matching probability between each structured text and the query text is respectively obtained. And sequencing the matching probabilities from large to small, and selecting the structured texts corresponding to the matching probabilities in the preset number as the structured texts matched with the query text. Because the description document can be disassembled to obtain the structured text, the structured text matched with the query text can be selected according to the matching probability between the query text and each structured text and used as a query result, and manual query is not needed, the efficiency of obtaining information is improved. In addition, the query basis is the description document associated with the product function, so that the accuracy and reliability of obtaining the message can be improved.
It is contemplated that the content of the question may not be relevant to the explanatory document when the user initiates the question. For example, if the query text of the user initiating the query is "how to go from top to west", the user queries a specific navigation route, and the description document records how each function in the vehicle navigation system is used, that is, records how to use the navigation instead of the specific navigation route, that is, the description document is irrelevant to the query text. In the above case, before the above step 101 is performed, it may be determined whether the query text is related to the explanatory document. When the two are correlated, step 101 is executed. Otherwise, the user can be reminded that the query does not belong to the category of the function query, and the navigation function can be switched to provide the navigation route for the user.
Based on the content of the foregoing embodiment, as an optional embodiment, the method further includes: inputting a text sequence into a first question recognition model, outputting a correlation degree value between the text sequence and an explanatory document, if the correlation degree value is greater than a preset threshold value, determining that the content between a query text and the explanatory document is related, wherein the text sequence is obtained by segmenting the query text, and the first question recognition model is obtained by training a sample text sequence and the correlation degree value between the sample text sequence and the explanatory document; or inputting the text sequence into a second question recognition model, and outputting a correlation result between the query text and the explanatory document, wherein the second question recognition model is obtained by training the sample text sequence and the correlation result between the sample text sequence and the explanatory document, and the text sequence is obtained by segmenting the query text.
The first problem identification model may be a classifier based on a convolutional neural network, and the correlation degree value between the text sequence and the explanatory document may be a probability value of correlation between the text sequence and the explanatory document, which is not specifically limited in this embodiment of the present invention. The first problem recognition model may be trained in advance before the above process is performed. Specifically, a large number of sample text sequences may be prepared in advance, and each sample text sequence may be labeled. Wherein, the sample text sequence related to the description document can be marked as 1, and the sample text sequence unrelated to the description document can be marked as 0. And inputting each sample text sequence and the corresponding label into the initial model, thereby training to obtain a first problem recognition model.
For example, the sample text sequence is exemplified by "navigate-how-use" and "navigate-to-west flag". For the sample text sequence "navigation-how-to-use", since the sample text sequence is related to the explanatory document by inquiring how to use the navigation function, the sample text sequence is labeled as 1. For the sample text sequence "navigate-to-west two flag", since the sample text sequence is the query navigation route and is not related to the description document, the corresponding label of the sample text sequence is 0. And training to obtain a first problem recognition model based on the sample text sequence and the corresponding label.
Taking the sample text sequence as "air conditioning-cooling-how-operating-tweed", the structure of the first problem identification model may be as shown in fig. 4. In fig. 4, the embedded Layer, the volumetric Layer, the Pooling Layer, and the Output Layer are respectively shown as an embedded Layer, a capacitive Layer, and an Output Layer.
After the first problem recognition model is obtained through training, the text sequence can be input into the first problem recognition model, and therefore the correlation degree value between the text sequence and the explanatory document is output. And if the correlation degree value is larger than a preset threshold value, determining that the query text is related to the description document. Wherein, the text sequence is obtained by segmenting the query text. The preset threshold may be set according to a requirement, for example, 0.5, and this is not particularly limited in the embodiment of the present invention.
The process mainly comprises the steps of firstly calculating the correlation degree value between the text sequence and the explanatory document, and then determining whether the text sequence is correlated with the explanatory document or not based on the correlation degree value. Of course, it is also possible to directly input the text sequence into the second question recognition model and directly output the correlation result between the query text and the description document, that is, directly output whether the text sequence is correlated or uncorrelated with the description document. The second problem recognition model is obtained by training the sample text sequence and the correlation result between the sample text sequence and the explanatory document, and the process of obtaining the second problem recognition model can refer to the training mode of the first problem recognition model, which is not described herein again. It should be noted that, in any way, the function of identifying whether the query text and the description document are related to each other may be implemented by the question identification module under the product function guidance system.
According to the method provided by the embodiment of the invention, the text sequence is input into the first problem identification model, the correlation degree value between the text sequence and the explanatory document is output, and if the correlation degree value is greater than a preset threshold value, the correlation of the contents between the query text and the explanatory document is determined; alternatively, the text sequence is input to the second question recognition model, and the correlation result between the query text and the explanatory document is output. Because whether the query text of the user query is related to the description document or not can be determined, the query which is not related to the description document can be found in time, and the user is prompted to query in other modes. Therefore, besides guiding the user in product function, the user can also recognize and give a prompt for questions about other uses, and the user experience is improved.
Considering that there may be words in the query text that have different meanings but are equivalent to each other, such as "antilock brake system", "antilock" and "ABS", the four words have the same meaning but different contents. In order to facilitate the subsequent processing of the query text, some participles in the query text can be unified into preset target words with the same meaning. In view of the above situation, based on the content of the foregoing embodiment, as an alternative embodiment, an embodiment of the present invention further provides a method for querying unification of word segmentation in a text, including: if the word segmentation in the text sequence needs to be replaced, acquiring a target word corresponding to the word segmentation needing to be replaced, and updating the query text based on the target word; the word segmentation to be replaced is different from the target word in content and has the same meaning, and the text sequence is obtained by segmenting the query text.
Before the above process is performed, word segmentation dictionaries with the same meaning but different contents can be established in advance. Wherein, each group of the participles with the same meaning but different contents can form a record in the participle dictionary. For example, taking "anti-lock brake system", "anti-lock" and "ABS" as examples (all being words in the same field and meaning equivalent), the above four participles may be composed as the following records in a participle dictionary: { "anti-lock brake system" [ "anti-lock", "ABS" ] }. The term "antilock braking system" is a standard field term in the automotive field, i.e., a target term.
When the text sequence includes "anti-lock", "anti-lock" or "ABS", that is, when there is a word segmentation in the text sequence that needs to be replaced, the "anti-lock", "anti-lock" and "ABS" in the query text may be updated to "anti-lock braking system". Taking the query text as "what the antilock brake system has a role" as an example, the query text may become "what the antilock brake system has a role" through the above-described updating process. It should be noted that the function of updating the participles in the query text into target words may be implemented by the domain word processing module under the product function instruction system.
According to the method provided by the embodiment of the invention, when the word segmentation in the text sequence needs to be replaced, the target word corresponding to the word segmentation needing to be replaced is obtained, and the query text is updated based on the target word. The participles in the query text can be unified into preset target words with the same meaning, so that the query text can be uniformly processed in the follow-up process.
As can be seen from the above embodiments, the structured text may be composed of a plurality of fields. Consider that some fields, such as title text and body content, will be present in every structured text. Based on the content of the above embodiment, as an alternative embodiment, the structured text may at least include a title and a body content; accordingly, the embodiment of the present invention does not specifically limit the way of obtaining the matching probability between each structured text and the query text. Referring to fig. 5, the method specifically includes: 501. for any one structured text, acquiring a first similarity between the query text and a title in any one structured text and a second similarity between the query text and body content in any one structured text; 502. and fusing the basic features, the first similarity and the second similarity between the query text and any one of the structured texts to obtain the matching probability between any one of the structured texts and the query text.
The process of obtaining the matching probability in steps 501 to 502 can be completed by a probability calculation model. The probability calculation model can be divided into an input layer, a representation layer, a fusion layer and an output layer according to functional division. The input layer is used for inputting query text, a title and text content. For convenience of uniform processing, when the three parameters are input, the input layer may be the same as the keyword prediction model in the above embodiment, that is, in a supplementing or truncating manner, so as to ensure that each parameter is equal in length when the three parameters are input each time.
The presentation layer can respectively encode the query text, the title and the body content into corresponding vectors through a bidirectional recurrent neural network, and calculate the first similarity between the query text and the title according to the corresponding vectors of the query text and the title; and calculating a second similarity between the query text and the body content according to the vectors corresponding to the query text and the body content. To facilitate the description of the subsequent calculation process, the query text, the title and the body content are respectively encoded into corresponding vectors, which are respectively represented by Q, T, D, and a first similarity between the query text and the title can be represented by f1Indicating that a second similarity between the query text and the body content can be expressed as f2And (4) showing. Wherein the first similarity f1Degree of similarity to f2The types of (d) may all be cosine similarity.
The fusion layer is used for fusing the first similarity, the second similarity and the basic characteristics between the query text and the structured text, so as to obtain the matching probability between the structured text and the query text. Wherein the base features can be used to represent word and semantic level related features between the query text and the structured text. When the fusion layer fuses the first similarity, the second similarity and the basic features between the query text and the structured text, the first similarity, the second similarity and the basic features can be weighted and summed, and after a weighted and summed result is obtained, the weighted and summed result can be normalized through a Sigmoid function, so that the matching probability between the structured text and the query text is obtained. The output layer is used for outputting the matching probability between the structured text and the query text.
The structure of the probability calculation model can refer to fig. 6, and the Sigmoid function can refer to the following formula:
Figure GDA0001728141050000111
in the above formula, x is a weighted summation result obtained by performing weighted summation on the first similarity, the second similarity and the basic feature.
According to the method provided by the embodiment of the invention, for any structured text at least comprising the title and the body content, the first similarity between the query text and the title in the structured text and the second similarity between the query text and the body content in the structured text are obtained. And fusing the basic features, the first similarity and the second similarity between the query text and the structured text to obtain the matching probability between the structured text and the query text. The similarity between the titles of the query text and the structured text, the similarity between the body contents of the query text and the structured text and the basic characteristics between the query text and the structured text can be combined to calculate the matching probability, so that the determination result can be more accurate when the structured text matched with the query text is determined subsequently.
As can be seen from the above description of the embodiments, the base features can be used to represent word and semantic level related features between the query text and the structured text. Based on the content of the above embodiment, as an alternative embodiment, for any one of the structured texts, the basic feature includes at least one of the following three information, which are the matching score between the query text and the structured text, the weighted matching value between the query text and the structured text, and the vector similarity between the query text and the structured text.
Accordingly, the basic feature between the query text and the structured text can also be obtained before the basic feature, the first similarity and the second similarity between the query text and the structured text are fused. The embodiment of the present invention does not specifically limit the manner of obtaining the basic features between the query text and the structured text, and includes but is not limited to: calculating the matching score between each morpheme and the title in the query text, performing morpheme matching on the query text and the structured text according to the matching score corresponding to each morpheme, and taking the matching score between the query text and the structured text as one piece of information in basic characteristics; and/or the presence of a gas in the gas,
acquiring the weight of each participle in the query text, acquiring a weighted matching value between the query text and the structured text according to the weight of each participle and a matching result between each participle and a title, and taking the weighted matching value as one information in basic characteristics; and/or the presence of a gas in the gas,
acquiring a text vector corresponding to the query text according to the weight and the word vector of each participle in the query text, acquiring the vector similarity between the query text and the structured text according to the text vector and the title weighted vector corresponding to the title, and taking the vector similarity as one of the basic characteristics.
It should be noted that, the above-mentioned contents are processes for acquiring three kinds of information in the basic feature, and the three kinds of information acquisition processes in actual implementation may be correspondingly executed according to information actually included in the basic feature, which is not specifically limited in the embodiment of the present invention. For example, if the basic feature only includes the matching score and the weighted matching value, the process of obtaining the matching score and the weighted matching value is only required to be performed.
For the above process of obtaining the matching score, the embodiment of the present invention does not specifically limit the way of calculating the matching score between each morpheme and the title in the query text, and includes but is not limited to: for any morpheme in the query text, obtaining a matching result between a morpheme combination related to the morpheme and the title, wherein the morpheme combination is determined based on adjacent morphemes of the morpheme in the query text; and calculating the matching score between any morpheme and the title according to the matching result of each morpheme combination.
For example, the query text is "how to operate the vehicle air conditioner". For the morpheme "null" in the query text, the morpheme combinations related to the morpheme are respectively the combination of the "null" word itself, the "null" and the previous word, and the combination of the "null" and the previous two words is also the "null", "vehicle empty" or "vehicle empty". After determining the morpheme combinations related to the morpheme "null", the matching result between each morpheme combination and the title can be obtained. The matching result may be represented by 1 and 0, the matching success result is 1, and the matching failure result is 0.
Taking the title of "air conditioning and refrigeration" as an example, since the morpheme combination "empty" and the title can be completely matched, the matching result of "empty" can be determined to be 1. The morpheme group "vehicle empty" and the title cannot be completely matched, so that it can be determined that the matching result of "vehicle empty" is 0. The morpheme combination "car empty" and the title cannot be completely matched, so that it can be determined that the matching result of "car empty" is 0.
In calculating the matching score between the morpheme "null" and the title, it can be calculated by the following formula:
Figure GDA0001728141050000131
wherein, g1For the matching result between morpheme combination "null" and title, g2For the matching result between morpheme combination "car empty" and title, g3The matching result between the morpheme combination "car empty" and the title is obtained. As can be seen from the above, g1、g2And g3The values of (1) and (0) are respectively 1, 0 and 0, so that the score can be calculatedwIs 1/6, i.e. 0.16666. Similarly, the matching score between each morpheme and the title in the query text can be calculated according to the above method.
After the matching score between each morpheme and the title in the query text is obtained, all the matching scores can be summed, the sum result is divided by the number of the morphemes in the query text, and the obtained ratio is the matching score between the query text and the structured text corresponding to the title. For convenience of description of the subsequent calculation process, the matching score calculated by the above process can be used as f3Expressed and used as a piece of information in the underlying features.
For the above process of obtaining the weighted matching value, the embodiment of the present invention does not specifically limit the way of obtaining the weight of each participle in the query text, which includes but is not limited to: inputting the word segmentation sequence into a keyword prediction model, and outputting the weight of each word segmentation in the query text, wherein the keyword prediction model is obtained by training sample words and weights labeled corresponding to the sample words, and the text sequence is obtained by segmenting the query text. The content of the above embodiment can be referred to in the training process and the using process of the keyword prediction model, and the structure of the keyword prediction model can be referred to in fig. 3, which is not described herein again.
After the weight of each participle in the query text is obtained, the matching result between each participle and the title can be determined. The matching result may be represented by 1 and 0, the matching success result is 1, and the matching failure result is 0. Taking the query text as "navigation how to use" and the title "navigation use" as an example, the participles in the query text are "navigation", "how" and "use", respectively. Since the word segmentation "navigation" and the title "navigation use" are matched with each other, the matching result is 1. There is no match between the word segmentation "how" and the title "navigation usage", so that the matching result is 0. The word segmentation "use" is matched with the title "navigation use", so that the matching result is 1. If the weight of the participle "navigation" is 0.6, the weight of the participle "what" is 0.1, and the participle "is used"is weighted 0.3, so that the weighted match value between the query text and the structured text corresponding to the title is (0.6 × 1+0.1 × 0+0.3 × 1) ═ 0.9. For convenience of description of subsequent calculation processes, the weighted matching value calculated by the above process can be used as f4Expressed and used as a piece of information in the underlying features.
For the above process of obtaining the similarity of vectors, the word vector of each word segment in the query text may be obtained first. Specifically, a word vector dictionary of all words can be constructed in advance, and the word vector of each participle in the query text can be obtained by querying the word vector dictionary. In addition, the weight of each word segmentation can be obtained through a keyword prediction model, and the specific process can refer to the content of the above embodiment, which is not described herein again. Weighting and summing the word vector and the weight of each participle in the query text to obtain a text vector corresponding to the query text
Figure GDA0001728141050000151
After the text vector corresponding to the query text is obtained, the vector similarity between the query text and the structured text can be obtained based on the text vector and the title weighting vector corresponding to the title. The title weighting vector may be obtained by weighting a word segmentation vector of each word segmentation in the title and a weight corresponding to each word segmentation vector, and the vector similarity may be a cosine similarity between the title vector and the text vector.
For the convenience of subsequent calculation, after the cosine similarity between the title vector and the text vector is obtained through calculation, the cosine similarity value can be normalized to be between 0 and 1 through linear calculation, and thus the normalization result is used as the vector similarity between the query text and the structured text. To facilitate the description of the subsequent computation process, the vector similarity between the query text and the structured text can be used as f5Expressed and used as a piece of information in the underlying features. It should be noted that the function of extracting the basic features may be implemented by a key feature extraction module under the product function guidance system.
In the above embodiments, after obtaining the first similarity, the second similarity and the basic feature, the three methods can be performedAnd fusing to obtain the matching probability between the structured text and the query text. Specifically, the three kinds of information f described above are contained in the basic feature3、f4、f5For example, the first similarity f may be determined first1Second degree of similarity f2Three kinds of information f3、f4、f5The respective weights. Wherein f is1、f2、f3、f4、f5The respective weights may be represented by W ═ W1,w2,w3,w4,w5]And (4) performing representation. Will f is1、f2、f3、f4、f5And w1,w2,w3,w4,w5And carrying out weighted summation, and normalizing the weighted summation result through a Sigmoid function, so as to obtain the matching probability between the structured text and the query text.
The above weighted summation process can refer to the structural diagram of the probability calculation model in fig. 6. The probabilistic calculation model can be obtained by training a large amount of training query texts, training structured texts and matching results between the training query texts and the training structured texts. In addition, in the training process, various parameters in an input layer, a representation layer and a fusion layer in the probability calculation model can be continuously optimized until the various parameters reach the optimal state.
According to the method provided by the embodiment of the invention, various information used for expressing the association degree of words and semantics between the query text and the structured text is used as the basic characteristics, and the basic characteristics are used as the basis for subsequently determining the matching probability between the structured text and the query text, so that the determination result is more accurate when the structured text matched with the query text is subsequently determined.
It should be noted that, all the above-mentioned alternative embodiments may be combined arbitrarily to form alternative embodiments of the present invention, and are not described in detail herein.
Based on the content of the foregoing embodiments, an embodiment of the present invention provides an information acquisition apparatus for executing the information acquisition method provided in the foregoing method embodiment. Referring to fig. 7, the apparatus includes:
a first obtaining module 701, configured to obtain matching probabilities between each structured text and a query text, where the structured text is obtained by disassembling an explanatory document, and the structured text is used to describe information queried by the query text;
a selecting module 702, configured to sort the matching probabilities from large to small, and select a structural text corresponding to a preset number of matching probabilities as a structural text matched with the query text.
As an alternative embodiment, the apparatus further comprises:
the first judging module is used for inputting the text sequence into the first question recognition model, outputting a correlation degree value between the text sequence and the explanatory document, if the correlation degree value is larger than a preset threshold value, determining that the content between the query text and the explanatory document is correlated, wherein the text sequence is obtained after the query text is segmented, and the first question recognition model is obtained after a sample text sequence and the correlation degree value between the sample text sequence and the explanatory document are trained; or,
and the second judgment module is used for inputting the text sequence into a second question recognition model and outputting a correlation result between the query text and the explanatory document, the second question recognition model is obtained by training the sample text sequence and the correlation result between the sample text sequence and the explanatory document, and the text sequence is obtained by segmenting the query text.
As an alternative embodiment, the apparatus further comprises:
the updating module is used for acquiring a target word corresponding to the participle to be replaced when the participle in the text sequence needs to be replaced, and updating the query text based on the target word; wherein, the word to be replaced is different from the target word and has the same meaning, and the text sequence is obtained by dividing the query text
As an alternative embodiment, the structured text comprises at least a title and body content; accordingly, the first obtaining module 701 is configured to obtain, for any one of the structured texts, a first similarity between the query text and a title in any one of the structured texts, and a second similarity between the query text and body content in any one of the structured texts; and fusing the basic features, the first similarity and the second similarity between the query text and any one of the structured texts to obtain the matching probability between any one of the structured texts and the query text.
As an alternative embodiment, the basic feature includes at least one of the following three information, which are the matching score between the query text and any one of the structured texts, the weighted matching value between the query text and any one of the structured texts, and the vector similarity between the query text and any one of the structured texts.
As an alternative embodiment, the apparatus further comprises:
the matching module is used for calculating the matching score between each morpheme and the title in the query text, performing morpheme matching on the query text and any structured text according to the matching score corresponding to each morpheme, and taking the matching score between the query text and any structured text as one piece of information in basic characteristics; and/or the presence of a gas in the gas,
the second acquisition module is used for acquiring the weight of each participle in the query text, acquiring a weighted matching value between the query text and any structured text according to the weight of each participle and a matching result between each participle and a title, and taking the weighted matching value as one piece of information in basic characteristics; and/or the presence of a gas in the gas,
and the third acquisition module is used for acquiring a text vector corresponding to the query text according to the weight of each participle in the query text and the word vector, acquiring the vector similarity between the query text and any structured text according to the text vector and the title weighting vector corresponding to the title, and taking the vector similarity as one of the basic features.
As an optional embodiment, the second obtaining module is configured to input the word segmentation sequence into a keyword prediction model, and output a weight of each word segmentation in the query text, where the keyword prediction model is obtained by training sample words and weights labeled corresponding to the sample words, and the text sequence is obtained by performing word segmentation on the query text.
As an optional embodiment, the matching module is configured to, for any morpheme in the query text, obtain a matching result between a morpheme combination related to the any morpheme and the title, where the morpheme combination is determined based on adjacent morphemes of the any morpheme in the query text; and calculating the matching score between any morpheme and the title according to the matching result of each morpheme combination.
The device provided by the embodiment of the invention respectively obtains the matching probability between each structured text and the query text. And sequencing the matching probabilities from large to small, and selecting the structured texts corresponding to the matching probabilities in the preset number as the structured texts matched with the query text. Because the description document can be disassembled to obtain the structured text, the structured text matched with the query text can be selected according to the matching probability between the query text and each structured text and used as a query result, and manual query is not needed, the efficiency of obtaining information is improved. In addition, the query basis is the description document associated with the product function, so that the accuracy and reliability of obtaining the message can be improved.
Secondly, outputting a correlation degree value between the text sequence and the explanatory document by inputting the text sequence into the first problem recognition model, and if the correlation degree value is larger than a preset threshold value, determining that the content between the query text and the explanatory document is related; alternatively, the text sequence is input to the second question recognition model, and the correlation result between the query text and the explanatory document is output. Because whether the query text of the user query is related to the description document or not can be determined, the query which is not related to the description document can be found in time, and the user is prompted to query in other modes. Therefore, besides guiding the user in product function, the user can also recognize and give a prompt for questions about other uses, and the user experience is improved.
And thirdly, for any structured text at least comprising a title and body content, acquiring a first similarity between the query text and the title in the structured text and a second similarity between the query text and the body content in the structured text. And fusing the basic features, the first similarity and the second similarity between the query text and the structured text to obtain the matching probability between the structured text and the query text. The similarity between the titles of the query text and the structured text, the similarity between the body contents of the query text and the structured text and the basic characteristics between the query text and the structured text can be combined to calculate the matching probability, so that the determination result can be more accurate when the structured text matched with the query text is determined subsequently.
In this way, by using various information for representing the word and semantic relevance between the query text and the structured text as the basic features and using the basic features as the basis for subsequently determining the matching probability between the structured text and the query text, the determination result in the subsequent determination of the structured text matching with the query text can be more accurate.
The embodiment of the invention provides information acquisition equipment. Referring to fig. 8, the apparatus includes: a processor (processor)801, a memory (memory)802, and a bus 803;
the processor 801 and the memory 802 communicate with each other via a bus 803;
the processor 801 is configured to call the program instructions in the memory 802 to execute the information obtaining method provided by the foregoing embodiments, for example, including: respectively acquiring the matching probability between each structured text and the query text, wherein the structured text is obtained by disassembling the explanatory document and is used for describing the information queried by the query text; and sequencing the matching probabilities from large to small, and selecting the structured texts corresponding to the matching probabilities in the preset number as the structured texts matched with the query text.
An embodiment of the present invention provides a non-transitory computer-readable storage medium, where the non-transitory computer-readable storage medium stores computer instructions, and the computer instructions enable a computer to execute the information acquisition method provided in the foregoing embodiment, for example, the method includes: respectively acquiring the matching probability between each structured text and the query text, wherein the structured text is obtained by disassembling the explanatory document and is used for describing the information queried by the query text; and sequencing the matching probabilities from large to small, and selecting the structured texts corresponding to the matching probabilities in the preset number as the structured texts matched with the query text.
Those of ordinary skill in the art will understand that: all or part of the steps for implementing the method embodiments may be implemented by hardware related to program instructions, and the program may be stored in a computer readable storage medium, and when executed, the program performs the steps including the method embodiments; and the aforementioned storage medium includes: various media that can store program codes, such as ROM, RAM, magnetic or optical disks.
The above-described embodiments of the information acquiring apparatus and the like are merely illustrative, and units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. With this understanding in mind, the above technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium, such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute the various embodiments or some parts of the methods of the embodiments.
Finally, the method of the present application is only a preferred embodiment and is not intended to limit the scope of the embodiments of the present invention. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the embodiments of the present invention should be included in the protection scope of the embodiments of the present invention.

Claims (10)

1. An information acquisition method, comprising:
respectively obtaining the matching probability between each structured text and a query text, wherein the structured text is obtained by disassembling an explanatory document, the structured text is used for describing information queried by the query text, the structured text at least comprises a title and a body content, and each structured text corresponds to a function with the minimum function granularity in the explanatory document;
sequencing the matching probabilities from large to small, and selecting the structured texts corresponding to the matching probabilities in the preset number as the structured texts matched with the query text;
the respectively obtaining the matching probability between each structured text and the query text comprises:
for any one structured text, acquiring a first similarity between the query text and a title in the any one structured text and a second similarity between the query text and body content in the any one structured text;
and fusing the basic features, the first similarity and the second similarity between the query text and any one of the structured texts to obtain the matching probability between any one of the structured texts and the query text.
2. The method of claim 1, wherein before obtaining the match probability between each structured document and the query document, further comprising:
inputting a text sequence into a first question recognition model, outputting a correlation degree value between the text sequence and the explanatory document, and if the correlation degree value is greater than a preset threshold value, determining that the content between the query text and the explanatory document is correlated, wherein the text sequence is obtained by segmenting the query text, and the first question recognition model is obtained by training a sample text sequence and the correlation degree value between the sample text sequence and the explanatory document; or,
inputting the text sequence into a second question recognition model, and outputting a correlation result between the query text and the explanatory document, wherein the second question recognition model is obtained by training a sample text sequence and a correlation result between the sample text sequence and the explanatory document, and the text sequence is obtained by segmenting the query text.
3. The method of claim 1, wherein before obtaining the match probability between each structured document and the query document, further comprising:
if the participles in the text sequence need to be replaced, acquiring target words corresponding to the participles needing to be replaced, and updating the query text based on the target words; the word segmentation to be replaced is different from the content of the target word and has the same meaning, and the text sequence is obtained by segmenting the query text.
4. The method of claim 1, wherein the base features comprise at least one of a match score between the query text and any of the structured texts, a weighted match value between the query text and any of the structured texts, and a vector similarity between the query text and any of the structured texts.
5. The method of claim 4, wherein before fusing the base features, the first similarity, and the second similarity between the query text and any of the structured texts, further comprising:
calculating a matching score between each morpheme in the query text and the title, performing morpheme matching on the query text and any one of the structured texts according to the matching score corresponding to each morpheme, and taking the matching score between the query text and any one of the structured texts as one of the basic features; and/or the presence of a gas in the gas,
acquiring the weight of each participle in the query text, acquiring a weighted matching value between the query text and any one of the structured texts according to the weight of each participle and a matching result between each participle and the title, and taking the weighted matching value as one of the basic features; and/or the presence of a gas in the gas,
acquiring a text vector corresponding to the query text according to the weight and the word vector of each participle in the query text, acquiring the vector similarity between the query text and any one of the structured texts according to the text vector and the title weighted vector corresponding to the title, and taking the vector similarity as one of the basic features.
6. The method of claim 5, wherein obtaining the weight of each participle in the query text comprises:
inputting a text sequence into a keyword prediction model, and outputting the weight of each participle in the query text, wherein the keyword prediction model is obtained by training sample participles and weights labeled corresponding to the sample participles, and the text sequence is obtained by participling the query text.
7. The method of claim 5, wherein said calculating a match score between each morpheme in the query text and the headline comprises:
for any morpheme in a query text, obtaining a matching result between a morpheme combination related to the morpheme and the title, wherein the morpheme combination is determined based on adjacent morphemes of the morpheme in the query text;
and calculating the matching score between any morpheme and the title according to the matching result of each morpheme combination.
8. An information acquisition apparatus characterized by comprising:
the first acquisition module is used for respectively acquiring the matching probability between each structured text and a query text, wherein the structured text is obtained by disassembling an explanatory document and is used for describing information queried by the query text, the structured text at least comprises a title and a body content, and each structured text corresponds to a function with the minimum function granularity in the explanatory document;
the selecting module is used for sequencing the matching probabilities from large to small, and selecting the structured texts corresponding to the matching probabilities in the preset number as the structured texts matched with the query text;
the first acquisition module is used for acquiring a first similarity between the query text and a title in any structured text and a second similarity between the query text and body content in any structured text for any structured text; and fusing the basic features, the first similarity and the second similarity between the query text and any one of the structured texts to obtain the matching probability between any one of the structured texts and the query text.
9. An information acquisition apparatus characterized by comprising:
at least one processor; and
at least one memory communicatively coupled to the processor, wherein:
the memory stores program instructions executable by the processor, the processor invoking the program instructions to perform the method of any of claims 1 to 7.
10. A non-transitory computer-readable storage medium storing computer instructions that cause a computer to perform the method of any one of claims 1 to 7.
CN201810550859.8A 2018-05-31 2018-05-31 Information acquisition method and device Active CN108959387B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810550859.8A CN108959387B (en) 2018-05-31 2018-05-31 Information acquisition method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810550859.8A CN108959387B (en) 2018-05-31 2018-05-31 Information acquisition method and device

Publications (2)

Publication Number Publication Date
CN108959387A CN108959387A (en) 2018-12-07
CN108959387B true CN108959387B (en) 2020-09-11

Family

ID=64492772

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810550859.8A Active CN108959387B (en) 2018-05-31 2018-05-31 Information acquisition method and device

Country Status (1)

Country Link
CN (1) CN108959387B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112464656B (en) * 2020-11-30 2024-02-13 中国科学技术大学 Keyword extraction method, keyword extraction device, electronic equipment and storage medium
CN113204630A (en) * 2021-05-31 2021-08-03 平安科技(深圳)有限公司 Text matching method and device, computer equipment and readable storage medium

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2006195667A (en) * 2005-01-12 2006-07-27 Toshiba Corp Structured document search device, structured document search method and structured document search program
CN101566998B (en) * 2009-05-26 2011-12-28 华中师范大学 Chinese question-answering system based on neural network
CN106844587B (en) * 2017-01-11 2019-11-08 北京光年无限科技有限公司 It is a kind of for talking with the data processing method and device of interactive system
CN107688608A (en) * 2017-07-28 2018-02-13 合肥美的智能科技有限公司 Intelligent sound answering method, device, computer equipment and readable storage medium storing program for executing
CN107679224B (en) * 2017-10-20 2020-09-08 竹间智能科技(上海)有限公司 Intelligent question and answer method and system for unstructured text

Also Published As

Publication number Publication date
CN108959387A (en) 2018-12-07

Similar Documents

Publication Publication Date Title
US10922322B2 (en) Systems and methods for speech-based searching of content repositories
CN108733778B (en) Industry type identification method and device of object
CN112800170A (en) Question matching method and device and question reply method and device
CN110543592B (en) Information searching method and device and computer equipment
US9449075B2 (en) Guided search based on query model
CN110502738A (en) Chinese name entity recognition method, device, equipment and inquiry system
US8788503B1 (en) Content identification
CN111832305B (en) User intention recognition method, device, server and medium
CN109815318A (en) The problems in question answering system answer querying method, system and computer equipment
US11416534B2 (en) Classification of electronic documents
CN113569011B (en) Training method, device and equipment of text matching model and storage medium
CN115440221B (en) Vehicle-mounted intelligent voice interaction method and system based on cloud computing
CN108959387B (en) Information acquisition method and device
KR101472451B1 (en) System and Method for Managing Digital Contents
CN111881283A (en) Business keyword library creating method, intelligent chat guiding method and device
CN112115709A (en) Entity identification method, entity identification device, storage medium and electronic equipment
CN113177061B (en) Searching method and device and electronic equipment
CN117194647B (en) Intelligent question-answering system, method and device for offline environment
CN113988057A (en) Title generation method, device, equipment and medium based on concept extraction
CN112417174A (en) Data processing method and device
CN115827990B (en) Searching method and device
CN112541051A (en) Standard text matching method and device, storage medium and electronic equipment
CN114661892A (en) Manuscript abstract generation method and device, equipment and storage medium
CN113449094A (en) Corpus obtaining method and device, electronic equipment and storage medium
CN110008307B (en) Method and device for identifying deformed entity based on rules and statistical learning

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant